Model Card: Exoplanet Candidate Classifier (Stacking Ensemble)

Model Details

Model Description

This is a robust machine learning pipeline designed to classify Kepler Objects of Interest (KOIs). It determines whether a detected signal represents a real exoplanet or a false positive.

The model utilizes a Stacking Ensemble architecture, combining the predictions of three powerful gradient boosting frameworks (LightGBM, XGBoost, and CatBoost) and aggregating them using a final LightGBM meta-learner. It is specifically engineered to handle missing data (NaNs) in scientific datasets through a dual-strategy imputation pipeline.

  • Developed by: Darwin Danish
  • Model Type: Scikit-learn Pipeline (StackingClassifier)
  • Input: Tabular data (16 astrophysical features)
  • Output: Multi-class classification (CANDIDATE, CONFIRMED, FALSE POSITIVE)

Model Sources

  • Repository: https://huggingface.co/DarwinDanish/exoplanet-classifier-stacking

Uses

Direct Use

This model is intended for astronomers, data scientists, and space enthusiasts who want to analyze Kepler mission data or similar photometric datasets. It predicts the "disposition" of a celestial object based on its physical properties.

Supported Features (Input)

To use this model, your input DataFrame must contain the following columns (a validation helper is sketched after the lists):

Critical Features:

  • koi_period: Orbital period (days)
  • koi_depth: Transit depth (ppm)
  • koi_prad: Planetary radius (Earth radii)
  • koi_sma: Semi-major axis (AU)
  • koi_teq: Equilibrium temperature (K)
  • koi_insol: Insolation flux (Earth flux units)
  • koi_model_snr: Transit signal-to-noise ratio (dimensionless)

Auxiliary Features:

  • koi_time0bk, koi_duration, koi_incl, koi_srho, koi_srad, koi_smass, koi_steff, koi_slogg, koi_smet
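
Before running inference, it can help to verify the schema up front. The sketch below is a hypothetical convenience helper (REQUIRED_COLS and validate_input are not part of the shipped artifacts); it only checks column presence, since NaN values are handled by the pipeline itself.

import pandas as pd

# Hypothetical helper: check that all 16 expected feature columns are present.
REQUIRED_COLS = [
    'koi_period', 'koi_depth', 'koi_prad', 'koi_sma', 'koi_teq',
    'koi_insol', 'koi_model_snr', 'koi_time0bk', 'koi_duration',
    'koi_incl', 'koi_srho', 'koi_srad', 'koi_smass', 'koi_steff',
    'koi_slogg', 'koi_smet',
]

def validate_input(df: pd.DataFrame) -> None:
    """Raise early if any expected column is absent (NaN values are fine)."""
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"Input is missing required columns: {missing}")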

How to Get Started with the Model

You can load this model directly from the Hugging Face Hub using joblib and huggingface_hub.

1. Installation

pip install huggingface_hub joblib pandas scikit-learn lightgbm xgboost catboost

2. Python Inference Code

import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# 1. Download the model and label encoder
repo_id = "DarwinDanish/exoplanet-classifier-stacking"

model_path = hf_hub_download(repo_id=repo_id, filename="exo_stacking_pipeline.pkl")
encoder_path = hf_hub_download(repo_id=repo_id, filename="exo_label_encoder.pkl")

# 2. Load the artifacts
pipeline = joblib.load(model_path)
label_encoder = joblib.load(encoder_path)

# 3. Create sample data (Example: A likely planet candidate)
# Note: The model handles NaNs, so missing values are allowed.
data = {
    'koi_period': [365.25],
    'koi_depth': [1000.5],
    'koi_prad': [1.02],   # Earth radii
    'koi_sma': [1.0],     # AU
    'koi_teq': [255.0],   # Kelvin
    'koi_insol': [1.0],
    'koi_model_snr': [35.5],
    # Aux features (can be mostly defaults or NaNs)
    'koi_time0bk': [135.0],
    'koi_duration': [4.5],
    'koi_incl': [89.9],
    'koi_srho': [1.0],
    'koi_srad': [1.0],
    'koi_smass': [1.0],
    'koi_steff': [5700],
    'koi_slogg': [4.5],
    'koi_smet': [0.0]
}

df_new = pd.DataFrame(data)

# 4. Predict
prediction_index = pipeline.predict(df_new)
prediction_label = label_encoder.inverse_transform(prediction_index)
probabilities = pipeline.predict_proba(df_new)

print(f"Prediction: {prediction_label[0]}")
print(f"Confidence: {max(probabilities[0]):.4f}")

Training Details

Training Procedure

The model was trained using a robust preprocessing pipeline followed by a Stacking Classifier.

1. Preprocessing

The pipeline splits features into two groups with different imputation strategies (sketched in code after the list):

  • Critical Features: Missing values filled with constant -999. Scaled via StandardScaler.
  • Auxiliary Features: Missing values filled with the median. Scaled via StandardScaler.
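
A minimal sketch of how this dual-strategy preprocessor can be expressed with scikit-learn's ColumnTransformer; the step names and exact construction are assumptions based on the description above, not the original training code.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

critical_cols = ['koi_period', 'koi_depth', 'koi_prad', 'koi_sma',
                 'koi_teq', 'koi_insol', 'koi_model_snr']
aux_cols = ['koi_time0bk', 'koi_duration', 'koi_incl', 'koi_srho',
            'koi_srad', 'koi_smass', 'koi_steff', 'koi_slogg', 'koi_smet']

preprocessor = ColumnTransformer([
    # Critical features: the sentinel -999 marks missingness explicitly.
    ('critical', Pipeline([
        ('impute', SimpleImputer(strategy='constant', fill_value=-999)),
        ('scale', StandardScaler()),
    ]), critical_cols),
    # Auxiliary features: median imputation is the safer default here.
    ('aux', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), aux_cols),
])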

2. Architecture

  • Level 0 (Base Learners):
    • LightGBM: (500 estimators, GPU accelerated)
    • XGBoost: (500 estimators, Histogram tree method, GPU accelerated)
    • CatBoost: (500 estimators, Depth 8, GPU accelerated)
  • Level 1 (Meta Learner):
    • LightGBM: (200 estimators) - Aggregates the probabilities from Level 0 to make the final decision.
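
The corresponding stack can be assembled as below, reusing the preprocessor sketch from the previous step. Only the hyperparameters listed above are taken from the card; everything else (including the omitted GPU flags) is an assumption.

from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Level 0: three gradient boosting base learners (GPU flags omitted here).
base_learners = [
    ('lgbm', LGBMClassifier(n_estimators=500)),
    ('xgb', XGBClassifier(n_estimators=500, tree_method='hist')),
    ('cat', CatBoostClassifier(n_estimators=500, depth=8, verbose=0)),
]

# Level 1: a smaller LightGBM aggregates the base learners' probabilities.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LGBMClassifier(n_estimators=200),
    stack_method='predict_proba',
)

full_pipeline = Pipeline([('prep', preprocessor), ('stack', stack)])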

Feature Importance

Based on the base learners, the most influential features for classification were identified as follows (an extraction sketch follows the list):

  1. koi_model_snr (Signal-to-Noise Ratio)
  2. koi_prad (Planetary Radius)
  3. koi_depth (Transit Depth)
  4. koi_period (Orbital Period)
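
If you want to reproduce this ranking from the fitted pipeline, one approach is to read feature_importances_ from each fitted base learner. The step name 'stack' below is an assumption about the pipeline's internal naming.

# Assumes the loaded pipeline exposes the stacker under the step name 'stack'.
stacker = pipeline.named_steps['stack']
for name, est in stacker.named_estimators_.items():
    # Each gradient boosting model exposes per-feature importance scores.
    print(name, est.feature_importances_)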

Evaluation Results

The model was evaluated on a held-out test set (20% of the data) created with a stratified split; a sketch of this protocol follows the results below.

  • Accuracy: roughly 90% or higher, depending on the specific test split
  • Precision/Recall: high precision in separating False Positives from Candidates
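
A sketch of this evaluation protocol, assuming a feature matrix X, encoded labels y, and the fitted label_encoder are available, with full_pipeline taken from the architecture sketch above:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# 80/20 stratified split preserves class proportions in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred,
                            target_names=label_encoder.classes_))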

Bias, Risks, and Limitations

  • Data Specificity: This model is trained specifically on Kepler mission data. It may not generalize well to data from TESS or JWST without fine-tuning, as the instrumentation and noise profiles differ.
  • Class Imbalance: Depending on the dataset version, "False Positives" are often more numerous than "Confirmed" planets, which can bias the model slightly toward false positive predictions in low-SNR ranges.

Environmental Impact

  • Compute: Trained on GPU (NVIDIA T4/P100 class) via Kaggle Kernels.
  • Training Time: < 5 minutes due to GPU acceleration and efficient gradient boosting implementations.
