# Model Card: Exoplanet Candidate Classifier (Stacking Ensemble)

## Model Details

### Model Description
This is a robust machine learning pipeline designed to classify Kepler Objects of Interest (KOIs). It determines whether a detected signal represents a real exoplanet or a false positive.
The model utilizes a Stacking Ensemble architecture, combining the predictions of three powerful gradient boosting frameworks (LightGBM, XGBoost, and CatBoost) and aggregating them using a final LightGBM meta-learner. It is specifically engineered to handle missing data (NaNs) in scientific datasets through a dual-strategy imputation pipeline.
- Developed by: Darwin Danish
- Model Type: Scikit-learn Pipeline (StackingClassifier)
- Input: Tabular data (16 astrophysical features)
- Output: Multi-class classification (CANDIDATE, CONFIRMED, FALSE POSITIVE)
### Model Sources
- Repository: https://huggingface.co/DarwinDanish/exoplanet-classifier-stacking
- Dataset Source: NASA Kepler Object of Interest (KOI) Table
## Uses

### Direct Use
This model is intended for astronomers, data scientists, and space enthusiasts who want to analyze Kepler mission data or similar photometric datasets. It predicts the "disposition" of a celestial object based on its physical properties.
### Supported Features (Input)
To use this model, your input DataFrame must contain the following columns:
Critical Features:
- `koi_period`: Orbital period
- `koi_depth`: Transit depth
- `koi_prad`: Planetary radius
- `koi_sma`: Semi-major axis
- `koi_teq`: Equilibrium temperature
- `koi_insol`: Insolation flux
- `koi_model_snr`: Signal-to-noise ratio
Auxiliary Features:
`koi_time0bk`, `koi_duration`, `koi_incl`, `koi_srho`, `koi_srad`, `koi_smass`, `koi_steff`, `koi_slogg`, `koi_smet`
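Before calling the model, it can help to verify that all sixteen columns are present. A minimal sketch (the `REQUIRED_COLUMNS` list mirrors the features above; the `check_input` helper is illustrative, not part of the published pipeline):

```python
import pandas as pd

# Illustrative column list, mirroring the feature tables documented above.
REQUIRED_COLUMNS = [
    # Critical features
    "koi_period", "koi_depth", "koi_prad", "koi_sma",
    "koi_teq", "koi_insol", "koi_model_snr",
    # Auxiliary features
    "koi_time0bk", "koi_duration", "koi_incl", "koi_srho",
    "koi_srad", "koi_smass", "koi_steff", "koi_slogg", "koi_smet",
]

def check_input(df: pd.DataFrame) -> None:
    """Raise if any expected feature column is missing (NaN values are fine)."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Input DataFrame is missing columns: {missing}")
```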
## How to Get Started with the Model
You can load this model directly from the Hugging Face Hub using joblib and huggingface_hub.
### 1. Installation
```bash
pip install huggingface_hub joblib pandas scikit-learn lightgbm xgboost catboost
```
### 2. Python Inference Code
```python
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# 1. Download the model and label encoder
repo_id = "DarwinDanish/exoplanet-classifier-stacking"
model_path = hf_hub_download(repo_id=repo_id, filename="exo_stacking_pipeline.pkl")
encoder_path = hf_hub_download(repo_id=repo_id, filename="exo_label_encoder.pkl")

# 2. Load the artifacts
pipeline = joblib.load(model_path)
label_encoder = joblib.load(encoder_path)

# 3. Create sample data (example: a likely planet candidate)
# Note: the model handles NaNs, so missing values are allowed.
data = {
    'koi_period': [365.25],
    'koi_depth': [1000.5],
    'koi_prad': [1.02],    # Earth radii
    'koi_sma': [1.0],      # AU
    'koi_teq': [255.0],    # Kelvin
    'koi_insol': [1.0],
    'koi_model_snr': [35.5],
    # Auxiliary features (can be mostly defaults or NaNs)
    'koi_time0bk': [135.0],
    'koi_duration': [4.5],
    'koi_incl': [89.9],
    'koi_srho': [1.0],
    'koi_srad': [1.0],
    'koi_smass': [1.0],
    'koi_steff': [5700],
    'koi_slogg': [4.5],
    'koi_smet': [0.0],
}
df_new = pd.DataFrame(data)

# 4. Predict
prediction_index = pipeline.predict(df_new)
prediction_label = label_encoder.inverse_transform(prediction_index)
probabilities = pipeline.predict_proba(df_new)

print(f"Prediction: {prediction_label[0]}")
print(f"Confidence: {max(probabilities[0]):.4f}")
```
## Training Details

### Training Procedure
The model was trained using a robust preprocessing pipeline followed by a Stacking Classifier.
#### 1. Preprocessing
The pipeline splits features into two groups with different imputation strategies (sketched below):
- Critical Features: Missing values filled with the constant `-999`, then scaled via `StandardScaler`.
- Auxiliary Features: Missing values filled with the median, then scaled via `StandardScaler`.
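A minimal sketch of this dual-strategy imputation using scikit-learn's `ColumnTransformer` (the feature groupings follow the tables above; the exact step names in the published pipeline may differ):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

critical = ["koi_period", "koi_depth", "koi_prad", "koi_sma",
            "koi_teq", "koi_insol", "koi_model_snr"]
auxiliary = ["koi_time0bk", "koi_duration", "koi_incl", "koi_srho",
             "koi_srad", "koi_smass", "koi_steff", "koi_slogg", "koi_smet"]

preprocessor = ColumnTransformer([
    # Critical features: a sentinel fill keeps "missing" distinguishable from real values.
    ("critical", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value=-999)),
        ("scale", StandardScaler()),
    ]), critical),
    # Auxiliary features: median fill is a neutral default for skewed distributions.
    ("auxiliary", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), auxiliary),
])
```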
#### 2. Architecture
- Level 0 (Base Learners):
- LightGBM: (500 estimators, GPU accelerated)
- XGBoost: (500 estimators, Histogram tree method, GPU accelerated)
- CatBoost: (500 estimators, Depth 8, GPU accelerated)
- Level 1 (Meta Learner):
- LightGBM: (200 estimators) - aggregates the class probabilities from Level 0 into the final decision (see the sketch below).
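A minimal sketch of this stack with scikit-learn's `StackingClassifier` (hyperparameters follow the list above; the GPU flags used in training are noted in comments rather than enabled, since they vary by library version and hardware):

```python
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

stack = StackingClassifier(
    estimators=[
        ("lgbm", LGBMClassifier(n_estimators=500)),                     # device="gpu" for GPU runs
        ("xgb", XGBClassifier(n_estimators=500, tree_method="hist")),   # device="cuda" for GPU runs
        ("cat", CatBoostClassifier(n_estimators=500, depth=8, verbose=0)),  # task_type="GPU"
    ],
    # The meta-learner consumes the base learners' class probabilities.
    final_estimator=LGBMClassifier(n_estimators=200),
    stack_method="predict_proba",
)

# Full pipeline: preprocessing (see the previous sketch) followed by the stack.
model = Pipeline([("preprocess", preprocessor), ("stack", stack)])
```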
### Feature Importance
Based on the base learners, the most critical features for classification were identified as:
- `koi_model_snr` (signal-to-noise ratio)
- `koi_prad` (planetary radius)
- `koi_depth` (transit depth)
- `koi_period` (orbital period)
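One way to reproduce such a ranking on your own split is permutation importance, which works on the whole fitted pipeline (here `X_test` and `y_test` are an assumed held-out feature frame and its encoded labels):

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure the drop in score.
result = permutation_importance(pipeline, X_test, y_test, n_repeats=5, random_state=0)
ranking = sorted(zip(X_test.columns, result.importances_mean), key=lambda t: -t[1])
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```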
## Evaluation Results
The model was evaluated on a held-out test set (20% of data) using stratified splitting.
- Accuracy: approximately 90% on the held-out set (the exact figure depends on the specific split).
- Precision/Recall: High precision in distinguishing false positives from candidates.
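A sketch of this evaluation protocol (assuming `X` is the KOI feature table, `y` the encoder-transformed disposition labels, and `model` the pipeline from the architecture sketch; the random seed is illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Stratified 80/20 split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test),
                            target_names=label_encoder.classes_))
```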
## Bias, Risks, and Limitations
- Data Specificity: This model is trained specifically on Kepler mission data. It may not generalize well to data from TESS or JWST without fine-tuning, as the instrumentation and noise profiles differ.
- Class Imbalance: Depending on the dataset version, "False Positives" are often more numerous than "Confirmed" planets, which can bias the model slightly toward false positive predictions in low-SNR ranges.
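If the imbalance matters for your use case, one common mitigation (not part of the published training run) is weighting classes inversely to their frequency:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# y: encoded training labels, as above.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(label_encoder.classes_, np.round(weights, 3))))
```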
## Environmental Impact
- Compute: Trained on GPU (NVIDIA T4/P100 class) via Kaggle Kernels.
- Training Time: < 5 minutes due to GPU acceleration and efficient gradient boosting implementations.