Model Card: Exoplanet Candidate Classifier (Stacking Ensemble)

Model Details

Model Description

This is a robust machine learning pipeline designed to classify Kepler Objects of Interest (KOIs). It determines whether a detected signal represents a real exoplanet or a false positive.

The model utilizes a Stacking Ensemble architecture, combining the predictions of three powerful gradient boosting frameworks (LightGBM, XGBoost, and CatBoost) and aggregating them using a final LightGBM meta-learner. It is specifically engineered to handle missing data (NaNs) in scientific datasets through a dual-strategy imputation pipeline.

  • Developed by: Darwin Danish
  • Model Type: Scikit-learn Pipeline (StackingClassifier)
  • Input: Tabular data (16 astrophysical features)
  • Output: Multi-class classification (CANDIDATE, CONFIRMED, FALSE POSITIVE)

Model Sources

  • Repository: https://huggingface.co/DarwinDanish/exoplanet-classifier-stacking

Uses

Direct Use

This model is intended for astronomers, data scientists, and space enthusiasts who want to analyze Kepler mission data or similar photometric datasets. It predicts the "disposition" of a celestial object based on its physical properties.

Supported Features (Input)

To use this model, your input DataFrame must contain the following columns (a validation helper is sketched after the lists):

Critical Features:

  • koi_period: Orbital period (days)
  • koi_depth: Transit depth (ppm)
  • koi_prad: Planetary radius (Earth radii)
  • koi_sma: Semi-major axis (AU)
  • koi_teq: Equilibrium temperature (K)
  • koi_insol: Insolation flux (Earth flux units)
  • koi_model_snr: Transit signal-to-noise ratio (dimensionless)

Auxiliary Features:

  • koi_time0bk, koi_duration, koi_incl, koi_srho, koi_srad, koi_smass, koi_steff, koi_slogg, koi_smet
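
Before running inference, it can help to verify the schema up front. The sketch below is a hypothetical convenience helper (REQUIRED_COLS and validate_input are not part of the shipped artifacts); it only checks column presence, since NaN values are handled by the pipeline itself.

import pandas as pd

# Hypothetical helper: check that all 16 expected feature columns are present.
REQUIRED_COLS = [
    'koi_period', 'koi_depth', 'koi_prad', 'koi_sma', 'koi_teq',
    'koi_insol', 'koi_model_snr', 'koi_time0bk', 'koi_duration',
    'koi_incl', 'koi_srho', 'koi_srad', 'koi_smass', 'koi_steff',
    'koi_slogg', 'koi_smet',
]

def validate_input(df: pd.DataFrame) -> None:
    """Raise early if any expected column is absent (NaN values are fine)."""
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"Input is missing required columns: {missing}")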

How to Get Started with the Model

You can load this model directly from the Hugging Face Hub using joblib and huggingface_hub.

1. Installation

pip install huggingface_hub joblib pandas scikit-learn lightgbm xgboost catboost

2. Python Inference Code

import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# 1. Download the model and label encoder
repo_id = "DarwinDanish/exoplanet-classifier-stacking"

model_path = hf_hub_download(repo_id=repo_id, filename="exo_stacking_pipeline.pkl")
encoder_path = hf_hub_download(repo_id=repo_id, filename="exo_label_encoder.pkl")

# 2. Load the artifacts
pipeline = joblib.load(model_path)
label_encoder = joblib.load(encoder_path)

# 3. Create sample data (Example: A likely planet candidate)
# Note: The model handles NaNs, so missing values are allowed.
data = {
    'koi_period': [365.25],
    'koi_depth': [1000.5],
    'koi_prad': [1.02],   # Earth radii
    'koi_sma': [1.0],     # AU
    'koi_teq': [255.0],   # Kelvin
    'koi_insol': [1.0],
    'koi_model_snr': [35.5],
    # Aux features (can be mostly defaults or NaNs)
    'koi_time0bk': [135.0],
    'koi_duration': [4.5],
    'koi_incl': [89.9],
    'koi_srho': [1.0],
    'koi_srad': [1.0],
    'koi_smass': [1.0],
    'koi_steff': [5700],
    'koi_slogg': [4.5],
    'koi_smet': [0.0]
}

df_new = pd.DataFrame(data)

# 4. Predict
prediction_index = pipeline.predict(df_new)
prediction_label = label_encoder.inverse_transform(prediction_index)
probabilities = pipeline.predict_proba(df_new)

print(f"Prediction: {prediction_label[0]}")
print(f"Confidence: {max(probabilities[0]):.4f}")

Training Details

Training Procedure

The model was trained using a robust preprocessing pipeline followed by a Stacking Classifier.

1. Preprocessing

The pipeline splits features into two groups with different imputation strategies (sketched in code after the list):

  • Critical Features: Missing values filled with constant -999. Scaled via StandardScaler.
  • Auxiliary Features: Missing values filled with the median. Scaled via StandardScaler.
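
A minimal sketch of how this dual-strategy preprocessor can be expressed with scikit-learn's ColumnTransformer; the step names and exact construction are assumptions based on the description above, not the original training code.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

critical_cols = ['koi_period', 'koi_depth', 'koi_prad', 'koi_sma',
                 'koi_teq', 'koi_insol', 'koi_model_snr']
aux_cols = ['koi_time0bk', 'koi_duration', 'koi_incl', 'koi_srho',
            'koi_srad', 'koi_smass', 'koi_steff', 'koi_slogg', 'koi_smet']

preprocessor = ColumnTransformer([
    # Critical features: the sentinel -999 marks missingness explicitly.
    ('critical', Pipeline([
        ('impute', SimpleImputer(strategy='constant', fill_value=-999)),
        ('scale', StandardScaler()),
    ]), critical_cols),
    # Auxiliary features: median imputation is the safer default here.
    ('aux', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), aux_cols),
])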

2. Architecture

  • Level 0 (Base Learners):
    • LightGBM: (500 estimators, GPU accelerated)
    • XGBoost: (500 estimators, Histogram tree method, GPU accelerated)
    • CatBoost: (500 estimators, Depth 8, GPU accelerated)
  • Level 1 (Meta Learner):
    • LightGBM: (200 estimators) - Aggregates the probabilities from Level 0 to make the final decision.
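
The corresponding stack can be assembled as below, reusing the preprocessor sketch from the previous step. Only the hyperparameters listed above are taken from the card; everything else (including the omitted GPU flags) is an assumption.

from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Level 0: three gradient boosting base learners (GPU flags omitted here).
base_learners = [
    ('lgbm', LGBMClassifier(n_estimators=500)),
    ('xgb', XGBClassifier(n_estimators=500, tree_method='hist')),
    ('cat', CatBoostClassifier(n_estimators=500, depth=8, verbose=0)),
]

# Level 1: a smaller LightGBM aggregates the base learners' probabilities.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LGBMClassifier(n_estimators=200),
    stack_method='predict_proba',
)

full_pipeline = Pipeline([('prep', preprocessor), ('stack', stack)])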

Feature Importance

Based on the base learners, the most influential features for classification were identified as follows (an extraction sketch follows the list):

  1. koi_model_snr (Signal-to-Noise Ratio)
  2. koi_prad (Planetary Radius)
  3. koi_depth (Transit Depth)
  4. koi_period (Orbital Period)
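
If you want to reproduce this ranking from the fitted pipeline, one approach is to read feature_importances_ from each fitted base learner. The step name 'stack' below is an assumption about the pipeline's internal naming.

# Assumes the loaded pipeline exposes the stacker under the step name 'stack'.
stacker = pipeline.named_steps['stack']
for name, est in stacker.named_estimators_.items():
    # Each gradient boosting model exposes per-feature importance scores.
    print(name, est.feature_importances_)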

Evaluation Results

The model was evaluated on a held-out test set (20% of the data) created with a stratified split; a sketch of this protocol follows the results below.

  • Accuracy: roughly 90% or higher, depending on the specific test split
  • Precision/Recall: high precision in separating False Positives from Candidates
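
A sketch of this evaluation protocol, assuming a feature matrix X, encoded labels y, and the fitted label_encoder are available, with full_pipeline taken from the architecture sketch above:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# 80/20 stratified split preserves class proportions in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred,
                            target_names=label_encoder.classes_))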

Bias, Risks, and Limitations

  • Data Specificity: This model is trained specifically on Kepler mission data. It may not generalize well to data from TESS or JWST without fine-tuning, as the instrumentation and noise profiles differ.
  • Class Imbalance: Depending on the dataset version, "False Positives" are often more numerous than "Confirmed" planets, which can bias the model slightly toward false positive predictions in low-SNR ranges.

Environmental Impact

  • Compute: Trained on GPU (NVIDIA T4/P100 class) via Kaggle Kernels.
  • Training Time: < 5 minutes due to GPU acceleration and efficient gradient boosting implementations.
