# Customer Conversion Predictor
This repository contains a high-performance ensemble model for predicting customer conversion in digital marketing campaigns.
The model is a scikit-learn `VotingClassifier` that combines the strengths of four finely tuned gradient boosting models: CatBoost, XGBoost, LightGBM, and scikit-learn's `GradientBoostingClassifier`. It was trained on a digital marketing dataset to identify customers with a high likelihood of making a purchase.
## 🧾 Model Details
- Model Type: Tabular Classification Ensemble
- Libraries Used: Scikit-learn, CatBoost, XGBoost, LightGBM, Optuna
- Primary Use: Predicts a binary outcome (convert or not convert) based on customer demographic and engagement data.
## 🎯 Intended Use
This model is intended for use by marketing teams to optimize campaign performance. The primary use cases are:
- Audience Segmentation: Identify and target high-value customers who are most likely to convert.
- Budget Allocation: Optimize Return on Ad Spend (ROAS) by focusing resources on effective channels and audiences.
- Campaign Strategy: Gain insights into the factors that drive conversion to inform marketing strategy.
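For the campaign-strategy use case, one model-agnostic way to surface the factors that drive conversion is permutation importance on a labeled hold-out set. A minimal sketch, assuming hypothetical `X_test`/`y_test` data preprocessed as shown in the How to Use section below (not part of this repository):

```python
# Illustrative only: X_test / y_test are assumed to be a preprocessed,
# labeled hold-out set (see preprocess_for_prediction in How to Use below).
import joblib
import pandas as pd
from sklearn.inspection import permutation_importance

model = joblib.load("conversion_prediction_model.pkl")

# Shuffle each feature in turn and measure the drop in F1 to estimate
# how much the ensemble relies on that feature.
result = permutation_importance(
    model, X_test, y_test, scoring="f1", n_repeats=10, random_state=42
)
importances = pd.Series(result.importances_mean, index=X_test.columns)
print(importances.sort_values(ascending=False).head(10))
```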
## Limitations & Bias
- Generalization: The model's performance is dependent on the characteristics of the training data. It may not generalize well to marketing campaigns in different industries, geographical locations, or with vastly different customer demographics without being re-trained.
- Data Bias: The training data may contain inherent biases (e.g., historical targeting of certain age groups or income levels). The model will learn and potentially amplify these biases. It is crucial to perform a bias audit before deploying this model in a live production environment.
- Temporal Drift: Customer behavior changes over time. The model's performance may degrade, and it should be periodically re-evaluated and re-trained on newer data.
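To act on the temporal-drift caveat, the model can be periodically re-scored on a recent labeled sample and compared against the original test metrics. A minimal sketch, assuming a hypothetical `recent_labeled.csv` with the raw feature columns plus a `Conversion` label, and reusing `preprocess_for_prediction` from the How to Use section below (the 0.90 alert threshold is illustrative):

```python
# Illustrative drift check: the file name and threshold are placeholders.
import joblib
import pandas as pd
from sklearn.metrics import f1_score

model = joblib.load("conversion_prediction_model.pkl")
recent = pd.read_csv("recent_labeled.csv")

# Reuse the per-row preprocessing defined in the How to Use section below.
features = recent.drop(columns=["Conversion"])
X_recent = pd.concat(
    [preprocess_for_prediction(row.to_dict()) for _, row in features.iterrows()],
    ignore_index=True,
)

f1 = f1_score(recent["Conversion"], model.predict(X_recent))
print(f"F1 on recent data: {f1:.4f}")
if f1 < 0.90:  # illustrative threshold; set to your own tolerance
    print("⚠️ Performance has degraded; re-train on newer data.")
```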
## 🚀 How to Use
This model was saved with `joblib` and can be loaded for inference as shown below.

**Important:** The input data for prediction must have exactly the same columns and preprocessing as the data used to train the model. This includes the creation of engineered features like `EngagementScore`.
```python
import joblib
import numpy as np
import pandas as pd

# --- 1. Load the Trained Model ---
# This assumes 'conversion_prediction_model.pkl' is in the same directory.
try:
    model = joblib.load("conversion_prediction_model.pkl")
    print("✅ Model loaded successfully.")
except FileNotFoundError:
    print("❌ Error: Ensure 'conversion_prediction_model.pkl' is in the same directory as this script.")
    raise SystemExit(1)

# --- 2. Define the Columns the Model Was Trained On ---
# This list MUST be exactly the same as the columns from your training data (X_train.columns).
# It includes the original, engineered, and one-hot encoded columns in the correct order.
TRAINING_COLUMNS = [
    'Age', 'Income', 'WebsiteVisits', 'TimeOnSite', 'PagesPerVisit', 'AdSpend',
    'EmailSubscriptions', 'SocialMediaEngagement', 'PreviousPurchases',
    'LoyaltyPoints', 'EngagementScore', 'CostPerVisit', 'Gender_Female',
    'Gender_Male', 'DeviceType_Desktop', 'DeviceType_Mobile',
    'TrafficSource_Organic', 'TrafficSource_Paid', 'TrafficSource_Referral',
    'AgeGroup_Adult', 'AgeGroup_Senior', 'AgeGroup_Young',
    'IncomeTier_High', 'IncomeTier_Low', 'IncomeTier_Medium', 'IncomeTier_Very High'
]

def preprocess_for_prediction(raw_data_dict):
    """
    Takes a dictionary of raw customer data and preprocesses it for the model.
    """
    # Convert the dictionary to a single-row DataFrame
    df = pd.DataFrame([raw_data_dict])

    # --- Step A: Feature Engineering ---
    # Create 'EngagementScore'
    df['EngagementScore'] = df['TimeOnSite'] * df['PagesPerVisit']
    # Create 'CostPerVisit' and handle potential division by zero
    df['CostPerVisit'] = (df['AdSpend'] / df['WebsiteVisits']).replace([np.inf, -np.inf], 0).fillna(0)

    # --- Step B: Binning for Age and Income ---
    # AgeGroup bins; any label not present in TRAINING_COLUMNS
    # (e.g. 'Very Senior') is dropped during column alignment in Step D.
    age_bins = [0, 25, 45, 60, np.inf]
    age_labels = ['Young', 'Adult', 'Senior', 'Very Senior']
    df['AgeGroup'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False)
    # IncomeTier bins (using quartiles as an example)
    income_bins = [0, 45000, 85000, 120000, np.inf]
    income_labels = ['Low', 'Medium', 'High', 'Very High']
    df['IncomeTier'] = pd.cut(df['Income'], bins=income_bins, labels=income_labels, right=False)

    # --- Step C: One-Hot Encoding ---
    # Use pd.get_dummies for the categorical columns (dtype=int keeps the
    # indicator columns numeric rather than boolean)
    df = pd.get_dummies(df, columns=['Gender', 'DeviceType', 'TrafficSource', 'AgeGroup', 'IncomeTier'], dtype=int)

    # --- Step D: Align Columns with Training Data ---
    # Reindexing adds any missing one-hot encoded columns (filled with 0),
    # drops unseen ones, and enforces the exact column order the model
    # was trained on.
    aligned_df = df.reindex(columns=TRAINING_COLUMNS, fill_value=0)
    return aligned_df

# --- 3. Create a Sample Raw Data Point ---
# This dictionary represents a single new customer in its original format.
new_customer_data = {
    'Age': 38,
    'Gender': 'Male',
    'Income': 78000.0,
    'WebsiteVisits': 15,
    'TimeOnSite': 18.2,
    'PagesPerVisit': 4.5,
    'AdSpend': 150.0,
    'EmailSubscriptions': 1,
    'SocialMediaEngagement': 450,
    'PreviousPurchases': 2,
    'LoyaltyPoints': 1250,
    'DeviceType': 'Desktop',
    'TrafficSource': 'Organic'
}

# --- 4. Preprocess the Data and Make a Prediction ---
# Pass the raw data through the complete preprocessing pipeline
processed_input = preprocess_for_prediction(new_customer_data)

# Make the prediction using the fully preprocessed data
prediction = model.predict(processed_input)
prediction_proba = model.predict_proba(processed_input)

# --- 5. Display the Result ---
print("\n--- Prediction Results ---")
print(f"Input Data: {new_customer_data}")
if prediction[0] == 1:
    print("\n🔮 Prediction: Customer WILL CONVERT")
else:
    print("\n🔮 Prediction: Customer WILL NOT CONVERT")
print(f"Confidence Score (Probability of Conversion): {prediction_proba[0][1]:.2%}")
```
## 🏋️‍♀️ Training Procedure
The model was trained using a comprehensive pipeline:
- Data Cleaning: Handled missing values and removed outliers using an Isolation Forest.
- Feature Engineering: Created new features (`EngagementScore`, `CostPerVisit`) and binned numerical features (`Age`, `Income`) to capture non-linear patterns. Categorical features were one-hot encoded.
- Hyperparameter Tuning: Used Optuna to perform an extensive search for the optimal hyperparameters of the CatBoost, XGBoost, LightGBM, and Gradient Boosting models (see the sketch after this list).
- Ensemble Construction: Combined the four tuned models into a single, robust `VotingClassifier`.
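The training notebook itself is not reproduced in this card, but the steps above map onto standard APIs. A minimal sketch of the cleaning, tuning, and ensembling stages, assuming `X_train`/`y_train` are the preprocessed features and labels; the search space, trial count, placeholder hyperparameters, and soft voting are illustrative assumptions, not the settings actually used:

```python
# Illustrative sketch of the training pipeline; search spaces, trial counts,
# and hyperparameters are placeholders, not the values used for this model.
import optuna
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import GradientBoostingClassifier, IsolationForest, VotingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# X_train / y_train: preprocessed feature matrix and labels (assumed available).
# Outlier removal with an Isolation Forest, as in the Data Cleaning step.
mask = IsolationForest(random_state=42).fit_predict(X_train) == 1
X_clean, y_clean = X_train[mask], y_train[mask]

# Optuna study for one of the four models (XGBoost shown here).
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    }
    clf = XGBClassifier(**params, eval_metric="logloss")
    return cross_val_score(clf, X_clean, y_clean, cv=5, scoring="f1").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

# Voting ensemble of the four tuned models (soft voting assumed here).
ensemble = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier(**study.best_params, eval_metric="logloss")),
        ("cat", CatBoostClassifier(verbose=0)),   # tuned analogously
        ("lgbm", LGBMClassifier()),               # tuned analogously
        ("gb", GradientBoostingClassifier()),     # tuned analogously
    ],
    voting="soft",
)
ensemble.fit(X_clean, y_clean)
```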
## 📊 Evaluation Results
The final ensemble model achieved the following performance on the hold-out test set:
| Metric | Score |
|---|---|
| Accuracy | 92.21% |
| F1-Score (Conversion) | 0.9569 |
| Precision (Conversion) | 0.9326 |
| Recall (Conversion) | 0.9821 |
These results indicate a strong ability both to identify the customers who will convert (high recall) and to avoid flagging those who will not (high precision).