# Human vs. AI Text Classifier

## Model Description
A comprehensive ensemble-based text classification system that distinguishes between human-written and AI-generated text with high accuracy. This implementation combines traditional machine learning (Logistic Regression, Random Forest, SVM, XGBoost) and deep learning approaches (BiLSTM with Attention, BERT) using advanced ensemble techniques.
**Key Features:**
- 6 diverse classifiers (4 traditional ML + 2 deep learning)
- 5,015-dimensional hybrid feature space (5,000 TF-IDF + 15 linguistic features)
- 4 ensemble strategies (Hard/Soft Voting, Weighted Average, Stacking)
- 99.59% F1-score with weighted ensemble
- Balanced performance (99.59% precision, recall, and accuracy)
- Trained on 52,452 samples from HC3 dataset
## Model Architecture

```
Input Text
    │
[Feature Engineering]
├── TF-IDF Vectorization (5,000 features)
│     - Unigrams & Bigrams
│     - Max DF: 0.95, Min DF: 2
│
└── Linguistic Features (15 features)
      - Text length, word count, sentence count
      - Lexical diversity (TTR)
      - Stopword/punctuation ratios
      - Statistical text properties
    │
Multi-Modal Feature Vector (5,015 dimensions)
    │
┌───────────────────────────────────────────────┐
│           Base Classifiers (Parallel)         │
├──────────────────────────┬────────────────────┤
│ Traditional ML           │ Deep Learning      │
├──────────────────────────┼────────────────────┤
│ • Logistic Regression    │ • BERT             │
│ • Random Forest (200)    │   (bert-base)      │
│ • SVM (RBF kernel)       │ • BiLSTM+Attention │
│ • XGBoost (200 trees)    │   (64 units)       │
└──────────────────────────┴────────────────────┘
    │
[Ensemble Aggregation]
├── Hard Voting (Majority vote)
├── Soft Voting (Probability averaging)
├── Weighted Average (Optimized weights)
└── Stacking (Meta-learner: Logistic Regression)
    │
Final Prediction: Human (0) or AI (1)
```
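As a rough illustration of the aggregation stage, three of the four strategies reduce to simple operations over the base models' class probabilities (stacking trains a meta-learner and is sketched under Training Details). The sketch below assumes a `probas` array of stacked `predict_proba` outputs; the helper names are illustrative, not the repository's API.

```python
import numpy as np

# probas: array of shape (n_models, n_samples, 2) stacking each
# base model's predict_proba output.

def hard_vote(probas):
    """Majority vote over each model's predicted label."""
    votes = probas.argmax(axis=-1)                 # (n_models, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)

def soft_vote(probas):
    """Argmax of the unweighted mean of class probabilities."""
    return probas.mean(axis=0).argmax(axis=-1)

def weighted_average(probas, weights):
    """Argmax of the weighted mean of class probabilities (weights sum to 1)."""
    w = np.asarray(weights)[:, None, None]
    return (probas * w).sum(axis=0).argmax(axis=-1)
```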
**Individual Model Specifications:**
| Model | Type | Parameters | Key Configuration |
|---|---|---|---|
| Logistic Regression | Linear | 5,015 | C=1.0, L2 regularization, LBFGS solver |
| Random Forest | Ensemble Trees | - | 200 estimators, unlimited depth |
| SVM | Kernel Method | - | RBF kernel, C=1.0, gamma=scale |
| XGBoost | Gradient Boosting | - | 200 trees, depth=7, LR=0.1 |
| BiLSTM | Recurrent NN | ~500K | 64 units/dir, attention, dropout=0.5 |
| BERT | Transformer | 110M | bert-base-uncased, max_len=128 |
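For reference, a minimal sketch of how the four traditional models could be instantiated with the hyperparameters from the table, using the scikit-learn and xgboost APIs (`max_iter` is an added assumption, not listed above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

lr_model = LogisticRegression(C=1.0, penalty="l2", solver="lbfgs", max_iter=1000)
rf_model = RandomForestClassifier(n_estimators=200, max_depth=None)    # unlimited depth
svm_model = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)  # probabilities needed for soft voting
xgb_model = XGBClassifier(n_estimators=200, max_depth=7, learning_rate=0.1)
```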
## Performance

### Individual Models
| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| XGBoost | 0.9903 | 0.9838 | 0.9970 | 0.9904 | 0.9994 |
| Logistic Regression | 0.9897 | 0.9827 | 0.9970 | 0.9898 | 0.9996 |
| SVM | 0.9867 | 0.9807 | 0.9929 | 0.9867 | 0.9991 |
| BERT | 0.9727 | 0.9510 | 0.9967 | 0.9733 | 0.9975 |
| BiLSTM | 0.9710 | 0.9668 | 0.9756 | 0.9712 | 0.9963 |
| Random Forest | 0.9573 | 0.9571 | 0.9576 | 0.9573 | 0.9922 |
### Ensemble Methods
| Method | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| **Weighted Average** ⭐ | 0.9959 | 0.9959 | 0.9959 | 0.9959 | 0.9998 |
| Stacking | 0.9956 | 0.9947 | 0.9964 | 0.9956 | 0.9998 |
| Soft Voting | 0.9945 | 0.9937 | 0.9954 | 0.9945 | 0.9998 |
| Hard Voting | 0.9921 | 0.9944 | 0.9898 | 0.9921 | 0.9998 |
**Optimized Ensemble Weights:**
- XGBoost: 0.25
- Logistic Regression: 0.20
- BERT: 0.20
- SVM: 0.15
- Random Forest: 0.10
- BiLSTM: 0.10
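The document does not spell out how these weights were found; one minimal approach consistent with "performance-based weight assignment" is a random search over the weight simplex scored by validation F1. A sketch under that assumption (`val_probas` stacks each model's validation `predict_proba` output):

```python
import numpy as np
from sklearn.metrics import f1_score

def search_weights(val_probas, y_val, n_trials=5000, seed=0):
    """Random search over the weight simplex, keeping the best validation F1."""
    rng = np.random.default_rng(seed)
    best_w, best_f1 = None, -1.0
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(val_probas.shape[0]))  # one weight per model, sums to 1
        preds = (val_probas * w[:, None, None]).sum(axis=0).argmax(axis=-1)
        score = f1_score(y_val, preds)
        if score > best_f1:
            best_w, best_f1 = w, score
    return best_w, best_f1
```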
**Confusion Matrix (Weighted Ensemble):**

```
                   Predicted
                   Human     AI
Actual  Human       3918     16
        AI            16   3918
```
- Total Errors: 32 / 7,868 (0.41%)
- False Positives: 16 (0.20%)
- False Negatives: 16 (0.20%)
## Training Details

**Dataset:**
- Name: HC3 (Human-ChatGPT Comparison Corpus)
- Total Samples: 52,452 balanced pairs
- Training: 36,716 (70%)
- Validation: 7,868 (15%)
- Test: 7,868 (15%)
- Domains: Finance, Medicine, Open QA, Reddit ELI5, Wikipedia CS/AI
- Minimum Length: 50 characters
- Balance: 50-50 (Human-AI)
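A minimal sketch of the stratified 70/15/15 split described above, via two chained `train_test_split` calls (`texts` and `labels` are assumed to hold the cleaned HC3 samples; the `random_state` is illustrative):

```python
from sklearn.model_selection import train_test_split

# texts: list of documents; labels: 0 = human, 1 = AI
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
# -> 36,716 train / 7,868 validation / 7,868 test
```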
**Feature Engineering:**
- TF-IDF: 5,000 dimensions (unigrams + bigrams)
- Linguistic: 15 handcrafted features
- Text statistics (length, word/sentence counts)
- Lexical diversity (Type-Token Ratio)
- Character ratios (stopwords, punctuation, digits, capitals)
- Structural patterns (long/short words, question/exclamation marks)
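A hedged sketch of how this hybrid space could be assembled: the TF-IDF settings match those stated above, while `linguistic_features` computes only a representative subset of the 15 handcrafted features (the full set lives in the repository's `FeatureExtractor`; `train_texts` is an assumed list of training documents):

```python
import string
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF settings as stated above
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2),
                        max_df=0.95, min_df=2)

def linguistic_features(text):
    """A representative subset of the 15 handcrafted features."""
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return np.array([
        len(text),                             # text length
        len(words),                            # word count
        len(sentences),                        # sentence count
        len(set(words)) / max(len(words), 1),  # lexical diversity (TTR)
        sum(c in string.punctuation for c in text) / max(len(text), 1),  # punctuation ratio
        # ...the remaining handcrafted features are omitted for brevity
    ])

X_tfidf = tfidf.fit_transform(train_texts)               # sparse, (n, 5000)
X_ling = np.vstack([linguistic_features(t) for t in train_texts])
X = np.hstack([X_tfidf.toarray(), X_ling])               # dense hybrid matrix
```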
**Training Configuration:**

*Traditional ML Models:*
- Framework: scikit-learn 1.3+
- Cross-validation: 5-fold (for stacking)
- Class balance: Maintained via stratified splitting
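The 5-fold stacking mentioned above could be realized with scikit-learn's `StackingClassifier` for the four traditional models (a sketch; the deep models' probabilities would have to be appended as extra meta-features, which `StackingClassifier` does not handle directly):

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# X_train, y_train: hybrid feature matrix and labels
stack = StackingClassifier(
    estimators=[("lr", lr_model), ("rf", rf_model),
                ("svm", svm_model), ("xgb", xgb_model)],
    final_estimator=LogisticRegression(),  # meta-learner
    cv=5,                                  # 5-fold out-of-fold meta-features
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)
```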
*Deep Learning Models:*
- BiLSTM: 10 epochs (early stopped at 4), batch=64, Adam optimizer (LR=1e-3)
- BERT: 2 epochs, batch=16, AdamW optimizer (LR=2e-5), warmup=500 steps
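For concreteness, the BERT settings above map onto the standard `transformers` fine-tuning setup roughly as follows (a sketch; the training loop and the repository's exact wrapper class are omitted):

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Tokenization at the stated max length of 128
enc = tokenizer("example text", truncation=True, padding="max_length",
                max_length=128, return_tensors="pt")

# AdamW at LR 2e-5 with 500 linear warmup steps, as listed above
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
num_steps = (36716 // 16) * 2  # ~2 epochs over 36,716 samples at batch size 16
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=num_steps)
```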
**Hardware:**
- Training: CPU/GPU compatible
- BiLSTM training time: 3,406 seconds (4 epochs)
- BERT training time: Variable (depends on GPU)
## Usage

### Installation

```bash
git clone https://github.com/huzaifanasir95/Human-vs-AI-Classifier.git
cd Human-vs-AI-Classifier
pip install -r requirements.txt
```
### Download Models

```python
from huggingface_hub import hf_hub_download
import pickle

# Download and load the traditional ML models
ml_models = {}
for model_name in ['logistic_regression', 'random_forest', 'svm', 'xgboost']:
    model_path = hf_hub_download(
        repo_id="huzaifanasirrr/human-vs-ai-text-classifier",
        filename=f"models/{model_name}.pkl"
    )
    with open(model_path, 'rb') as f:
        ml_models[model_name] = pickle.load(f)

# Download the deep learning checkpoints (loaded later with torch / keras)
bert_path = hf_hub_download(
    repo_id="huzaifanasirrr/human-vs-ai-text-classifier",
    filename="models/bert_best.pt"
)
bilstm_path = hf_hub_download(
    repo_id="huzaifanasirrr/human-vs-ai-text-classifier",
    filename="models/bilstm_best.h5"
)
```
### Inference (Weighted Ensemble)

```python
from src.feature_extractor import FeatureExtractor
from src.models.ensemble import WeightedEnsemble

# Initialize feature extractor
feature_extractor = FeatureExtractor(
    max_features=5000,
    ngram_range=(1, 2)
)

# Extract features from text
text = "Your text to classify here..."
features = feature_extractor.extract(text)  # shape: (5015,)

# Load ensemble; model order matches the weights
# (lr_model, rf_model, ... are the models loaded above)
ensemble = WeightedEnsemble(
    models=[lr_model, rf_model, svm_model, xgb_model, bert_model, bilstm_model],
    weights=[0.20, 0.10, 0.15, 0.25, 0.20, 0.10]
)

# Predict
prediction = ensemble.predict(features)
probability = ensemble.predict_proba(features)

if prediction == 0:
    print(f"Human-written (confidence: {probability[0]:.2%})")
else:
    print(f"AI-generated (confidence: {probability[1]:.2%})")
```
### Single Model Inference

```python
# Using XGBoost (best individual model)
xgb_prediction = xgb_model.predict(features.reshape(1, -1))
xgb_proba = xgb_model.predict_proba(features.reshape(1, -1))

print(f"Prediction: {'AI' if xgb_prediction[0] else 'Human'}")
print(f"Confidence: {xgb_proba[0][xgb_prediction[0]]:.2%}")
```
## Key Innovations
- Hybrid Feature Engineering: Combines vocabulary-based TF-IDF with linguistic style features
- Multi-Paradigm Ensemble: Integrates linear models, tree ensembles, kernel methods, and neural networks
- Optimized Weighting: Performance-based weight assignment for ensemble members
- Balanced Performance: Equal precision and recall (99.59%) suggest no systematic bias toward either class
- Domain Diversity: Trained across 5 different text domains for robust generalization
## Feature Importance
Based on XGBoost analysis:
| Feature Type | Importance |
|---|---|
| TF-IDF Features | 89.2% |
| Average Sentence Length | 4.3% |
| Lexical Diversity (TTR) | 2.7% |
| Unique Words Ratio | 1.5% |
| Average Word Length | 1.1% |
| Others | 1.2% |
**Insight:** Vocabulary patterns dominate, but linguistic features provide crucial complementary information.
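These aggregates could be reproduced from the fitted model by splitting `feature_importances_` at the TF-IDF/linguistic boundary (a sketch; `linguistic_feature_names` is an assumed list of the 15 handcrafted feature names):

```python
imp = xgb_model.feature_importances_  # shape (5015,): 5000 TF-IDF + 15 linguistic
print(f"TF-IDF total: {imp[:5000].sum():.1%}")
for name, value in zip(linguistic_feature_names, imp[5000:]):
    print(f"{name}: {value:.1%}")
```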
## Limitations
- Dataset Specificity: Trained on ChatGPT-generated text; may not generalize to other LLMs (GPT-4, Claude, Gemini)
- Domain Dependency: Best performance on domains similar to training data
- Temporal Drift: As LLMs evolve, detection patterns may become obsolete
- Adversarial Vulnerability: Not evaluated against deliberate evasion attempts
- Language: English-only (no multilingual support)
- Computational Cost: Full ensemble requires running 6 models (mitigated by optimized weights)
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{nasir2025humanaiclassifier,
  title={Human vs. AI Text Classification: A Comprehensive Study Using Machine Learning and Deep Learning Approaches},
  author={Nasir, Huzaifa},
  institution={National University of Computer and Emerging Sciences, Pakistan},
  year={2025},
  note={Hugging Face: https://huggingface.co/huzaifanasirrr/human-vs-ai-text-classifier}
}
```
**HC3 Dataset:**
```bibtex
@article{guo2023hc3,
  title={How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection},
  author={Guo, Biyang and Zhang, Xin and Wang, Ziyuan and Jiang, Minqi and Nie, Jinran and Ding, Yuxuan and Yue, Jianwei and Wu, Yupeng},
  journal={arXiv preprint arXiv:2301.07597},
  year={2023}
}
```
## Model Files

- `models/*.pkl` - Traditional ML models (Logistic Regression, Random Forest, SVM, XGBoost)
- `models/bert_best.pt` - Fine-tuned BERT model checkpoint
- `models/bilstm_best.h5` - BiLSTM with Attention model
- `results/*.json` - Comprehensive performance metrics
- `data/feature_info.json` - Feature vocabulary and metadata
- `visualizations/*.png` - Training curves, confusion matrices, ROC curves, comparisons
- `config.yaml` - Configuration settings
- `research_paper.tex` - Full research paper (Springer LNCS format)
## Ethical Considerations

⚠️ **Important Notice:**
This model is designed for research and educational purposes. When deploying for real-world applications:
- Transparency: Inform users when text is subject to AI detection
- Fairness: Evaluate for bias against non-native speakers or specific writing styles
- Privacy: Respect user privacy and data protection regulations
- Accuracy: Do not use as definitive proof; false positives (0.2%) can occur
- Context: Use as one signal among many, not as sole evidence
- Appeals: Provide mechanisms for users to contest decisions
Detection systems should support human judgment, not replace it.
## Author

**Huzaifa Nasir**
- Email: [email protected]
- Affiliation: National University of Computer and Emerging Sciences (FAST-NUCES), Pakistan
- GitHub: https://github.com/huzaifanasir95/Human-vs-AI-Classifier
- ORCID: 0009-0000-1482-3268
## License
MIT License - See LICENSE file for details.
## Acknowledgments
This project builds upon:
- HC3 Dataset: Human-ChatGPT comparison corpus (Guo et al., 2023)
- BERT: Pre-trained language model (Devlin et al., 2018)
- XGBoost: Gradient boosting framework (Chen & Guestrin, 2016)
- Scikit-learn: Machine learning library (Pedregosa et al., 2011)
Research conducted at FAST-NUCES Islamabad. Special thanks to the open-source community.
**Status:** ✅ Production-ready | Last updated: January 2025