---
base_model: answerdotai/ModernBERT-base
library_name: peft
tags:
- text-classification
- reddit
- conversation-analysis
- constructive-dialogue
- modernbert
- lora
- transformers
- lightweight
- high-throughput
language:
- en
datasets:
- reddit
pipeline_tag: text-classification
repo_url: https://github.com/Niklas257/Reddit-Constructiveness-Classification.git
---

# ModernBERT Reddit Discussion Classifier

A lightweight, high-throughput ModernBERT-based model for classifying constructive vs. non-constructive conversations in online forums such as Reddit, optimized for efficiently processing large volumes of Reddit discussion data.

## Model Description

This model is a QLoRA (Quantized LoRA) fine-tuned version of `answerdotai/ModernBERT-base`, specifically designed as a **lightweight** solution for large-scale Reddit discussion analysis.

- **Model Type**: Text Classification (Binary)
- **Base Model**: answerdotai/ModernBERT-base
- **Training Method**: QLoRA with self-training
- **Task**: Binary classification of conversation constructiveness
- **Language**: English

### Model Source

- **Repository**: https://github.com/Niklas257/Reddit-Constructiveness-Classification.git

## Intended Uses

### Primary Use Case

- Classifying Reddit discussions as constructive or non-constructive
- Content moderation assistance
- Large-scale conversation quality analysis
- Social media research

### Direct Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2
)

# Load the fine-tuned adapters
model = PeftModel.from_pretrained(model, "NiklasKoch/modernbert-discussion-classifier")
model.eval()

# Classify text (optimized for batch processing)
def classify_text(text):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=4096
    )

    # Move inputs to the same device as the model (important for GPU usage)
    inputs = {k: v.to(next(model.parameters()).device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

    # 0 = non-constructive, 1 = constructive
    predicted_class = torch.argmax(predictions, dim=-1).item()
    confidence = predictions[0][predicted_class].item()

    return {
        'class': 'constructive' if predicted_class == 1 else 'non-constructive',
        'confidence': confidence,
        'scores': {
            'non-constructive': predictions[0][0].item(),
            'constructive': predictions[0][1].item()
        }
    }

# Example usage - Reddit discussion
text = "[author0] LEGO: What do you think you're doing?!? [author1] I don't get it did he reveal bionicle reboot or smthn? [author2] Not really, he did announce something but was super vague, seems like a sort of passion project we wants to do with the community, he even said it might not even be bionicle. [author1] So is that image fan made or is it one of his passion projects [author2] Those pictures are real and on his insta, he did a stream talking about it I'm sure you can find somewhere, search up Fabre bionicle stream 2020 or something. [author1] OK thanks"
result = classify_text(text)
print(result)
```
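
Since the model is intended for high-throughput analysis, you will usually want to score many threads per forward pass rather than calling `classify_text` one thread at a time. The helper below is a minimal sketch that reuses the `model` and `tokenizer` loaded above; `classify_batch` and its `batch_size` default are illustrative choices, not part of the released code.

```python
def classify_batch(texts, batch_size=32):
    """Classify a list of discussion threads in batches (illustrative helper)."""
    results = []
    device = next(model.parameters()).device
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=4096
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.nn.functional.softmax(logits, dim=-1)
        preds = torch.argmax(probs, dim=-1)
        for pred, prob in zip(preds.tolist(), probs.tolist()):
            results.append({
                'class': 'constructive' if pred == 1 else 'non-constructive',
                'confidence': prob[pred]
            })
    return results
```

Larger batches generally improve GPU throughput until memory becomes the bottleneck; with 4096-token inputs you may need to reduce the batch size accordingly.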

## Training Details

### Training Data

- **Source**: https://archive.org/download/pushshift_reddit_200506_to_202212/
- **Size**: ~1.4 million Reddit threads, filtered for English language and a minimum of 2 authors (see the filtering sketch below)
- **Labels**: Binary (constructive/non-constructive conversations)
- **Additional Data**: YNACC and IAC datasets for initial supervised training
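
The released preprocessing code is not reproduced here, so the snippet below is only an illustrative sketch of the two filters mentioned above (English language, at least 2 distinct authors). The assumed thread/comment structure and the use of `langdetect` are assumptions, not the project's actual pipeline.

```python
from langdetect import detect

def keep_thread(thread):
    """Keep a thread only if it is English and has at least 2 distinct authors (illustrative filter)."""
    comments = thread["comments"]                 # assumed structure: list of comment dicts
    authors = {c["author"] for c in comments}     # distinct commenting authors
    text = " ".join(c["body"] for c in comments)  # concatenated thread text
    try:
        is_english = detect(text) == "en"
    except Exception:  # langdetect raises on empty or undetectable text
        return False
    return is_english and len(authors) >= 2
```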

### Training Procedure

- **Training Method**: Self-training
- **Quantization**: 4-bit QLoRA for efficiency
- **LoRA Config** (see the sketch after this list):
  - `r`: 16
  - `lora_alpha`: 32
  - `lora_dropout`: 0.1
  - Target modules: `Wqkv`, `Wo`, `Wi`, `dense`
- **Loss Function**: Focal Loss with class weighting
- **Max Sequence Length**: 4096 tokens
- **Batch Size**: 64
- **Learning Rate**: 2e-6
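
The LoRA settings above map directly onto a PEFT `LoraConfig`, and the loss is a standard focal loss over the two classes. The snippet below is a reference sketch only: the `gamma` value and the example class weights are assumptions, since the exact values are not documented here.

```python
import torch
from peft import LoraConfig, TaskType

# LoRA configuration mirroring the values listed above
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["Wqkv", "Wo", "Wi", "dense"]
)

def focal_loss(logits, labels, class_weights, gamma=2.0):
    """Focal loss with per-class weighting; gamma=2.0 is an illustrative default."""
    log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
    # Class-weighted cross-entropy per example
    ce = torch.nn.functional.nll_loss(log_probs, labels, weight=class_weights, reduction="none")
    # Probability assigned to the true class, used to down-weight easy examples
    pt = log_probs.exp().gather(1, labels.unsqueeze(1)).squeeze(1)
    return ((1.0 - pt) ** gamma * ce).mean()

# Example class weights (illustrative, not the trained values)
class_weights = torch.tensor([1.0, 2.0])
```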

### Training Hardware

- 48 hours on 4x NVIDIA A100 40GB GPUs

## Performance

### Evaluation Results

| Dataset | Accuracy | Precision | F1-Score |
|---------|----------|-----------|----------|
| YNACC   | 0.63     | 0.63      | 0.65     |
| IAC     | 0.79     | 0.85      | 0.87     |
| Reddit  | 0.57     | 0.74      | 0.67     |
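
For reference, the snippet below shows how such scores can be computed with standard scikit-learn metrics, treating "constructive" (label 1) as the positive class; the labels and predictions here are placeholders, not the evaluation data.

```python
from sklearn.metrics import accuracy_score, precision_score, f1_score

# Placeholder gold labels and predictions (1 = constructive, 0 = non-constructive)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"F1-Score:  {f1_score(y_true, y_pred):.2f}")
```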

## Limitations and Bias

- **Language**: English only
- **Bias**: May reflect biases present in Reddit discussions and training data

## Ethical Considerations

- Human oversight is recommended for important moderation decisions

## Technical Specifications

- **Model Architecture**: ModernBERT + Classification Head
- **Parameters**: ~150M base + LoRA adapters + classification head
- **Precision**: 4-bit quantized base model with full-precision adapters (see the loading sketch below)
- **Framework**: PyTorch, Transformers, PEFT (any recent version; you may see harmless warnings about configuration parameters)
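
If GPU memory is tight, the base model can also be loaded in 4-bit for inference and the adapters applied on top, mirroring the QLoRA setup. This is a sketch: the `nf4` quantization type and `bfloat16` compute dtype are common defaults assumed here, not documented values.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization config for memory-constrained inference (assumed settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
base = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    quantization_config=bnb_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(base, "NiklasKoch/modernbert-discussion-classifier")
model.eval()
```

This path requires the `bitsandbytes` package and a CUDA-capable GPU.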

## Model Card Authors

Niklas Koch, Georg August University of Göttingen

## Model Card Contact

[email protected]