---
base_model: answerdotai/ModernBERT-base
library_name: peft
tags:
- text-classification
- reddit
- conversation-analysis
- constructive-dialogue
- modernbert
- lora
- transformers
- lightweight
- high-throughput
language:
- en
datasets:
- reddit
pipeline_tag: text-classification
repo_url: https://github.com/Niklas257/Reddit-Constructiveness-Classification.git
---
# ModernBERT Reddit Discussion Classifier
A lightweight, high-throughput ModernBERT-based model for classifying constructive vs. non-constructive conversations in online forums such as Reddit. Optimized for efficiently processing large volumes of Reddit discussion data.
## Model Description
This model is a QLoRA (Quantized LoRA) fine-tuned version of `answerdotai/ModernBERT-base` specifically designed as a **lightweight** solution for large-scale Reddit discussion analysis.
- **Model Type**: Text Classification (Binary)
- **Base Model**: answerdotai/ModernBERT-base
- **Training Method**: QLoRA with self-training
- **Task**: Binary classification of conversation constructiveness
- **Language**: English
### Model Source
- **Repository**: https://github.com/Niklas257/Reddit-Constructiveness-Classification.git
## Intended Uses
### Primary Use Case
- Classifying Reddit discussions as constructive or non-constructive
- Content moderation assistance
- Large-scale conversation quality analysis
- Social media research
### Direct Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2
)

# Load the fine-tuned adapters
model = PeftModel.from_pretrained(model, "NiklasKoch/modernbert-discussion-classifier")
model.eval()

# Classify a single text (see the batch-processing sketch below for high-throughput use)
def classify_text(text):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=4096
    )
    # Move inputs to the same device as the model (important for GPU usage)
    inputs = {k: v.to(next(model.parameters()).device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    # 0 = non-constructive, 1 = constructive
    predicted_class = torch.argmax(predictions, dim=-1).item()
    confidence = predictions[0][predicted_class].item()
    return {
        'class': 'constructive' if predicted_class == 1 else 'non-constructive',
        'confidence': confidence,
        'scores': {
            'non-constructive': predictions[0][0].item(),
            'constructive': predictions[0][1].item()
        }
    }

# Example usage - Reddit discussion
text = "[author0] LEGO: What do you think you're doing?!? [author1] I don't get it did he reveal bionicle reboot or smthn? [author2] Not really, he did announce something but was super vague, seems like a sort of passion project we wants to do with the community, he even said it might not even be bionicle. [author1] So is that image fan made or is it one of his passion projects [author2] Those pictures are real and on his insta, he did a stream talking about it I'm sure you can find somewhere, search up Fabre bionicle stream 2020 or something. [author1] OK thanks"
result = classify_text(text)
print(result)
```
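The helper above scores one text at a time. For high-throughput use on large volumes of Reddit data, inputs can be tokenized and scored in batches. The following is a minimal sketch reusing the `model` and `tokenizer` loaded above; the batch size and the fields of the returned dictionaries are illustrative choices, not part of the original repository.

```python
def classify_batch(texts, batch_size=32):
    """Classify a list of discussion texts in batches (sketch; batch_size is illustrative)."""
    device = next(model.parameters()).device
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=4096,
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            logits = model(**inputs).logits
            probs = torch.nn.functional.softmax(logits, dim=-1)
        for row in probs:
            predicted = int(torch.argmax(row).item())
            results.append({
                'class': 'constructive' if predicted == 1 else 'non-constructive',
                'confidence': row[predicted].item(),
            })
    return results

# Example usage
# batch_results = classify_batch([text, text])
```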
## Training Details
### Training Data
- **Source**: https://archive.org/download/pushshift_reddit_200506_to_202212/
- **Size**: ~1.4 million Reddit threads, filtered for English language and a minimum of two authors
- **Labels**: Binary (constructive/non-constructive conversations)
- **Additional Data**: YNACC and IAC datasets for initial supervised training
### Training Procedure
- **Training Method**: Self-training
- **Quantization**: 4-bit QLoRA for efficiency
- **LoRA Config** (expressed as a `peft.LoraConfig` in the sketch after this list):
- `r`: 16
- `lora_alpha`: 32
- `lora_dropout`: 0.1
- Target modules: `Wqkv`, `Wo`, `Wi`, `dense`
- **Loss Function**: Focal Loss with class weighting (see the focal-loss sketch after this list)
- **Max Sequence Length**: 4096 tokens
- **Batch Size**: 64
- **Learning Rate**: 2e-6
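For reference, the hyperparameters above roughly correspond to the following PEFT / bitsandbytes setup. This is a sketch reconstructed from the listed values; the specific 4-bit quantization options (NF4, bfloat16 compute dtype) and the use of `prepare_model_for_kbit_training` are assumptions, not taken from the training code.

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization of the base model (quantization details are assumptions)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA configuration matching the values listed above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["Wqkv", "Wo", "Wi", "dense"],
    task_type="SEQ_CLS",
)

base = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    quantization_config=bnb_config,
)
base = prepare_model_for_kbit_training(base)
model = get_peft_model(base, lora_config)
```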
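The focal loss follows the standard form FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where α_t is the weight of the example's true class. Below is a minimal PyTorch sketch; the γ value and the class weights are illustrative, not the values used in training.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, class_weights, gamma=2.0):
    """Focal loss with per-class weighting (sketch; gamma and weights are illustrative)."""
    log_probs = F.log_softmax(logits, dim=-1)                     # (batch, num_classes)
    log_pt = log_probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # log-prob of true class
    pt = log_pt.exp()
    alpha_t = class_weights[labels]                               # weight of each true class
    loss = -alpha_t * (1.0 - pt) ** gamma * log_pt
    return loss.mean()
```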
### Training Hardware
- 48 hours on 4x NVIDIA A100 40GB GPUs
## Performance
### Evaluation Results
| Dataset | Accuracy | Precision | F1-Score |
|---------|----------|-----------|----------|
| YNACC   | 0.63     | 0.63      | 0.65     |
| IAC     | 0.79     | 0.85      | 0.87     |
| Reddit  | 0.57     | 0.74      | 0.67     |
## Limitations and Bias
- **Language**: English only
- **Bias**: May reflect biases present in Reddit discussions and training data
## Ethical Considerations
- Human oversight is recommended for important moderation decisions
## Technical Specifications
- **Model Architecture**: ModernBERT + Classification Head
- **Parameters**: ~150M base + LoRA adapters + classification head
- **Precision**: 4-bit quantized base model with full-precision adapters
- **Framework**: PyTorch, Transformers, PEFT (any recent version; harmless warnings about unrecognized configuration parameters may appear)
## Model Card Authors
Niklas Koch, Georg August University of Göttingen
## Model Card Contact
[email protected]