---
base_model: answerdotai/ModernBERT-base
library_name: peft
tags:
- text-classification
- reddit
- conversation-analysis
- constructive-dialogue
- modernbert
- lora
- transformers
- lightweight
- high-throughput
language:
- en
datasets:
- reddit
pipeline_tag: text-classification
repo_url: https://github.com/Niklas257/Reddit-Constructiveness-Classification.git
---

# ModernBERT Reddit Discussion Classifier

A lightweight, high-throughput ModernBERT-based model for classifying conversations in online forums such as Reddit as constructive or non-constructive. Optimized to process large volumes of Reddit discussion data efficiently.

## Model Description

This model is a QLoRA (Quantized LoRA) fine-tuned version of `answerdotai/ModernBERT-base` specifically designed as a **lightweight** solution for large-scale Reddit discussion analysis.

- **Model Type**: Text Classification (Binary)
- **Base Model**: answerdotai/ModernBERT-base
- **Training Method**: QLoRA with self-training
- **Task**: Binary classification of conversation constructiveness
- **Language**: English

### Model Source

- **Repository**: https://github.com/Niklas257/Reddit-Constructiveness-Classification.git

## Intended Uses

### Primary Use Case
- Classifying Reddit discussions as constructive or non-constructive
- Content moderation assistance
- Large-scale conversation quality analysis
- Social media research

### Direct Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2
)

# Load the fine-tuned adapters
model = PeftModel.from_pretrained(model, "NiklasKoch/modernbert-discussion-classifier")
model.eval()

# Classify text (optimized for batch processing)
def classify_text(text):
    inputs = tokenizer(
        text, 
        return_tensors="pt", 
        truncation=True, 
        padding=True, 
        max_length=4096
    )
    
    # Move inputs to same device as model (important for GPU usage)
    inputs = {k: v.to(next(model.parameters()).device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        
    # 0 = non-constructive, 1 = constructive
    predicted_class = torch.argmax(predictions, dim=-1).item()
    confidence = predictions[0][predicted_class].item()
    
    return {
        'class': 'constructive' if predicted_class == 1 else 'non-constructive',
        'confidence': confidence,
        'scores': {
            'non-constructive': predictions[0][0].item(),
            'constructive': predictions[0][1].item()
        }
    }

# Example usage - Reddit discussion
text = "[author0] LEGO: What do you think you're doing?!? [author1] I don't get it did he reveal bionicle reboot or smthn? [author2] Not really, he did announce something but was super vague, seems like a sort of passion project we wants to do with the community, he even said it might not even be bionicle. [author1] So is that image fan made or is it one of his passion projects [author2] Those pictures are real and on his insta, he did a stream talking about it I'm sure you can find somewhere, search up Fabre bionicle stream 2020 or something. [author1] OK thanks"
result = classify_text(text)
print(result)
```

## Training Details

### Training Data
- **Source**: https://archive.org/download/pushshift_reddit_200506_to_202212/
- **Size**: ~1.4 million Reddit threads, filtered to English-language threads with at least two authors
- **Labels**: Binary (constructive/non-constructive conversations)
- **Additional Data**: YNACC and IAC datasets for initial supervised training

### Training Procedure
- **Training Method**: Self-training
- **Quantization**: 4-bit QLoRA for efficiency
- **LoRA Config**:
  - `r`: 16
  - `lora_alpha`: 32
  - `lora_dropout`: 0.1
  - Target modules: `Wqkv`, `Wo`, `Wi`, `dense`
- **Loss Function**: Focal Loss with class weighting
- **Max Sequence Length**: 4096 tokens
- **Batch Size**: 64
- **Learning Rate**: 2e-6
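The card does not include the training loss code. A minimal sketch of focal loss with per-class weighting is shown below; the class-weight vector `alpha` and the exact focusing parameter used in training are not stated, so `gamma=2.0` (the common default) is an assumption:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha, gamma=2.0):
    """Focal loss with class weighting (sketch).
    logits: [batch, num_classes]; targets: [batch] int64 class indices;
    alpha: [num_classes] per-class weight tensor."""
    log_probs = F.log_softmax(logits, dim=-1)
    # log-probability and probability of the true class for each example
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    # down-weight easy examples by (1 - p_t)^gamma, scale by class weight
    return (-alpha[targets] * (1.0 - pt) ** gamma * log_pt).mean()
```

With `gamma=0` and uniform `alpha` this reduces to plain cross-entropy; increasing `gamma` shifts the gradient toward hard, misclassified examples, which helps when one class dominates.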

### Training Hardware
- 48 hours on 4x NVIDIA A100 40GB GPUs

## Performance

### Evaluation Results

```
YNACC:
Accuracy: 0.63
Precision: 0.63
F1-Score: 0.65

IAC:
Accuracy: 0.79
Precision: 0.85
F1-Score: 0.87

Reddit:
Accuracy: 0.57
Precision: 0.74
F1-Score: 0.67
```
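Recall is not reported, but it can be recovered from precision and F1: since F1 = 2PR/(P+R), it follows that R = P * F1 / (2P - F1). A quick check (the helper name is mine, not from the repository):

```python
def recall_from(precision, f1):
    # F1 = 2*P*R / (P + R)  =>  R = P * F1 / (2*P - F1)
    return precision * f1 / (2 * precision - f1)

# e.g. the Reddit split above: P = 0.74, F1 = 0.67 -> R ~ 0.61
```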

## Limitations and Bias

- **Language**: English only
- **Bias**: May reflect biases present in Reddit discussions and training data

## Ethical Considerations

- Human oversight is recommended for important moderation decisions  

## Technical Specifications

- **Model Architecture**: ModernBERT + Classification Head
- **Parameters**: ~150M base + LoRA adapters + classification head
- **Precision**: 4-bit quantized base model with full-precision adapters
- **Framework**: PyTorch, Transformers, PEFT (any recent version; harmless warnings about unrecognized configuration parameters may appear)

## Model Card Authors

Niklas Koch, Georg August University of Göttingen

## Model Card Contact

[email protected]