---
base_model: answerdotai/ModernBERT-base
library_name: peft
tags:
- text-classification
- reddit
- conversation-analysis
- constructive-dialogue
- modernbert
- lora
- transformers
- lightweight
- high-throughput
language:
- en
datasets:
- reddit
pipeline_tag: text-classification
repo_url: https://github.com/Niklas257/Reddit-Constructiveness-Classification.git
---

# ModernBERT Reddit Discussion Classifier

A lightweight, high-throughput ModernBERT-based model for classifying constructive vs. non-constructive conversations in online forums such as Reddit, optimized for efficiently processing large volumes of Reddit discussion data.

## Model Description

This model is a QLoRA (Quantized LoRA) fine-tuned version of `answerdotai/ModernBERT-base`, specifically designed as a **lightweight** solution for large-scale Reddit discussion analysis.

- **Model Type**: Text Classification (Binary)
- **Base Model**: answerdotai/ModernBERT-base
- **Training Method**: QLoRA with self-training
- **Task**: Binary classification of conversation constructiveness
- **Language**: English

### Model Source

- **Repository**: https://github.com/Niklas257/Reddit-Constructiveness-Classification.git

## Intended Uses

### Primary Use Case

- Classifying Reddit discussions as constructive or non-constructive
- Content moderation assistance
- Large-scale conversation quality analysis
- Social media research

### Direct Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2
)

# Load the fine-tuned adapters
model = PeftModel.from_pretrained(model, "NiklasKoch/modernbert-discussion-classifier")
model.eval()

# Classify text (optimized for batch processing)
def classify_text(text):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=4096
    )

    # Move inputs to the same device as the model (important for GPU usage)
    inputs = {k: v.to(next(model.parameters()).device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

    # 0 = non-constructive, 1 = constructive
    predicted_class = torch.argmax(predictions, dim=-1).item()
    confidence = predictions[0][predicted_class].item()

    return {
        'class': 'constructive' if predicted_class == 1 else 'non-constructive',
        'confidence': confidence,
        'scores': {
            'non-constructive': predictions[0][0].item(),
            'constructive': predictions[0][1].item()
        }
    }

# Example usage - Reddit discussion
text = "[author0] LEGO: What do you think you're doing?!? [author1] I don't get it did he reveal bionicle reboot or smthn? [author2] Not really, he did announce something but was super vague, seems like a sort of passion project we wants to do with the community, he even said it might not even be bionicle. [author1] So is that image fan made or is it one of his passion projects [author2] Those pictures are real and on his insta, he did a stream talking about it I'm sure you can find somewhere, search up Fabre bionicle stream 2020 or something. [author1] OK thanks"
result = classify_text(text)
print(result)
```
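
Since the model is intended for high-throughput analysis, you will usually want to score many threads per forward pass rather than calling `classify_text` one thread at a time. The helper below is a minimal sketch that reuses the `model` and `tokenizer` loaded above; `classify_batch` and its `batch_size` default are illustrative choices, not part of the released code.

```python
def classify_batch(texts, batch_size=32):
    """Classify a list of discussion threads in batches (illustrative helper)."""
    results = []
    device = next(model.parameters()).device
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=4096
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.nn.functional.softmax(logits, dim=-1)
        preds = torch.argmax(probs, dim=-1)
        for pred, prob in zip(preds.tolist(), probs.tolist()):
            results.append({
                'class': 'constructive' if pred == 1 else 'non-constructive',
                'confidence': prob[pred]
            })
    return results
```

Larger batches generally improve GPU throughput until memory becomes the bottleneck; with 4096-token inputs you may need to reduce the batch size accordingly.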

## Training Details

### Training Data

- **Source**: https://archive.org/download/pushshift_reddit_200506_to_202212/
- **Size**: ~1.4 million Reddit threads, filtered for English language and a minimum of 2 authors (see the filtering sketch below)
- **Labels**: Binary (constructive/non-constructive conversations)
- **Additional Data**: YNACC and IAC datasets for initial supervised training
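
The released preprocessing code is not reproduced here, so the snippet below is only an illustrative sketch of the two filters mentioned above (English language, at least 2 distinct authors). The assumed thread/comment structure and the use of `langdetect` are assumptions, not the project's actual pipeline.

```python
from langdetect import detect

def keep_thread(thread):
    """Keep a thread only if it is English and has at least 2 distinct authors (illustrative filter)."""
    comments = thread["comments"]                 # assumed structure: list of comment dicts
    authors = {c["author"] for c in comments}     # distinct commenting authors
    text = " ".join(c["body"] for c in comments)  # concatenated thread text
    try:
        is_english = detect(text) == "en"
    except Exception:  # langdetect raises on empty or undetectable text
        return False
    return is_english and len(authors) >= 2
```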

### Training Procedure

- **Training Method**: Self-training
- **Quantization**: 4-bit QLoRA for efficiency
- **LoRA Config** (see the sketch after this list):
  - `r`: 16
  - `lora_alpha`: 32
  - `lora_dropout`: 0.1
  - Target modules: `Wqkv`, `Wo`, `Wi`, `dense`
- **Loss Function**: Focal Loss with class weighting
- **Max Sequence Length**: 4096 tokens
- **Batch Size**: 64
- **Learning Rate**: 2e-6
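
The LoRA settings above map directly onto a PEFT `LoraConfig`, and the loss is a standard focal loss over the two classes. The snippet below is a reference sketch only: the `gamma` value and the example class weights are assumptions, since the exact values are not documented here.

```python
import torch
from peft import LoraConfig, TaskType

# LoRA configuration mirroring the values listed above
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["Wqkv", "Wo", "Wi", "dense"]
)

def focal_loss(logits, labels, class_weights, gamma=2.0):
    """Focal loss with per-class weighting; gamma=2.0 is an illustrative default."""
    log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
    # Class-weighted cross-entropy per example
    ce = torch.nn.functional.nll_loss(log_probs, labels, weight=class_weights, reduction="none")
    # Probability assigned to the true class, used to down-weight easy examples
    pt = log_probs.exp().gather(1, labels.unsqueeze(1)).squeeze(1)
    return ((1.0 - pt) ** gamma * ce).mean()

# Example class weights (illustrative, not the trained values)
class_weights = torch.tensor([1.0, 2.0])
```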

### Training Hardware

- 48 hours on 4x NVIDIA A100 40GB GPUs

## Performance

### Evaluation Results

| Dataset | Accuracy | Precision | F1-Score |
|---------|----------|-----------|----------|
| YNACC   | 0.63     | 0.63      | 0.65     |
| IAC     | 0.79     | 0.85      | 0.87     |
| Reddit  | 0.57     | 0.74      | 0.67     |
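
For reference, the snippet below shows how such scores can be computed with standard scikit-learn metrics, treating "constructive" (label 1) as the positive class; the labels and predictions here are placeholders, not the evaluation data.

```python
from sklearn.metrics import accuracy_score, precision_score, f1_score

# Placeholder gold labels and predictions (1 = constructive, 0 = non-constructive)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"F1-Score:  {f1_score(y_true, y_pred):.2f}")
```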

## Limitations and Bias

- **Language**: English only
- **Bias**: May reflect biases present in Reddit discussions and training data

## Ethical Considerations

- Human oversight is recommended for important moderation decisions

## Technical Specifications

- **Model Architecture**: ModernBERT + Classification Head
- **Parameters**: ~150M base + LoRA adapters + classification head
- **Precision**: 4-bit quantized base model with full-precision adapters (see the loading sketch below)
- **Framework**: PyTorch, Transformers, PEFT (any recent version; you may see harmless warnings about configuration parameters)
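
If GPU memory is tight, the base model can also be loaded in 4-bit for inference and the adapters applied on top, mirroring the QLoRA setup. This is a sketch: the `nf4` quantization type and `bfloat16` compute dtype are common defaults assumed here, not documented values.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization config for memory-constrained inference (assumed settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
base = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    quantization_config=bnb_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(base, "NiklasKoch/modernbert-discussion-classifier")
model.eval()
```

This path requires the `bitsandbytes` package and a CUDA-capable GPU.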

## Model Card Authors

Niklas Koch, Georg August University of Göttingen

## Model Card Contact

[email protected]