XLM-RoBERTa Safety Classifier (Italian & English)
Model Description
This is an XLM-RoBERTa-based binary text classification model fine-tuned to detect toxicity and insults in user queries. It is trained on a bilingual dataset (Italian and English) to distinguish between SAFE (benign) and UNSAFE (toxic/harmful) inputs.
- Model Type: XLM-RoBERTa (Fine-tuned)
- Languages: Italian (it), English (en)
- Task: Binary Classification
- Training Dataset Size: 9,035 samples
- Created by: Famezz
Intended Use
This model is designed to act as a guardrail for chatbots and LLMs. It can be used to:
- Filter out toxic user inputs before they reach a Large Language Model (see the guardrail sketch after this list).
- Flag offensive content in user-generated text.
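A minimal guardrail sketch using the `pipeline` call shown in the Usage section below; the `is_safe` helper, the `threshold` parameter, and the placeholder LLM call are illustrative additions, not part of the model itself:

```python
from transformers import pipeline

# Load the safety classifier published on the Hub
classifier = pipeline("text-classification", model="Famezz/roberta_safety_classifier")

def is_safe(text: str, threshold: float = 0.5) -> bool:
    """Return True only if the top prediction is SAFE with at least `threshold` confidence."""
    prediction = classifier(text)[0]
    return prediction["label"] == "SAFE" and prediction["score"] >= threshold

user_query = "How do I bake a cake?"
if is_safe(user_query):
    print("Forwarding query to the LLM...")  # placeholder for your LLM call
else:
    print("Query blocked by the safety guardrail.")
```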
Label Mapping
The model is trained to predict the following string labels directly:
| Label | Description |
|---|---|
| SAFE | Benign queries, general knowledge, small talk. |
| UNSAFE | Toxic content, insults, offensive language. |
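To confirm the label set programmatically, you can read it from the checkpoint's configuration. This is a short sketch; the exact id-to-label ordering is an assumption, not stated in the card:

```python
from transformers import AutoConfig

# Load only the configuration to inspect the label mapping
config = AutoConfig.from_pretrained("Famezz/roberta_safety_classifier")
print(config.id2label)  # expected to contain 'SAFE' and 'UNSAFE'; index order is an assumption
print(config.label2id)
```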
Usage
You can use this model directly with the Hugging Face pipeline. The pipeline will automatically output the labels "SAFE" or "UNSAFE".
```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Famezz/roberta_safety_classifier")

# Test with English
print(classifier("How do I bake a cake?"))
# Output: [{'label': 'SAFE', 'score': 0.99}]

# Test with Italian
print(classifier("Sei un idiota"))
# Output: [{'label': 'UNSAFE', 'score': 0.98}]
```
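If you need the full probability distribution rather than only the top label (for example, to apply a stricter blocking threshold), the lower-level API can be used. The softmax and batching details below are illustrative and not prescribed by the model card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Famezz/roberta_safety_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

texts = ["How do I bake a cake?", "Sei un idiota"]  # Italian: "You are an idiot"
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
for text, row in zip(texts, probs):
    # Map each class index to its string label via the model config
    scores = {model.config.id2label[i]: round(float(p), 4) for i, p in enumerate(row)}
    print(text, scores)
```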
Base model: FacebookAI/xlm-roberta-base