XLM-RoBERTa Safety Classifier (Italian & English)

Model Description

This is an XLM-RoBERTa-based binary text classification model fine-tuned to detect toxicity and insults in user queries. It is trained on a bilingual dataset (Italian and English) to distinguish between SAFE (benign) and UNSAFE (toxic/harmful) inputs.

  • Model Type: XLM-RoBERTa (fine-tuned)
  • Languages: Italian (it), English (en)
  • Task: Binary Classification
  • Training Dataset Size: 9,035 samples
  • Model Size: ~0.1B parameters (F32, Safetensors)
  • Created by: Famezz

Intended Use

This model is designed to act as a guardrail for chatbots and LLMs. It can be used to:

  1. Filter out toxic user inputs before they reach a Large Language Model (see the sketch after this list).
  2. Flag offensive content in user-generated text.
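As an illustration of the first use case, here is a minimal guardrail sketch. The 0.5 confidence threshold and the generate_reply function are illustrative assumptions standing in for your own downstream LLM call; they are not part of this model:

from transformers import pipeline

classifier = pipeline("text-classification", model="Famezz/roberta_safety_classifier")

def guarded_reply(user_input: str, threshold: float = 0.5) -> str:
    # Classify the input before it reaches the downstream LLM.
    result = classifier(user_input)[0]
    if result["label"] == "UNSAFE" and result["score"] >= threshold:
        # Refuse instead of forwarding the request.
        return "Sorry, I can't help with that."
    # generate_reply is a hypothetical placeholder for your own LLM call.
    return generate_reply(user_input)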

Label Mapping

The model is trained to predict the following string labels directly:

  • SAFE: Benign queries, general knowledge, small talk.
  • UNSAFE: Toxic content, insults, offensive language.
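
If you load the checkpoint without the pipeline, the same string labels can be read from the model configuration. The snippet below is a small sketch, assuming the labels are stored in the config's id2label mapping (as the pipeline output suggests):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Famezz/roberta_safety_classifier")
# Index-to-label and label-to-index mappings saved with the checkpoint.
print(config.id2label)   # expected to contain 'SAFE' and 'UNSAFE'
print(config.label2id)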

Usage

You can use this model directly with the Hugging Face pipeline. The pipeline will automatically output the labels "SAFE" or "UNSAFE".

from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Famezz/roberta_safety_classifier")

# Test with English
print(classifier("How do I bake a cake?"))
# Output: [{'label': 'SAFE', 'score': 0.99}]

# Test with Italian ("You are an idiot")
print(classifier("Sei un idiota"))
# Output: [{'label': 'UNSAFE', 'score': 0.98}]
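
For batch scoring or custom thresholds, the tokenizer and model can also be called directly instead of going through the pipeline. The following is a sketch, assuming the checkpoint uses a standard sequence-classification head:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Famezz/roberta_safety_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

texts = ["How do I bake a cake?", "Sei un idiota"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and look up the string labels from the config.
probs = torch.softmax(logits, dim=-1)
pred_ids = probs.argmax(dim=-1)
for text, pid, p in zip(texts, pred_ids, probs):
    print(text, "->", model.config.id2label[pid.item()], round(p[pid].item(), 3))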