Turkish Toxic Language Detection Model
This model is a fine-tuned version of dbmdz/bert-base-turkish-cased for binary toxicity classification in Turkish text. It was trained using a cleaned and preprocessed version of the Overfit-GM/turkish-toxic-language dataset. 
Performance
| Metric | Non-Toxic | Toxic | Macro Avg | 
|---|---|---|---|
| Precision | 0.96 | 0.95 | 0.96 | 
| Recall | 0.95 | 0.96 | 0.96 | 
| F1-score | 0.96 | 0.96 | 0.96 | 
| Accuracy (overall) | | | 0.96 |
| Test Samples | 5400 | 5414 | 10814 | 
Confusion Matrix
| | Pred: Non-Toxic | Pred: Toxic |
|---|---|---|
| True: Non-Toxic | 5154 | 246 | 
| True: Toxic | 200 | 5214 | 
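The per-class figures in the performance table can be re-derived from the confusion matrix. The sketch below is only an arithmetic check, not the evaluation script used for the model; the helper name `prf` is hypothetical, and the counts are copied from the matrix above.

```python
# Confusion-matrix counts copied from the table above.
tn, fp = 5154, 246   # true Non-Toxic: predicted Non-Toxic / predicted Toxic
fn, tp = 200, 5214   # true Toxic:     predicted Non-Toxic / predicted Toxic


def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class from its raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


toxic = prf(tp, fp, fn)
# For the Non-Toxic class the roles of the two error types are swapped.
non_toxic = prf(tn, fn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Macro average: unweighted mean of the two per-class values.
macro = tuple((a + b) / 2 for a, b in zip(non_toxic, toxic))

for name, values in [("Non-Toxic", non_toxic), ("Toxic", toxic), ("Macro Avg", macro)]:
    print(name, [round(v, 2) for v in values])
print("Accuracy", round(accuracy, 2))
```

Rounded to two decimals, these values agree with the table.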
Preprocessing Details (cleaned_corrected_text)
The model is trained on the cleaned_corrected_text column, which is derived from corrected_text using basic regex-based cleaning steps and manual slang filtering. Here's how:
Cleaning Function
import re

def clean_corrected_text(text):
    """Apply the basic regex-based cleaning used to build cleaned_corrected_text."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)  # remove URLs
    text = re.sub(r"@\w+", '', text)  # remove @mentions
    text = re.sub(r"[^\w\s.,!?-]", '', text)  # keep only word characters, whitespace and . , ! ? - (drops emojis and symbols)
    text = re.sub(r"\s+", ' ', text).strip()  # collapse whitespace runs and trim the ends
    return text
Manual Slang Filtering
slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]
def remove_slang(text):
    # Plain substring replacement: a slang word is also removed when it occurs inside a longer word.
    for word in slang_words:
        text = text.replace(word, "")
    return text.strip()
Applied Steps Summary
| Step | Description | 
|---|---|
| Lowercasing | All text is converted to lowercase | 
| URL removal | Removes links that start with http, https, or www |
| Mention removal | Removes @username style mentions | 
| Special character removal | Removes emojis and symbols (*, %, $, ^, etc.); keeps letters, digits, underscores, whitespace, and . , ! ? - |
| Whitespace normalization | Collapses multiple spaces into one | 
| Slang word removal | Removes common informal words like "kanka", "lan", etc. | 
Conclusion: cleaned_corrected_text is a lightly cleaned text column with no linguistic processing applied. The model is trained directly on it.
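To pass new text through the same cleaning at inference time, the two functions can be chained. The sketch below restates them so that it runs on its own. It assumes that slang filtering was applied after the regex cleaning (the order in the summary table), which the card does not state explicitly. The wrapper name `preprocess` is hypothetical.

```python
import re

SLANG_WORDS = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]


def clean_corrected_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)  # URLs
    text = re.sub(r"@\w+", "", text)                                         # @mentions
    text = re.sub(r"[^\w\s.,!?-]", "", text)                                 # symbols, emojis
    text = re.sub(r"\s+", " ", text).strip()                                 # whitespace
    return text


def remove_slang(text):
    for word in SLANG_WORDS:
        text = text.replace(word, "")  # substring match, not whole-word match
    return text.strip()


def preprocess(text):
    # Assumed order: regex cleaning first, then slang removal.
    return remove_slang(clean_corrected_text(text))


print(preprocess("Kanka şuna bak!!! https://example.com @ali $$"))  # şuna bak!!!
print(preprocess("selam"))                                          # sem
```

The second call shows a side effect of the substring-based filter: the entry "la" is also stripped from inside "selam". Text cleaned this way matches the training column most closely, even where the result is not a valid word.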
Example Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("fc63/turkish_toxic_language_detection_model")
model = AutoModelForSequenceClassification.from_pretrained("fc63/turkish_toxic_language_detection_model")
model.eval()  # inference mode (disables dropout)
def predict_toxicity(text):
    # Tokenize a single string; inputs longer than 128 tokens are truncated
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**inputs)
    predicted = torch.argmax(outputs.logits, dim=1).item()  # index of the highest-scoring class
    return "Toxic" if predicted == 1 else "Non-Toxic"  # class 1 = Toxic
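`predict_toxicity` returns only a hard label. When a confidence score or a custom decision threshold is useful, the two logits can be converted to a probability with a softmax. The following is a minimal sketch in plain Python, assuming class index 1 is Toxic as in the example above. The function names and the `threshold` parameter are hypothetical and not part of the model's API.

```python
import math


def toxic_probability(logits):
    """Softmax probability of class index 1 (Toxic) from a pair of raw logits."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def label_with_threshold(logits, threshold=0.5):
    """Return (label, probability); `threshold` is an illustrative cut-off."""
    p = toxic_probability(logits)
    return ("Toxic" if p >= threshold else "Non-Toxic"), p


# Example with made-up logits in the order [Non-Toxic, Toxic]:
label, prob = label_with_threshold([-1.2, 2.3])
print(label, round(prob, 3))
```

With the model loaded, the input list could be obtained as `outputs.logits[0].tolist()`. At a threshold of 0.5 the result matches the argmax decision, except in the case of an exact tie.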
Training Details
- Trainer: Hugging Face Trainer API
- Epochs: 3
- Batch size: 16
- Learning Rate: 2e-5
- Eval Strategy: Epoch-based
- Undersampling: Applied to balance class distribution
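The card does not say which undersampling method was used. As an illustration only, the sketch below shows plain random undersampling, where every class is reduced to the size of the smallest one. The function name, the `label_of` callback, and the `seed` value are all assumptions.

```python
import random
from collections import defaultdict


def undersample(samples, label_of, seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[label_of(sample)].append(sample)

    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)             # fixed seed for reproducibility

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)                 # avoid class-ordered output
    return balanced


# Toy data: 10 non-toxic (label 0) and 4 toxic (label 1) rows.
rows = [(f"text {i}", 0) for i in range(10)] + [(f"text {i}", 1) for i in range(10, 14)]
balanced_rows = undersample(rows, label_of=lambda row: row[1])
print(len(balanced_rows))  # 8
```

Random undersampling discards majority-class examples, trading some data for an evenly balanced label distribution.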
Dataset
Dataset used: Overfit-GM/turkish-toxic-language
Final dataset size after preprocessing and balancing: 54068 samples
