Turkish CEFR Classifier
A Turkish text classifier that predicts the CEFR (Common European Framework of Reference for Languages) proficiency level of a text, from A1 to C2.
Fine-tuned from dbmdz/bert-base-turkish-cased on transcripts of spoken Turkish from YouTube.
Model Performance
| Metric | Score |
|---|---|
| Macro F1 | 0.84 |
| Accuracy | 0.86 |
Per-Class Performance
| Level | Precision | Recall | F1 |
|---|---|---|---|
| A1 | 0.98 | 0.96 | 0.97 |
| A2 | 0.89 | 0.82 | 0.85 |
| B1 | 0.62 | 0.77 | 0.69 |
| B2 | 0.76 | 0.71 | 0.73 |
| C1 | 0.79 | 0.88 | 0.83 |
| C2 | 1.00 | 0.95 | 0.97 |
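The macro F1 reported above is the unweighted mean of the per-class F1 scores, and each F1 is the harmonic mean of precision and recall. As a consistency check, the following minimal sketch recomputes both from the precision and recall values in the table. The values are the rounded figures copied from the table, so the result is only accurate to about two decimal places.

```python
# Precision and recall per CEFR level, copied from the per-class table.
per_class = {
    "A1": (0.98, 0.96),
    "A2": (0.89, 0.82),
    "B1": (0.62, 0.77),
    "B2": (0.76, 0.71),
    "C1": (0.79, 0.88),
    "C2": (1.00, 0.95),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-class F1, then the unweighted (macro) average over the six levels.
f1_scores = {level: f1(p, r) for level, (p, r) in per_class.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for level, score in f1_scores.items():
    print(f"{level}: {score:.2f}")
print(f"Macro F1: {macro_f1:.2f}")
```

Rounded to two decimals, the recomputed per-class scores and the macro average (0.84) match the reported figures.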
Usage
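from transformers import pipeline
# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="crklih/turkish-cefr-classifier",
)
# Returns the most likely CEFR level and its score.
result = classifier("Osmanlı Devleti'nin çöküşü çok boyutlu bir süreçtir.")
print(result)
# Example output: [{'label': 'B2', 'score': 0.95}]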
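To classify several texts at once, or to inspect the scores for all six levels rather than only the best one, the pipeline can be called with a list and `top_k=None`. The following is a minimal sketch, assuming a recent `transformers` release in which `top_k=None` returns one list of label/score dicts per input. The helper names `top_level` and `classify_batch` are hypothetical and not part of the model or the library.

```python
def top_level(scores):
    """Return the (label, score) pair with the highest score.

    `scores` is a list of {"label": ..., "score": ...} dicts, the shape
    assumed for one input when the pipeline is called with top_k=None.
    """
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]


def classify_batch(texts, model_name="crklih/turkish-cefr-classifier"):
    """Classify a list of texts and return all label scores per text."""
    # Imported lazily so top_level() can be used without transformers.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_name)
    # top_k=None requests the scores of every label, not just the top one.
    return classifier(texts, top_k=None)
```

Calling `classify_batch(["Merhaba, benim adım Ayşe."])` downloads the model on first use; passing each returned score list to `top_level` recovers the single best label.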
CEFR Levels
| Level | Description |
|---|---|
| A1 | Beginner — basic words and simple phrases |
| A2 | Elementary — simple sentences on everyday topics |
| B1 | Intermediate — familiar topics, some complex structures |
| B2 | Upper-Intermediate — complex sentences, abstract topics |
| C1 | Advanced — academic/technical language, complex structures |
| C2 | Proficient — near-native, nuanced, rare vocabulary |
Training Data
- Curated spoken Turkish phrases drawn from YouTube transcripts
- Balanced distribution across all 6 CEFR levels
- Agreement rate between LLM and human labels: 74.6%
Intended Use
- Turkish language learning applications
- Content difficulty assessment
- Educational content filtering by proficiency level
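Filtering content by proficiency level relies on the fact that CEFR levels are ordered. The following is a minimal sketch, assuming the predicted labels are the strings `A1` to `C2` as in the usage example. `CEFR_ORDER` and `filter_by_level` are hypothetical names, and the sample labels are made up for illustration rather than real model output.

```python
# CEFR levels from lowest to highest proficiency.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def filter_by_level(texts, labels, max_level):
    """Keep the texts whose predicted level is at or below `max_level`."""
    limit = CEFR_ORDER.index(max_level)
    return [
        text
        for text, label in zip(texts, labels)
        if CEFR_ORDER.index(label) <= limit
    ]


# Illustrative inputs: the labels here are invented, not model predictions.
texts = ["text one", "text two", "text three"]
labels = ["A2", "C1", "B1"]
suitable = filter_by_level(texts, labels, max_level="B1")
print(suitable)
```

In practice, `labels` would be taken from the `label` field of the classifier output for each text.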
Limitations
- Trained on spoken Turkish, so it may underperform on formal written text
- The B1/B2 boundary is the hardest to separate: B1 (F1 0.69) and B2 (F1 0.73) have the lowest per-class F1 scores
- Short phrases containing technical vocabulary may be assigned too low a level
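One way an application might account for the B1/B2 ambiguity is to route low-confidence B1 or B2 predictions to manual review. The following is a minimal sketch of that idea. The helper `needs_review` is hypothetical, and the threshold of 0.6 is an arbitrary illustrative value, not one calibrated for this model.

```python
def needs_review(prediction, threshold=0.6):
    """Flag a B1 or B2 prediction whose score falls below `threshold`.

    `prediction` is one {"label": ..., "score": ...} dict, the shape shown
    in the usage example. The default threshold is an assumption.
    """
    is_b_level = prediction["label"] in {"B1", "B2"}
    return is_b_level and prediction["score"] < threshold
```

A suitable threshold would need to be chosen on held-out data for the target application.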