Turkish CEFR Classifier

A text classifier that assigns Turkish text to one of the six CEFR (Common European Framework of Reference for Languages) proficiency levels, A1 through C2.

Fine-tuned from dbmdz/bert-base-turkish-cased on transcripts of spoken Turkish from YouTube.

Model Performance

Metric      Score
Macro F1    0.84
Accuracy    0.86

Per-Class Performance

Level    Precision    Recall    F1
A1       0.98         0.96      0.97
A2       0.89         0.82      0.85
B1       0.62         0.77      0.69
B2       0.76         0.71      0.73
C1       0.79         0.88      0.83
C2       1.00         0.95      0.97
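
As a consistency check, the per-class F1 values and the headline macro F1 can be recomputed from the precision and recall columns. The sketch below assumes that F1 is the harmonic mean of precision and recall and that macro F1 is the unweighted mean of the per-class F1 scores (the usual definitions); the variable and function names are illustrative and not part of the model.

```python
# Per-class (precision, recall), copied from the table above.
per_class = {
    "A1": (0.98, 0.96),
    "A2": (0.89, 0.82),
    "B1": (0.62, 0.77),
    "B2": (0.76, 0.71),
    "C1": (0.79, 0.88),
    "C2": (1.00, 0.95),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Recompute F1 for each level, then average with equal weight per level.
f1_scores = {level: f1(p, r) for level, (p, r) in per_class.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for level, score in f1_scores.items():
    print(f"{level}: {score:.2f}")
print(f"Macro F1: {macro_f1:.2f}")  # prints "Macro F1: 0.84"
```

The recomputed values round to the figures in the table, and the mean rounds to the reported macro F1 of 0.84.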

Usage

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="crklih/turkish-cefr-classifier"
)

result = classifier("Osmanlı Devleti'nin çöküşü çok boyutlu bir süreçtir.")
# Example output: [{'label': 'B2', 'score': 0.95}]
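
To classify several texts at once and inspect the scores for every level, the pipeline can be called with a list of strings and `top_k=None`. The following is a minimal sketch, not part of the model's own code: `top_levels` and `stub_classifier` are hypothetical names, and the stub returns made-up scores so the example runs without downloading the model.

```python
def top_levels(classifier, texts):
    """Classify a list of texts and keep the best-scoring level for each."""
    # With a transformers text-classification pipeline, top_k=None
    # returns the scores for all labels instead of only the best one.
    all_scores = classifier(texts, top_k=None)
    results = []
    for text, scores in zip(texts, all_scores):
        best = max(scores, key=lambda s: s["score"])
        results.append({"text": text, "label": best["label"], "score": best["score"]})
    return results


def stub_classifier(texts, top_k=None):
    # Stand-in for the real pipeline; the scores below are invented.
    fake = [
        {"label": "B1", "score": 0.2},
        {"label": "B2", "score": 0.7},
        {"label": "C1", "score": 0.1},
    ]
    return [list(fake) for _ in texts]


demo = top_levels(stub_classifier, ["Merhaba!", "Bugün hava çok güzel."])
print(demo)
```

With the real model, pass the `classifier` object created above in place of `stub_classifier`.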

CEFR Levels

Level    Description
A1       Beginner: basic words and simple phrases
A2       Elementary: simple sentences on everyday topics
B1       Intermediate: familiar topics, some complex structures
B2       Upper-Intermediate: complex sentences, abstract topics
C1       Advanced: academic/technical language, complex structures
C2       Proficient: near-native, nuanced, rare vocabulary

Training Data

  • Curated phrases of spoken Turkish
  • Balanced across all six CEFR levels
  • Agreement rate between LLM and human labels: 74.6%

Intended Use

  • Turkish language learning applications
  • Content difficulty assessment
  • Educational content filtering by proficiency level
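
Filtering content by proficiency level can be sketched as follows. This assumes the model's labels are the strings A1 through C2, as in the usage example, and that predictions have already been collected as (text, label) pairs. `CEFR_ORDER`, `filter_by_max_level`, and the sample predictions are hypothetical and for illustration only.

```python
# CEFR levels from easiest to hardest.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def filter_by_max_level(items, max_level):
    """Keep (text, label) pairs whose predicted level is at or below max_level."""
    limit = CEFR_ORDER.index(max_level)
    return [(text, label) for text, label in items if CEFR_ORDER.index(label) <= limit]


# Illustrative predictions; in practice the labels come from the classifier.
predicted = [
    ("Merhaba, nasılsın?", "A1"),
    ("Osmanlı Devleti'nin çöküşü çok boyutlu bir süreçtir.", "B2"),
]

suitable = filter_by_max_level(predicted, "B1")
print(suitable)
```

Comparing positions in an ordered list avoids comparing the label strings directly and makes the level ordering explicit.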

Limitations

  • Trained on spoken Turkish, so it may underperform on formal written text
  • The B1/B2 boundary is ambiguous: these two levels have the lowest per-class F1 (0.69 and 0.73)
  • Short phrases with technical vocabulary may be assigned a lower level than they warrant
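
One possible way to handle ambiguous boundaries such as B1/B2 is to flag predictions where the two best-scoring levels are adjacent and close in score, and then report a range (for example "B1-B2") or send the text for manual review. The sketch below is an assumption, not part of the model: `is_ambiguous` and the `margin` threshold of 0.2 are hypothetical and would need tuning on held-out data, and the input is assumed to be the full list of label scores for one text.

```python
# CEFR levels from easiest to hardest.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def is_ambiguous(scores, margin=0.2):
    """True when the two best labels are adjacent levels with close scores."""
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    first, second = ranked[0], ranked[1]
    distance = abs(CEFR_ORDER.index(first["label"]) - CEFR_ORDER.index(second["label"]))
    return distance == 1 and (first["score"] - second["score"]) < margin


# Made-up score distributions for illustration.
close_call = [
    {"label": "B1", "score": 0.48},
    {"label": "B2", "score": 0.42},
    {"label": "A2", "score": 0.10},
]
clear_call = [
    {"label": "B2", "score": 0.95},
    {"label": "B1", "score": 0.03},
    {"label": "C1", "score": 0.02},
]

print(is_ambiguous(close_call))
print(is_ambiguous(clear_call))
```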