Turkish CEFR Classifier

A Turkish text classifier that predicts CEFR (Common European Framework of Reference for Languages) proficiency levels from A1 to C2.

Fine-tuned from dbmdz/bert-base-turkish-cased on transcripts of spoken Turkish from YouTube.

Model Performance

Metric	Score
Macro F1	0.84
Accuracy	0.86

Per-Class Performance

Level	Precision	Recall	F1
A1	0.98	0.96	0.97
A2	0.89	0.82	0.85
B1	0.62	0.77	0.69
B2	0.76	0.71	0.73
C1	0.79	0.88	0.83
C2	1.00	0.95	0.97
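
As a quick consistency check, the per-class F1 values and the macro F1 can be recomputed from the precision and recall columns. The sketch below assumes the reported F1 is the usual harmonic mean of precision and recall and that macro F1 is the unweighted mean over the six levels; the card does not state either explicitly.

```python
# Per-class (precision, recall), copied from the table above.
per_class = {
    "A1": (0.98, 0.96),
    "A2": (0.89, 0.82),
    "B1": (0.62, 0.77),
    "B2": (0.76, 0.71),
    "C1": (0.79, 0.88),
    "C2": (1.00, 0.95),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_by_level = {level: f1(p, r) for level, (p, r) in per_class.items()}

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_by_level.values()) / len(f1_by_level)

for level, score in f1_by_level.items():
    print(f"{level}: {score:.2f}")
print(f"Macro F1: {macro_f1:.2f}")
```

Under those assumptions the recomputed values round to the figures in both tables.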

Usage

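from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="crklih/turkish-cefr-classifier"
)

# Classify one sentence ("The collapse of the Ottoman State is a multidimensional process.").
result = classifier("Osmanlı Devleti'nin çöküşü çok boyutlu bir süreçtir.")
print(result)
# Example output: [{'label': 'B2', 'score': 0.95}]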

CEFR Levels

Level	Description
A1	Beginner — basic words and simple phrases
A2	Elementary — simple sentences on everyday topics
B1	Intermediate — familiar topics, some complex structures
B2	Upper-Intermediate — complex sentences, abstract topics
C1	Advanced — academic/technical language, complex structures
C2	Proficient — near-native, nuanced, rare vocabulary

Training Data

Curated phrases of spoken Turkish, drawn from YouTube transcripts
Balanced across all six CEFR levels
LLM-human agreement rate: 74.6%
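
The card does not say how the agreement rate was computed. One common reading is raw percent agreement: the share of items on which the LLM-assigned and human-assigned labels match. The sketch below illustrates that reading only; the function name and the sample labels are hypothetical and are not taken from the training data.

```python
def agreement_rate(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of items on which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both label lists must have the same length")
    if not labels_a:
        raise ValueError("Cannot compute agreement on empty lists")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical labels for five phrases (illustration only).
llm_labels = ["A1", "B1", "B2", "C1", "A2"]
human_labels = ["A1", "B2", "B2", "C1", "A2"]

print(f"{agreement_rate(llm_labels, human_labels):.1%}")
```

Raw agreement does not correct for matches that occur by chance, so a chance-corrected statistic may be more informative when labels are unevenly distributed.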

Intended Use

Turkish language learning applications
Content difficulty assessment
Educational content filtering by proficiency level
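
Filtering content by proficiency level could look roughly like the sketch below. It assumes the classifier returns a list with one `{'label', 'score'}` dict per text, as in the Usage output, and that labels are exactly `A1` to `C2`. `filter_by_max_level` and `fake_classifier` are hypothetical names; the stub stands in for the real pipeline so the example runs without downloading the model.

```python
from typing import Callable

# CEFR levels from easiest to hardest.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def filter_by_max_level(
    texts: list[str],
    classify: Callable[[str], list[dict]],
    max_level: str,
) -> list[str]:
    """Keep only texts whose predicted level is at or below max_level."""
    max_rank = CEFR_ORDER.index(max_level)
    kept = []
    for text in texts:
        prediction = classify(text)[0]  # top prediction for this text
        if CEFR_ORDER.index(prediction["label"]) <= max_rank:
            kept.append(text)
    return kept


def fake_classifier(text: str) -> list[dict]:
    """Stand-in for the real pipeline: very short texts count as A1, the rest as B2."""
    label = "A1" if len(text.split()) <= 2 else "B2"
    return [{"label": label, "score": 1.0}]


samples = ["Merhaba!", "Osmanlı Devleti'nin çöküşü çok boyutlu bir süreçtir."]
print(filter_by_max_level(samples, fake_classifier, "A2"))
```

In real use, the `classifier` object from the Usage section would be passed in place of `fake_classifier`.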

Limitations

Trained on spoken Turkish, so it may underperform on formal written text
The B1/B2 boundary can be ambiguous: B1 (F1 0.69) and B2 (F1 0.73) have the lowest per-class scores
Short phrases with technical vocabulary may be rated below their actual level
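
One way to work around ambiguous boundaries is to flag predictions where the top two levels score close together and route them to manual review. The sketch below assumes the full score list is available as a flat list of `{'label', 'score'}` dicts; with the Transformers text-classification pipeline this is typically obtained by passing `top_k=None`. The `is_ambiguous` helper, the 0.2 margin, and the sample scores are illustrative assumptions, not values from this model.

```python
def is_ambiguous(scores: list[dict], min_margin: float = 0.2) -> bool:
    """Return True if the two highest-scoring labels are within min_margin."""
    ranked = sorted(scores, key=lambda item: item["score"], reverse=True)
    if len(ranked) < 2:
        return False
    return ranked[0]["score"] - ranked[1]["score"] < min_margin


# Hypothetical score lists (illustration only).
close_call = [
    {"label": "B1", "score": 0.48},
    {"label": "B2", "score": 0.44},
    {"label": "A2", "score": 0.08},
]
clear_call = [
    {"label": "A1", "score": 0.95},
    {"label": "A2", "score": 0.03},
    {"label": "B1", "score": 0.02},
]

print(is_ambiguous(close_call))
print(is_ambiguous(clear_call))
```

The margin is a tunable threshold; a sensible value would need to be chosen on held-out data.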
