Multilingual Educational Content Classifier
A multilingual classifier of educational content quality, trained on full documents of up to 8192 tokens. Training used the train split of tartuNLP/fineweb-c-combined-resample, itself a resampled mix of HuggingFaceFW/fineweb-edu-llama3-annotations and data-is-better-together/fineweb-c.
Labels
{0: '❗ Problematic Content ❗', 1: 'None', 2: 'Minimal', 3: 'Basic', 4: 'Good', 5: 'Excellent'}
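A minimal usage sketch with the Transformers `pipeline` API, assuming the Hub id tartuNLP/mmBERT-small-m-edu-classifier; the mapping mirrors the label dictionary above, and inputs are truncated to the 8192-token training length:

```python
# Label ids as listed in the model card.
ID2LABEL = {
    0: "❗ Problematic Content ❗",
    1: "None",
    2: "Minimal",
    3: "Basic",
    4: "Good",
    5: "Excellent",
}

def classify(texts):
    """Score documents for educational quality.

    `transformers` is imported lazily so the label mapping above stays
    usable even when the library is not installed.
    """
    from transformers import pipeline

    clf = pipeline(
        "text-classification",
        model="tartuNLP/mmBERT-small-m-edu-classifier",
        truncation=True,
        max_length=8192,  # matches the maximum training length
    )
    return clf(texts)
```

Calling `classify(["A short lesson on photosynthesis ..."])` returns one label/score pair per document.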
Classification Report
Evaluated on the development split of tartuNLP/fineweb-c-combined-resample, constructed so that each language appears at least once.
              precision    recall  f1-score   support

           0       0.89      0.78      0.83       602
           1       0.65      0.88      0.75       916
           2       0.41      0.29      0.34       345
           3       0.40      0.30      0.34       179
           4       0.53      0.15      0.23       127
           5       0.55      0.39      0.45        44

    accuracy                           0.66      2213
   macro avg       0.57      0.46      0.49      2213
weighted avg       0.65      0.66      0.64      2213
Confusion Matrix
[[471 114 10 6 0 1]
[ 33 806 59 13 5 0]
[ 10 204 101 28 2 0]
[ 7 72 37 53 8 2]
[ 7 35 27 28 19 11]
[ 2 7 10 6 2 17]]
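The headline figures in the report can be recovered directly from the confusion matrix; a small pure-Python check (rows taken as true labels and columns as predicted labels, the sklearn convention assumed here):

```python
# Confusion matrix copied from above (rows: true label, cols: predicted label).
CM = [
    [471, 114,  10,  6,  0,  1],
    [ 33, 806,  59, 13,  5,  0],
    [ 10, 204, 101, 28,  2,  0],
    [  7,  72,  37, 53,  8,  2],
    [  7,  35,  27, 28, 19, 11],
    [  2,   7,  10,  6,  2, 17],
]

total = sum(sum(row) for row in CM)                 # 2213 evaluation examples
correct = sum(CM[i][i] for i in range(len(CM)))     # diagonal = correct predictions
accuracy = correct / total                          # 1467 / 2213 ≈ 0.66

def precision_recall(cm, cls):
    """Per-class precision and recall from a confusion matrix."""
    predicted = sum(row[cls] for row in cm)  # everything predicted as `cls`
    support = sum(cm[cls])                   # everything truly `cls`
    tp = cm[cls][cls]
    return tp / predicted, tp / support

p0, r0 = precision_recall(CM, 0)  # ≈ 0.89 precision, 0.78 recall, as reported
```

The same arithmetic reproduces every per-class row of the classification report.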
Model: tartuNLP/mmBERT-small-m-edu-classifier
Base model: jhu-clsp/mmBERT-small