Tags: Text Classification, Transformers, Safetensors, modernbert

Multilingual Educational Content Classifier

The model was trained on full documents of up to 8192 tokens each. The training data is the train split of tartuNLP/fineweb-c-combined-resample, which is itself a mix and resample of HuggingFaceFW/fineweb-edu-llama3-annotations and data-is-better-together/fineweb-c.

Labels

{0: '❗ Problematic Content ❗', 1: 'None', 2: 'Minimal', 3: 'Basic', 4: 'Good', 5: 'Excellent'}
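
The mapping above can be applied to model outputs roughly as follows. This is a minimal sketch, not the authors' inference code: it assumes the checkpoint loads through the Transformers Auto classes and that the classification head emits one logit per label. The helper names `label_from_logits` and `classify` are hypothetical.

```python
# Label names copied from the mapping above.
ID2LABEL = {
    0: "❗ Problematic Content ❗",
    1: "None",
    2: "Minimal",
    3: "Basic",
    4: "Good",
    5: "Excellent",
}


def label_from_logits(logits):
    """Return (class id, label name) for the highest-scoring class."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return best, ID2LABEL[best]


def classify(text, model_id="tartuNLP/mmBERT-small-m-edu-classifier"):
    """Score one document (assumption: a 6-way sequence classification head)."""
    # Imported lazily so the helper above works without these packages installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    # Truncate to the 8192-token document length used in training.
    inputs = tokenizer(text, truncation=True, max_length=8192, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return label_from_logits(logits)
```

Calling `classify("some document text")` downloads the checkpoint on first use and returns a `(class id, label name)` pair.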

Classification Report

Evaluated on the development set of tartuNLP/fineweb-c-combined-resample, which is organized so that each language appears at least once.

              precision    recall  f1-score   support

           0       0.89      0.78      0.83       602
           1       0.65      0.88      0.75       916
           2       0.41      0.29      0.34       345
           3       0.40      0.30      0.34       179
           4       0.53      0.15      0.23       127
           5       0.55      0.39      0.45        44

    accuracy                           0.66      2213
   macro avg       0.57      0.46      0.49      2213
weighted avg       0.65      0.66      0.64      2213

Confusion Matrix

[[471 114  10   6   0   1]
 [ 33 806  59  13   5   0]
 [ 10 204 101  28   2   0]
 [  7  72  37  53   8   2]
 [  7  35  27  28  19  11]
 [  2   7  10   6   2  17]]
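
The per-class figures in the classification report can be recomputed from this matrix. The sketch below assumes the usual scikit-learn orientation (rows are true labels, columns are predicted labels), which is consistent with the row sums matching the reported support values. The function names are illustrative.

```python
# Confusion matrix copied from above: rows = true label, columns = predicted label.
CONFUSION = [
    [471, 114, 10, 6, 0, 1],
    [33, 806, 59, 13, 5, 0],
    [10, 204, 101, 28, 2, 0],
    [7, 72, 37, 53, 8, 2],
    [7, 35, 27, 28, 19, 11],
    [2, 7, 10, 6, 2, 17],
]


def per_class_metrics(cm):
    """Return a list of (precision, recall, f1, support), one tuple per class."""
    n = len(cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                          # all examples whose true label is k
        predicted = sum(cm[i][k] for i in range(n))   # all examples predicted as k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((precision, recall, f1, support))
    return rows


def accuracy(cm):
    """Fraction of examples on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


if __name__ == "__main__":
    for k, (p, r, f, s) in enumerate(per_class_metrics(CONFUSION)):
        print(f"{k}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
    print(f"accuracy={accuracy(CONFUSION):.2f}")
```

Most errors fall close to the diagonal or in the column for label 1, which reflects the low recall reported for labels 2 through 4.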

Model size: 0.1B params (tensor type BF16, Safetensors format)

Model: tartuNLP/mmBERT-small-m-edu-classifier