Multilingual Educational Content Classifier
A multilingual classifier of educational content quality, trained on full documents of up to 8192 tokens. Training used the train split of tartuNLP/fineweb-c-combined-resample, itself a resampled mix of HuggingFaceFW/fineweb-edu-llama3-annotations and data-is-better-together/fineweb-c.
Labels
{0: '❗ Problematic Content ❗', 1: 'None', 2: 'Minimal', 3: 'Basic', 4: 'Good', 5: 'Excellent'}
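A minimal usage sketch with the Transformers `pipeline` API, assuming the Hub id tartuNLP/mmBERT-small-m-edu-classifier; the mapping mirrors the label dictionary above, and inputs are truncated to the 8192-token training length:

```python
# Label ids as listed in the model card.
ID2LABEL = {
    0: "❗ Problematic Content ❗",
    1: "None",
    2: "Minimal",
    3: "Basic",
    4: "Good",
    5: "Excellent",
}

def classify(texts):
    """Score documents for educational quality.

    `transformers` is imported lazily so the label mapping above stays
    usable even when the library is not installed.
    """
    from transformers import pipeline

    clf = pipeline(
        "text-classification",
        model="tartuNLP/mmBERT-small-m-edu-classifier",
        truncation=True,
        max_length=8192,  # matches the maximum training length
    )
    return clf(texts)
```

Calling `classify(["A short lesson on photosynthesis ..."])` returns one label/score pair per document.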
Classification Report
Evaluated on the development split of tartuNLP/fineweb-c-combined-resample, constructed so that each language appears at least once.
              precision    recall  f1-score   support

           0       0.89      0.78      0.83       602
           1       0.65      0.88      0.75       916
           2       0.41      0.29      0.34       345
           3       0.40      0.30      0.34       179
           4       0.53      0.15      0.23       127
           5       0.55      0.39      0.45        44

    accuracy                           0.66      2213
   macro avg       0.57      0.46      0.49      2213
weighted avg       0.65      0.66      0.64      2213
Confusion Matrix
[[471 114 10 6 0 1]
[ 33 806 59 13 5 0]
[ 10 204 101 28 2 0]
[ 7 72 37 53 8 2]
[ 7 35 27 28 19 11]
[ 2 7 10 6 2 17]]
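The headline figures in the report can be recovered directly from the confusion matrix; a small pure-Python check (rows taken as true labels and columns as predicted labels, the sklearn convention assumed here):

```python
# Confusion matrix copied from above (rows: true label, cols: predicted label).
CM = [
    [471, 114,  10,  6,  0,  1],
    [ 33, 806,  59, 13,  5,  0],
    [ 10, 204, 101, 28,  2,  0],
    [  7,  72,  37, 53,  8,  2],
    [  7,  35,  27, 28, 19, 11],
    [  2,   7,  10,  6,  2, 17],
]

total = sum(sum(row) for row in CM)                 # 2213 evaluation examples
correct = sum(CM[i][i] for i in range(len(CM)))     # diagonal = correct predictions
accuracy = correct / total                          # 1467 / 2213 ≈ 0.66

def precision_recall(cm, cls):
    """Per-class precision and recall from a confusion matrix."""
    predicted = sum(row[cls] for row in cm)  # everything predicted as `cls`
    support = sum(cm[cls])                   # everything truly `cls`
    tp = cm[cls][cls]
    return tp / predicted, tp / support

p0, r0 = precision_recall(CM, 0)  # ≈ 0.89 precision, 0.78 recall, as reported
```

The same arithmetic reproduces every per-class row of the classification report.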
Model: tartuNLP/mmBERT-small-m-edu-classifier
Base model: jhu-clsp/mmBERT-small