AraBERTv2+D3Tok+Reg Readability Model

Model description

AraBERTv2+D3Tok+Reg is a readability assessment model built by fine-tuning AraBERTv2 as a regression model (Reg) with Mean Squared Error loss. It was fine-tuned on the D3Tok input variant of BAREC-Corpus-v1.0. The fine-tuning procedure and hyperparameters are described in our paper "A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment."

Intended uses

You can use the AraBERTv2+D3Tok+Reg model through the transformers pipeline. The model expects D3Tok input, so first preprocess your text into the D3Tok input variant using the preprocessing step here.

How to use

To use the model:

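from transformers import pipeline
# Load the fine-tuned regression model through the text-classification pipeline.
readability = pipeline("text-classification", model="CAMeL-Lab/readability-arabertv2-d3tok-reg")
# Read one D3Tok-preprocessed sentence per line; splitlines() avoids a trailing empty entry.
with open("/PATH/TO/preprocessed_d3tok", "r", encoding="utf-8") as f:
    sentences = f.read().splitlines()
# function_to_apply="none" keeps the raw regression output as the score.
results = readability(sentences, function_to_apply="none")
# Shift each score by 0.5, round it, and enforce a minimum level of 1.
readability_levels = [max(round(result['score']+0.5),1) for result in results]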
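
The score-to-level conversion in the last line above can be pulled into a small helper so it can be checked without loading the model. This is a minimal sketch, not part of the released code: the name `score_to_level` and the optional `max_level` cap are hypothetical additions, and with `max_level=None` the helper computes the same value as the expression above.

```python
from typing import Optional


def score_to_level(score: float, min_level: int = 1,
                   max_level: Optional[int] = None) -> int:
    """Convert a raw regression score into an integer readability level.

    Mirrors max(round(score + 0.5), 1) from the snippet above.
    max_level is a hypothetical extra: set it only if you want to cap
    predictions at the top of your level scale.
    """
    level = max(round(score + 0.5), min_level)
    if max_level is not None:
        level = min(level, max_level)
    return level


# Example with made-up scores (not real model output).
example_scores = [-3.2, 0.2, 4.3, 11.7]
example_levels = [score_to_level(s) for s in example_scores]
```

Note that Python's built-in `round` rounds exact ties to the nearest even integer, so a score that is exactly a whole number may map to either of the two neighboring levels.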

Citation

@inproceedings{elmadani-etal-2025-readability,
    title = "A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment",
    author = "Elmadani, Khalid N.  and
      Habash, Nizar  and
      Taha-Thomure, Hanada",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics"
}