---
license: mit
datasets:
- cerebras/SlimPajama-627B
language:
- en
metrics:
- accuracy
- f1
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
library_name: transformers
---

# Readability Rating Model

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This model is a fine-tuned version of ModernBERT-base designed to evaluate the **Readability** dimension of text quality on a 0-5 scale. Readability measures the ease with which a reader can understand a written text, considering factors such as clarity, coherence, vocabulary complexity, and sentence structure.

## Model Details

- **Base Model**: ModernBERT-base
- **Parameters**: 149M
- **Context Window**: 4,096 tokens
- **Task**: Text quality rating (regression)
- **Score Range**: 0-5 (continuous)
- **Performance**: 87.47% F1 score, 94.13% accuracy

## Rating Scale

The model uses an additive rating system on a 0-5 scale:

- **0**: Not readable at all
- **1**: Somewhat readable but contains significant clarity or coherence issues, complex vocabulary, or numerous errors
- **2**: Generally clear and coherent with occasional grammar, spelling errors, or convoluted structures
- **3**: Clear and coherent for the most part, using appropriate vocabulary with minor grammar/spelling issues
- **4**: Very clear and coherent with very few or no errors, proper punctuation, and easy-to-follow structures
- **5**: Outstanding clarity and coherence, effective communication with minimal errors that don't interfere with understanding

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "opendatalab/meta-rater-readability-rating"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "The weather today is sunny and warm. It's a perfect day for outdoor activities."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)

# The head is single-output regression, so the raw logit is the score;
# taking argmax over a one-element dimension would always return 0
score = outputs.logits.squeeze().item()
print(f"Readability Score: {score:.2f}")
```
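For scoring many documents, batching the forward passes is typically faster than one call per text. Below is a minimal sketch using the same checkpoint; the example `texts`, the batch size, and the padding settings are illustrative choices, not part of the original card:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "opendatalab/meta-rater-readability-rating"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Illustrative inputs; replace with your own documents
texts = [
    "The cat sat on the mat.",
    "Notwithstanding the aforementioned stipulations, the party of the first part...",
]

batch_size = 32  # arbitrary illustrative choice
scores = []
for i in range(0, len(texts), batch_size):
    batch = tokenizer(
        texts[i:i + batch_size],
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=4096,
    )
    with torch.no_grad():
        logits = model(**batch).logits  # shape (batch, 1) for the regression head
    scores.extend(logits.squeeze(-1).tolist())

for text, score in zip(texts, scores):
    print(f"{score:.2f}  {text[:60]}")
```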
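Since data curation is one of the applications listed below, here is a hedged sketch of how the scores might gate a pretraining corpus. The `score_fn` callable and the 3.0 cutoff are hypothetical, not values from the paper:

```python
from typing import Callable, List

def filter_by_readability(
    texts: List[str],
    score_fn: Callable[[List[str]], List[float]],
    threshold: float = 3.0,  # hypothetical cutoff, not from the paper
) -> List[str]:
    """Keep only documents whose readability score meets the cutoff.

    `score_fn` is assumed to return one 0-5 score per input text,
    e.g. a wrapper around the batched loop sketched above.
    """
    scores = score_fn(texts)
    return [t for t, s in zip(texts, scores) if s >= threshold]
```

Note that Meta-rater itself combines multiple quality dimensions for data selection, so a single-score cutoff like this is a simplification.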
## Training Details

- **Training Data**: 747,422 examples from the SlimPajama dataset
- **Annotation Model**: Llama-3.3-70B-Instruct
- **Training Epochs**: 10
- **Evaluation Split**: 93,428 test examples
- **Data Split**: 8:1:1 (train:dev:test)

## Applications

This model is particularly useful for:

- **Content editing** and proofreading assistance
- **Educational material** assessment for appropriate reading levels
- **Web content optimization** for user experience
- **Data curation** for language model training focused on well-written text (see the filtering sketch in the Usage section above)
- **Accessibility evaluation** for diverse reading audiences
- **Writing quality assessment** tools

## What the Model Evaluates

The model considers several linguistic factors:

- **Sentence structure** complexity and clarity
- **Vocabulary** appropriateness and accessibility
- **Grammar and spelling** accuracy
- **Text coherence** and logical flow
- **Punctuation** usage and effectiveness

## What the Model Does NOT Consider

- The specific language the text is written in
- The length of the text
- Usage of placeholders for data privacy or safety
- Content topic or subject matter

## Limitations

- Designed primarily for English text
- May not capture domain-specific readability requirements
- Performance may vary for highly technical or specialized content
- Should be used as one factor among others in comprehensive text quality assessment

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

This model is released under the same license as the base ModernBERT model.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.