---
license: mit
datasets:
- cerebras/SlimPajama-627B
language:
- en
metrics:
- accuracy
- f1
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
library_name: transformers
---

# Readability Rating Model

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This model is a fine-tuned version of ModernBERT-base designed to evaluate the **Readability** dimension of text quality on a 0-5 scale. Readability measures the ease with which a reader can understand a written text, considering factors such as clarity, coherence, vocabulary complexity, and sentence structure.

## Model Details

- **Base Model**: ModernBERT-base
- **Parameters**: 149M
- **Context Window**: 4,096 tokens
- **Task**: Text quality rating (regression)
- **Score Range**: 0-5 (continuous)
- **Performance**: 87.47% F1 score, 94.13% accuracy

## Rating Scale

The model uses an additive rating system on a 0-5 scale:

- **0**: Not readable at all
- **1**: Somewhat readable but contains significant clarity or coherence issues, complex vocabulary, or numerous errors
- **2**: Generally clear and coherent with occasional grammar, spelling errors, or convoluted structures
- **3**: Clear and coherent for the most part, using appropriate vocabulary with minor grammar/spelling issues
- **4**: Very clear and coherent with very few or no errors, proper punctuation, and easy-to-follow structures
- **5**: Outstanding clarity and coherence, effective communication with minimal errors that don't interfere with understanding

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "opendatalab/meta-rater-readability-rating"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "The weather today is sunny and warm. It's a perfect day for outdoor activities."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)

# The head is single-output regression, so the raw logit is the score;
# taking argmax over a one-element dimension would always return 0
score = outputs.logits.squeeze().item()
print(f"Readability Score: {score:.2f}")
```
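For scoring many documents, batching the forward passes is typically faster than one call per text. Below is a minimal sketch using the same checkpoint; the example `texts`, the batch size, and the padding settings are illustrative choices, not part of the original card:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "opendatalab/meta-rater-readability-rating"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Illustrative inputs; replace with your own documents
texts = [
    "The cat sat on the mat.",
    "Notwithstanding the aforementioned stipulations, the party of the first part...",
]

batch_size = 32  # arbitrary illustrative choice
scores = []
for i in range(0, len(texts), batch_size):
    batch = tokenizer(
        texts[i:i + batch_size],
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=4096,
    )
    with torch.no_grad():
        logits = model(**batch).logits  # shape (batch, 1) for the regression head
    scores.extend(logits.squeeze(-1).tolist())

for text, score in zip(texts, scores):
    print(f"{score:.2f}  {text[:60]}")
```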
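Since data curation is one of the applications listed below, here is a hedged sketch of how the scores might gate a pretraining corpus. The `score_fn` callable and the 3.0 cutoff are hypothetical, not values from the paper:

```python
from typing import Callable, List

def filter_by_readability(
    texts: List[str],
    score_fn: Callable[[List[str]], List[float]],
    threshold: float = 3.0,  # hypothetical cutoff, not from the paper
) -> List[str]:
    """Keep only documents whose readability score meets the cutoff.

    `score_fn` is assumed to return one 0-5 score per input text,
    e.g. a wrapper around the batched loop sketched above.
    """
    scores = score_fn(texts)
    return [t for t, s in zip(texts, scores) if s >= threshold]
```

Note that Meta-rater itself combines multiple quality dimensions for data selection, so a single-score cutoff like this is a simplification.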
## Training Details

- **Training Data**: 747,422 examples from the SlimPajama dataset
- **Annotation Model**: Llama-3.3-70B-Instruct
- **Training Epochs**: 10
- **Evaluation Split**: 93,428 test examples
- **Data Split**: 8:1:1 (train:dev:test)

## Applications

This model is particularly useful for:

- **Content editing** and proofreading assistance
- **Educational material** assessment for appropriate reading levels
- **Web content optimization** for user experience
- **Data curation** for language model training focused on well-written text (see the filtering sketch in the Usage section above)
- **Accessibility evaluation** for diverse reading audiences
- **Writing quality assessment** tools

## What the Model Evaluates

The model considers several linguistic factors:

- **Sentence structure** complexity and clarity
- **Vocabulary** appropriateness and accessibility
- **Grammar and spelling** accuracy
- **Text coherence** and logical flow
- **Punctuation** usage and effectiveness

## What the Model Does NOT Consider

- The specific language the text is written in
- The length of the text
- Usage of placeholders for data privacy or safety
- Content topic or subject matter

## Limitations

- Designed primarily for English text
- May not capture domain-specific readability requirements
- Performance may vary for highly technical or specialized content
- Should be used as one factor among others in comprehensive text quality assessment

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

This model is released under the same license as the base ModernBERT model.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.