---
license: mit
datasets:
  - cerebras/SlimPajama-627B
language:
  - en
metrics:
  - accuracy
  - f1
base_model:
  - answerdotai/ModernBERT-base
pipeline_tag: text-classification
library_name: transformers
---

# Readability Rating Model

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://arxiv.org/abs/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This model is a fine-tuned version of [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) that rates the Readability dimension of text quality on a 0-5 scale. Readability measures how easily a reader can understand a written text, considering factors such as clarity, coherence, vocabulary complexity, and sentence structure.

## Model Details

- **Base Model:** ModernBERT-base
- **Parameters:** 149M
- **Context Window:** 4,096 tokens
- **Task:** Text quality rating (regression)
- **Score Range:** 0-5 (continuous)
- **Performance:** 87.47% F1 score, 94.13% accuracy

## Rating Scale

The model uses an additive 5-point rating system:

- **0:** Not readable at all
- **1:** Somewhat readable but contains significant clarity or coherence issues, complex vocabulary, or numerous errors
- **2:** Generally clear and coherent, with occasional grammar or spelling errors, or convoluted structures
- **3:** Clear and coherent for the most part, using appropriate vocabulary, with minor grammar/spelling issues
- **4:** Very clear and coherent, with very few or no errors, proper punctuation, and easy-to-follow structures
- **5:** Outstanding clarity and coherence; communicates effectively, with minimal errors that don't interfere with understanding
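
If you want to report predictions in terms of these labels, a small helper can translate a score back into its description. This is a hypothetical sketch, not part of the official repository; `RATING_LABELS` and `describe_score` are illustrative names, and the labels paraphrase the list above:

```python
# Hypothetical helper (not from the official repo): map a predicted score
# to a short version of the rating descriptions above.
RATING_LABELS = {
    0: "Not readable at all",
    1: "Somewhat readable, significant clarity or coherence issues",
    2: "Generally clear and coherent, occasional errors",
    3: "Mostly clear and coherent, minor grammar/spelling issues",
    4: "Very clear and coherent, few or no errors",
    5: "Outstanding clarity and coherence",
}

def describe_score(score: float) -> str:
    """Clamp a (possibly continuous) score to the 0-5 range and label it."""
    level = min(5, max(0, round(score)))
    return f"{level} - {RATING_LABELS[level]}"

print(describe_score(3.6))  # 4 - Very clear and coherent, few or no errors
```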

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "opendatalab/meta-rater-readability-rating"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "The weather today is sunny and warm. It's a perfect day for outdoor activities."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)
    # logits has shape (1, num_labels); take the highest-scoring rating level
    score = outputs.logits.squeeze().argmax().item()

print(f"Readability Score: {score}")
```

## Training Details

- **Training Data:** 747,422 examples from the SlimPajama dataset
- **Annotation Model:** Llama-3.3-70B-Instruct
- **Training Epochs:** 10
- **Evaluation Split:** 93,428 test examples
- **Data Split:** 8:1:1 (train:dev:test); a splitting sketch follows this list
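
For reference, an 8:1:1 split is commonly produced with two successive `train_test_split` calls from the `datasets` library. The snippet below is illustrative only, with a toy stand-in dataset; the authors' exact procedure and seed are not documented here:

```python
from datasets import Dataset

# Toy stand-in for the annotated examples; the real data comes from SlimPajama.
ds = Dataset.from_dict({
    "text": [f"example {i}" for i in range(1000)],
    "score": [i % 6 for i in range(1000)],
})

# 8:1:1 -> carve off 20%, then halve that 20% into dev and test.
first = ds.train_test_split(test_size=0.2, seed=42)
halves = first["test"].train_test_split(test_size=0.5, seed=42)
train, dev, test = first["train"], halves["train"], halves["test"]
print(len(train), len(dev), len(test))  # 800 100 100
```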

## Applications

This model is particularly useful for:

- Content editing and proofreading assistance
- Educational material assessment for appropriate reading levels
- Web content optimization for user experience
- Data curation for language model training focused on well-written text (see the filtering sketch after this list)
- Accessibility evaluation for diverse reading audiences
- Writing quality assessment tools
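
For the data-curation use case, a typical pattern is to keep only documents whose predicted readability clears a threshold. This sketch reuses `tokenizer` and `model` from the Usage section; the threshold of 3 and the helper name `filter_by_readability` are arbitrary examples, not recommendations from the paper:

```python
def filter_by_readability(texts, threshold=3):
    """Keep texts whose predicted readability is >= threshold (illustrative)."""
    kept = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=4096)
        with torch.no_grad():
            score = model(**inputs).logits.squeeze().argmax().item()
        if score >= threshold:
            kept.append(text)
    return kept

corpus = [
    "Clear, simple prose that a general reader can follow.",
    "teh quick brwn fox jmps ovr teh lazy dg!!!",
]
print(filter_by_readability(corpus))
```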

## What the Model Evaluates

The model considers several linguistic factors:

- Sentence structure complexity and clarity
- Vocabulary appropriateness and accessibility
- Grammar and spelling accuracy
- Text coherence and logical flow
- Punctuation usage and effectiveness

## What the Model Does NOT Consider

- The specific language the text is written in
- The length of the text
- Usage of placeholders for data privacy or safety
- Content topic or subject matter

## Limitations

- Designed primarily for English text
- May not capture domain-specific readability requirements
- Performance may vary for highly technical or specialized content
- Should be used as one factor among others in comprehensive text quality assessment

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

This model is released under the same license as the base ModernBERT model.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.