---
pipeline_tag: fill-mask
language: lat
license: apache-2.0
tags:
- trimmed
library_name: transformers
base_model: google/mt5-base
base_model_relation: quantized
datasets:
- lbourdois/fineweb-2-trimming
---

# mt5-base-lat-32768

This model is a **57.32% smaller** version of [google/mt5-base](https://huggingface.co/google/mt5-base), optimized for **Latin** by reducing the vocabulary size with the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.

This trimmed model should perform similarly to the original model on Latin text while keeping only 32,768 tokens and using a much smaller memory footprint. However, it may not perform well on other languages, because tokens not commonly used in the selected language were removed from the vocabulary.

## Model Statistics

| Metric | Original | Trimmed | Reduction |
|--------|----------|---------|-----------|
| **Vocabulary size** | 250,112 tokens | 32,768 tokens | **86.90%** |
| **Model size** | 582,401,280 params | 248,560,896 params | **57.32%** |

![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/mt5-base-32768.png)

## Mining Dataset Statistics

- **Number of texts used for mining**: 200,000 texts
- **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)

## Usage

```python
from transformers import AutoModel, AutoTokenizer

model_name = "alphaedge-ai/mt5-base-lat-32768"

# AutoModel loads the bare encoder-decoder, without a language-modeling head.
model = AutoModel.from_pretrained(model_name)
# The tokenizer shipped with this repository uses the trimmed 32,768-token vocabulary.
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

## Citations

#### mT5

```
@misc{xue2021mt5massivelymultilingualpretrained,
  title={mT5: A massively multilingual pre-trained text-to-text transformer},
  author={Linting Xue and Noah Constant and Adam Roberts and Mihir Kale and Rami Al-Rfou and Aditya Siddhant and Aditya Barua and Colin Raffel},
  year={2021},
  eprint={2010.11934},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2010.11934},
}
```

#### Trimming blog post

```
@misc{hf_blogpost_trimming,
  title={Introduction to Trimming},
  author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
  year={2026},
  url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
}
```
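
## Checking the reported reductions

The percentages in the Model Statistics table can be reproduced from the raw counts, and the parameter drop can be related to the vocabulary cut. The sketch below is only an illustration, not part of the trimming tooling. It assumes a hidden size of 768 for mt5-base and two vocabulary-sized matrices (separate input embeddings and output projection); neither value is stated in this card, and `reduction_percent` is a helper name introduced here. Under those assumptions, the removed parameters match the table exactly.

```python
# Figures taken from the Model Statistics table.
ORIGINAL_VOCAB = 250_112
TRIMMED_VOCAB = 32_768
ORIGINAL_PARAMS = 582_401_280
TRIMMED_PARAMS = 248_560_896

# Assumptions, not stated in this card: hidden size of mt5-base and
# the number of matrices whose row count equals the vocabulary size
# (input embeddings plus an untied output projection).
D_MODEL = 768
VOCAB_SIZED_MATRICES = 2


def reduction_percent(original: int, trimmed: int) -> float:
    """Return the relative reduction from original to trimmed, in percent."""
    return 100 * (original - trimmed) / original


removed_tokens = ORIGINAL_VOCAB - TRIMMED_VOCAB
# Each removed token drops one row of D_MODEL weights from every
# vocabulary-sized matrix.
removed_params = removed_tokens * D_MODEL * VOCAB_SIZED_MATRICES

print(f"Vocabulary reduction: {reduction_percent(ORIGINAL_VOCAB, TRIMMED_VOCAB):.2f}%")
print(f"Parameter reduction:  {reduction_percent(ORIGINAL_PARAMS, TRIMMED_PARAMS):.2f}%")
print(f"Parameters removed:   {removed_params:,}")
```

Because only vocabulary-sized matrices shrink, the rest of the network keeps its original weights, which is consistent with the expectation that Latin performance stays close to the original model.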