---
pipeline_tag: fill-mask
language: lat
license: mit
tags:
- trimmed
library_name: transformers
base_model: jhu-clsp/mmBERT-base
base_model_relation: quantized
datasets:
- lbourdois/fineweb-2-trimming
---

# mmBERT-base-lat-16384

This model is a **59.96% smaller** version of [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base), optimized for the **Latin** language by reducing the vocabulary size with the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.

This trimmed model should perform similarly to the original model on Latin text while keeping only 16,384 tokens in its vocabulary and using much less memory. However, it may not perform well for other languages, because tokens not commonly used in Latin were removed from the vocabulary.

## Model Statistics

| Metric | Original | Trimmed | Reduction |
|--------|----------|---------|-----------|
| **Vocabulary size** | 256,000 tokens | 16,384 tokens | **93.60%** |
| **Model size** | 306,939,648 params | 122,914,560 params | **59.96%** |

![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/mmBERT-base-16384.png)

## Mining Dataset Statistics

- **Number of texts used for mining**: 200,000 texts
- **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)

## Usage

```python
from transformers import AutoModel, AutoTokenizer

model_name = "alphaedge-ai/mmBERT-base-lat-16384"

# AutoModel loads the bare encoder (hidden states only, no task head).
model = AutoModel.from_pretrained(model_name)
# The tokenizer ships with the trimmed 16,384-token vocabulary.
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

## Citations

#### mmBERT

```
@misc{marone2025mmbertmodernmultilingualencoder,
  title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
  author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
  year={2025},
  eprint={2509.06888},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.06888},
}
```

#### Trimming blog post

```
@misc{hf_blogpost_trimming,
  title={Introduction to Trimming},
  author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
  year={2026},
  url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
}
```
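
#### Checking the Model Statistics figures

As a quick sanity check on the Model Statistics table, the sketch below recomputes the reductions from the token and parameter counts listed there. It also divides the removed parameters by the removed tokens, which suggests that the savings come entirely from dropping one embedding row per removed token. The row width of 768 is derived from the table, not stated in this card, and the helper name `reduction_pct` is only illustrative.

```python
# Figures copied from the Model Statistics table.
ORIGINAL_VOCAB = 256_000
TRIMMED_VOCAB = 16_384
ORIGINAL_PARAMS = 306_939_648
TRIMMED_PARAMS = 122_914_560


def reduction_pct(before: int, after: int) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100 * (before - after) / before


vocab_reduction = reduction_pct(ORIGINAL_VOCAB, TRIMMED_VOCAB)
param_reduction = reduction_pct(ORIGINAL_PARAMS, TRIMMED_PARAMS)

removed_tokens = ORIGINAL_VOCAB - TRIMMED_VOCAB
removed_params = ORIGINAL_PARAMS - TRIMMED_PARAMS

# If trimming only removes embedding rows, this division is exact and the
# quotient is the width of one embedding row (an inference, not a card value).
params_per_removed_token = removed_params // removed_tokens

print(f"vocabulary reduction: {vocab_reduction:.2f}%")          # 93.60%
print(f"parameter reduction:  {param_reduction:.2f}%")          # 59.95%
print(f"params per removed token: {params_per_removed_token}")  # 768
```

Computed from the raw parameter counts, the model-size reduction comes out at about 59.95%, within 0.01 percentage points of the 59.96% reported in the table.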