---
language:
- ca
tags:
- roberta
- fill-mask
- catalan
license: apache-2.0
library_name: transformers
---
# RoBERTa-ca Model Card
RoBERTa-ca is a new foundational Catalan language model built on the [RoBERTa](https://huggingface.co/FacebookAI/roberta-base) architecture. It is obtained through vocabulary adaptation from [mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa): all weights are initialized from mRoBERTa, while the embedding matrix receives a specialized treatment that carefully handles the differences between the two tokenizers. The model is then continually pretrained on a Catalan-only corpus of 95GB of high-quality data.
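The exact embedding-transfer recipe is not spelled out in this card; the snippet below is only a hedged sketch of the general technique (copy embedding rows for tokens shared by both vocabularies, and build the remaining rows from the mRoBERTa sub-token embeddings of each new token), not the authors' implementation.
```python
# Hedged sketch of generic vocabulary adaptation between two tokenizers.
# It copies embedding rows for shared tokens and averages sub-token embeddings
# for tokens that only exist in the target vocabulary. This illustrates the
# general idea, not the exact RoBERTa-ca procedure.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

src_tok = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")
tgt_tok = AutoTokenizer.from_pretrained("BSC-LT/RoBERTa-ca")
src_model = AutoModelForMaskedLM.from_pretrained("BSC-LT/mRoBERTa")

src_emb = src_model.get_input_embeddings().weight.data  # (|V_src|, hidden_dim)
new_emb = torch.empty(len(tgt_tok), src_emb.size(1))

for token, tgt_id in tgt_tok.get_vocab().items():
    src_id = src_tok.convert_tokens_to_ids(token)
    if src_id is not None and src_id != src_tok.unk_token_id:
        new_emb[tgt_id] = src_emb[src_id]  # shared token: copy the row
    else:
        # new token: average the mRoBERTa embeddings of its sub-tokens
        surface = tgt_tok.convert_tokens_to_string([token])
        pieces = src_tok(surface, add_special_tokens=False)["input_ids"]
        new_emb[tgt_id] = src_emb[pieces].mean(dim=0) if pieces else src_emb.mean(dim=0)
```
In practice, a matrix built this way would replace the model's input (and tied output) embeddings before the continual pretraining on the Catalan corpus.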
## Technical Description
Technical details of the RoBERTa-ca model.
| Description | Value |
|-------------------------|:--------------|
| Model Parameters | 125M |
| Tokenizer Type | SPM |
| Vocabulary size | 50,304 |
| Precision | bfloat16 |
| Context length | 512 |
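These values can be cross-checked against the published configuration and tokenizer with standard `transformers` calls, for example:
```python
# Inspect the published configuration and tokenizer.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("BSC-LT/RoBERTa-ca")
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/RoBERTa-ca")

print("vocab size:", config.vocab_size, len(tokenizer))
# RoBERTa-style configs usually report the context length plus two special positions.
print("max position embeddings:", config.max_position_embeddings)
print("dtype:", getattr(config, "torch_dtype", None))
```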
Training Hyperparameters
| Hyperparameter | Value |
|------------------------- |:-------------- |
| Pretraining Objective | Masked Language Modeling |
| Learning Rate | 3E-05 |
| Learning Rate Scheduler | Cosine |
| Warmup Steps | 2425 |
| Optimizer | AdamW |
| Optimizer Hyperparameters | β1=0.9, β2=0.98, ε=1e-06 |
| Weight Decay | 1E-02 |
| Global Batch Size | 1024 |
| Dropout | 1E-01 |
| Attention Dropout | 1E-01 |
| Activation Function | GeLU |
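For reference, this configuration corresponds roughly to the following optimizer and scheduler setup (a hedged sketch using standard `torch`/`transformers` APIs, not the original training script; the total number of training steps is a placeholder, as it is not reported in this card):
```python
# Hedged sketch of the optimizer/scheduler setup implied by the table above.
import torch
from transformers import AutoModelForMaskedLM, get_cosine_schedule_with_warmup

NUM_TRAINING_STEPS = 100_000  # placeholder: the real value is not reported here

model = AutoModelForMaskedLM.from_pretrained("BSC-LT/RoBERTa-ca")
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,            # Learning Rate
    betas=(0.9, 0.98),  # β1, β2
    eps=1e-6,           # ε
    weight_decay=1e-2,  # Weight Decay
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2425,
    num_training_steps=NUM_TRAINING_STEPS,
)
```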
## How to use
```python
>>> from transformers import pipeline
>>> from pprint import pprint
>>> unmasker = pipeline('fill-mask', model='BSC-LT/RoBERTa-ca')
>>> pprint(unmasker("M'encanta la<mask>de Barcelona.",top_k=3))
[{'score': 0.6109828948974609,
'sequence': "M'encanta la ciutat de Barcelona.",
'token': 1125,
'token_str': 'ciutat'},
{'score': 0.04469362273812294,
'sequence': "M'encanta la platja de Barcelona.",
'token': 5404,
'token_str': 'platja'},
{'score': 0.02249019406735897,
'sequence': "M'encanta la gent de Barcelona.",
'token': 1261,
'token_str': 'gent'}]
>>> pprint(unmasker("Adoro menjar un bon plat de<mask>al costat de la platja.",top_k=3))
[{'score': 0.12922883033752441,
'sequence': 'Adoro menjar un bon plat de peix al costat de la platja.',
'token': 5802,
'token_str': 'peix'},
{'score': 0.12800152599811554,
'sequence': 'Adoro menjar un bon plat de carn al costat de la platja.',
'token': 6432,
'token_str': 'carn'},
{'score': 0.06676974892616272,
'sequence': 'Adoro menjar un bon plat de marisc al costat de la platja.',
'token': 31717,
'token_str': 'marisc'}]
>>> pprint(unmasker("Intento anar a la platja de<mask>cada any, és fantástica.",top_k=3))
[{'score': 0.06159511208534241,
'sequence': 'Intento anar a la platja de Pals cada any, és fantástica.',
'token': 28365,
'token_str': 'Pals'},
{'score': 0.04985760524868965,
'sequence': 'Intento anar a la platja de Calella cada any, és fantástica.',
'token': 11472,
'token_str': 'Calella'},
{'score': 0.048444587737321854,
'sequence': 'Intento anar a la platja de Lloret cada any, és fantástica.',
'token': 11420,
'token_str': 'Lloret'}]
```
This is equivalent to the following PyTorch snippet:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
model = AutoModelForMaskedLM.from_pretrained("BSC-LT/RoBERTa-ca")
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/RoBERTa-ca")
# The "<mask>" token sits at index -3: position -1 is the EOS token "</s>" and position -2 is the "." token.
outputs = model(**tokenizer("La capital d'Espanya és<mask>.", return_tensors="pt")).logits
predicted_token = tokenizer.decode(torch.argmax(outputs[0,-3,:]))
print(f"La predicció és \"{predicted_token}\"." ) # The prediction is "Madrid"
```
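If you prefer not to hard-code the `-3` position, the mask index can also be looked up via `tokenizer.mask_token_id` (a small variant of the snippet above):
```python
# Locate the <mask> position dynamically instead of hard-coding index -3.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("BSC-LT/RoBERTa-ca")
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/RoBERTa-ca")

inputs = tokenizer("La capital d'Espanya és<mask>.", return_tensors="pt")
logits = model(**inputs).logits
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_token = tokenizer.decode(torch.argmax(logits[0, mask_pos, :], dim=-1))
print(f'La predicció és "{predicted_token}".')
```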
In most of the evaluations presented below, the model is adapted to each task by adding a task-specific head on top of the encoder and fine-tuning it on the task's training data.
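As an illustration (not the official evaluation setup), attaching such a task-specific head is straightforward with `transformers`; the label count below is a hypothetical placeholder:
```python
# Minimal illustration of adding a task-specific classification head for fine-tuning.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BSC-LT/RoBERTa-ca")
model = AutoModelForSequenceClassification.from_pretrained(
    "BSC-LT/RoBERTa-ca",
    num_labels=3,  # hypothetical: set this to the label set of the target task
)
# The model can then be fine-tuned with the usual Trainer or a plain torch training loop.
```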
### EVALUATION: CLUB Benchmark
Model performance in Catalan is assessed with [CLUB (Catalan Language Understanding Benchmark)](https://github.com/projecte-aina/club/tree/main), which consists of 6 tasks: Named Entity Recognition (NER), Part-of-Speech Tagging (POS), Semantic Textual Similarity (STS), Text Classification (TC), Textual Entailment (TE), and Question Answering (QA).
The following foundational models have been considered for the comparison:
| Foundational Model | Number of Parameters | Vocab Size | Description |
|---------------------------------|----------------------|------------|-------------|
| [BERTa](https://huggingface.co/PlanTL-GOB-ES/roberta-base-ca) | 126M | 52K | BERTa is a Catalan-specific language model pretrained with Catalan-only data. |
| [BERTinho](https://huggingface.co/dvilares/bertinho-gl-base-cased) | 109M | 30K | BERTinho is a monolingual BERT model for the Galician language. |
| [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased) | 178M | 120K | Multilingual BERT model pretrained on the top 104 languages with the largest Wikipedia. |
| [mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa) | 283M | 256K | RoBERTa base model pretrained with 35 European languages and a larger vocabulary size. |
| [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) | 125M | 50K | RoBERTa base model pretrained with 570GB of data from web crawls performed by the National Library of Spain between 2009 and 2019. |
| [RoBERTa-ca](https://huggingface.co/BSC-LT/RoBERTa-ca) | 125M | 50K | RoBERTa-ca is a Catalan-specific language model obtained by using vocabulary adaptation from mRoBERTa. |
| [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | 279M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages. |
| [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) | 561M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages. |
In the results below, bold marks the best score per task and underline the second best.
<table>
<tr><th>tasks</th><th style=''>roberta-base-bne (125M)</th><th style=''>berta (126M)</th><th style=''>mBERT (178M)</th><th style=''>xlm-roberta-base (279M)</th><th style=''>xlm-roberta-large (561M)</th><th style=''>roberta-ca (125M)</th><th style=''>mRoBERTa (283M)</th></tr>
<tr><td>ner (F1)</td><td style=''>87.59</td><td style='text-decoration: underline;'>89.47</td><td style=''>85.89</td><td style=''>87.50</td><td style='text-decoration: underline;'>89.47</td><td style='font-weight: bold;'>89.70</td><td style=''>88.33</td></tr>
<tr><td>pos (F1)</td><td style=''>98.64</td><td style=''>98.89</td><td style=''>98.78</td><td style=''>98.91</td><td style='font-weight: bold;'>99.03</td><td style='text-decoration: underline;'>99.00</td><td style=''>98.98</td></tr>
<tr><td>sts (Pearson)</td><td style=''>74.27</td><td style=''>81.39</td><td style=''>77.05</td><td style=''>75.11</td><td style='font-weight: bold;'>83.49</td><td style='text-decoration: underline;'>82.99</td><td style=''>79.52</td></tr>
<tr><td>tc (Acc.)</td><td style='text-decoration: underline;'>73.86</td><td style=''>73.16</td><td style=''>72.00</td><td style=''>73.05</td><td style='font-weight: bold;'>74.10</td><td style=''>72.81</td><td style=''>72.41</td></tr>
<tr><td>te (Acc.)</td><td style=''>72.27</td><td style=''>80.11</td><td style=''>75.86</td><td style=''>78.27</td><td style='font-weight: bold;'>86.63</td><td style=''>82.14</td><td style='text-decoration: underline;'>82.38</td></tr>
<tr><td>viquiquad (F1)</td><td style=''>82.56</td><td style=''>86.74</td><td style=''>87.42</td><td style=''>86.81</td><td style='font-weight: bold;'>90.35</td><td style=''>87.31</td><td style='text-decoration: underline;'>87.86</td></tr>
<tr><td>xquad (F1)</td><td style=''>60.56</td><td style=''>67.38</td><td style=''>67.72</td><td style=''>68.56</td><td style='font-weight: bold;'>76.08</td><td style='text-decoration: underline;'>70.53</td><td style=''>69.40</td></tr>
</table>
## Additional information
### Author
The Language Technologies Lab at the Barcelona Supercomputing Center.
### Contact
For further information, please send an email to <[email protected]>.
### Copyright
Copyright (c) 2025 by the Language Technologies Lab, Barcelona Supercomputing Center.
### Funding
This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública, funded by the European Union (NextGenerationEU), within the framework of the project [ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
### Acknowledgements
This project has benefited from data contributions by numerous teams and institutions.
In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.
At the national level, we are especially grateful to our ILENIA project partners CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.
At the international level, we thank the Welsh government, DFKI, the Occiglot project (especially Malte Ostendorff) and the Common Crawl Foundation (especially Pedro Ortiz) for their collaboration.
Their valuable efforts have been instrumental in the development of this work.
### Disclaimer
Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence. The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)