
BERnaT: Basque Encoders for Representing Natural Textual Diversity

Submitted to LREC 2026

Model Description

BERnaT is a family of monolingual Basque encoder-only language models trained to better represent linguistic variation—including standard, dialectal, historical, and informal Basque—rather than focusing solely on standard textual corpora. Models were trained on corpora that combine high-quality standard Basque with varied sources such as social media and historical texts, aiming to enhance robustness and generalization across natural language understanding (NLU) tasks.

  • Developed by: HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
  • Funded by: IKER-GAITU and ALIA projects (Basque and Spanish Governments)
  • License: Apache 2.0
  • Model Type: Encoder-only Transformer models (RoBERTa-style)
  • Languages: Basque (Euskara)

Getting Started

You can use this model directly for fill-mask prediction, as in the example below, or fine-tune it for your downstream task of interest (a fine-tuning sketch follows the example).

>>> from transformers import pipeline

>>> pipe = pipeline("fill-mask", model='HiTZ/BERnaT-base')

>>> pipe("Kaixo! Ni <mask> naiz!")
[{'score': 0.022003261372447014,
  'token': 7497,
  'token_str': ' euskalduna',
  'sequence': 'Kaixo! Ni euskalduna naiz!'},
 {'score': 0.016429167240858078,
  'token': 14067,
  'token_str': ' Olentzero',
  'sequence': 'Kaixo! Ni Olentzero naiz!'},
 {'score': 0.012804778292775154,
  'token': 31087,
  'token_str': ' ahobizi',
  'sequence': 'Kaixo! Ni ahobizi naiz!'},
 {'score': 0.01173020526766777,
  'token': 331,
  'token_str': ' ez',
  'sequence': 'Kaixo! Ni ez naiz!'},
 {'score': 0.010091394186019897,
  'token': 7618,
  'token_str': ' irakaslea',
  'sequence': 'Kaixo! Ni irakaslea naiz!'}]
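
As a sketch of the fine-tuning workflow, the snippet below adapts BERnaT-base to a text-classification task with the standard Hugging Face Trainer. The dataset name ("my_basque_dataset"), the label count, and the hyperparameters are illustrative placeholders, not the setup used in the paper.

# Minimal fine-tuning sketch for text classification with BERnaT-base.
# Dataset name, num_labels, and hyperparameters are placeholders; adapt to your task.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

model_name = "HiTZ/BERnaT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder dataset with "text" and "label" columns.
dataset = load_dataset("my_basque_dataset")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bernat-base-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
)
trainer.train()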

Training Data

The BERnaT family was pre-trained on a combination of:

  • Standard Basque corpora (e.g., Wikipedia, Egunkaria, EusCrawl).
  • Diverse corpora including Basque social media text and historical Basque books.
  • Combined corpora for the unified BERnaT models.

The training objective is masked language modeling (MLM) on encoder-only architectures, trained at medium (51M), base (124M), and large (355M) parameter sizes.
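
As a rough illustration of the MLM objective (not the authors' actual pretraining code), the snippet below uses the standard Transformers data collator to randomly mask tokens in a Basque sentence; the 15% masking probability is the common RoBERTa default and is assumed here.

# Illustrative sketch of masked language modeling data preparation.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("HiTZ/BERnaT-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

encoding = tokenizer("Kaixo! Ni euskalduna naiz!", return_tensors="pt")
batch = collator([{"input_ids": encoding["input_ids"][0]}])
print(batch["input_ids"])   # some tokens replaced by <mask>
print(batch["labels"])      # original ids at masked positions, -100 elsewhere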

Evaluation

Average scores on the standard-task and diverse-task evaluation sets, and the overall average:

Model            Size    AVG standard tasks  AVG diverse tasks  AVG overall
BERnaT_standard  medium  74.10               70.30               72.58
BERnaT_standard  base    75.33               71.26               73.70
BERnaT_standard  large   76.83               73.13               75.35
BERnaT_diverse   medium  71.66               69.91               70.96
BERnaT_diverse   base    72.44               71.43               72.04
BERnaT_diverse   large   74.48               71.87               73.43
BERnaT           medium  73.56               70.59               72.37
BERnaT           base    75.42               71.28               73.76
BERnaT           large   77.88               73.77               76.24

Acknowledgments

This work has been partially supported by the Basque Government (Research group funding IT1570-22 and IKER-GAITU project), the Spanish Ministry for Digital Transformation and Civil Service, and the EU-funded NextGenerationEU Recovery, Transformation and Resilience Plan (ILENIA project, 2022/TL22/00215335; and ALIA project). The project also received funding from the European Union's Horizon Europe research and innovation programme under Grant Agreement No 101135724, Topic HORIZON-CL4-2023-HUMAN-01-21, and from DeepKnowledge (PID2021-127777OB-C21), funded by MCIN/AEI/10.13039/501100011033 and FEDER. Jaione Bengoetxea, Julen Etxaniz and Ekhi Azurmendi hold a PhD grant from the Basque Government (PRE_2024_1_0028, PRE_2024_2_0028 and PRE_2024_1_0035, respectively). Maite Heredia and Mikel Zubillaga hold a PhD grant from the University of the Basque Country UPV/EHU (PIF23/218 and PIF24/04, respectively). The models were trained on the Leonardo supercomputer at CINECA under the EuroHPC Joint Undertaking, project EHPC-EXT-2024E01-042.

Citation

To cite our work, please use:

@misc{azurmendi2025bernatbasqueencodersrepresenting,
      title={BERnaT: Basque Encoders for Representing Natural Textual Diversity}, 
      author={Ekhi Azurmendi and Joseba Fernandez de Landa and Jaione Bengoetxea and Maite Heredia and Julen Etxaniz and Mikel Zubillaga and Ander Soraluze and Aitor Soroa},
      year={2025},
      eprint={2512.03903},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.03903}, 
}