---
language:
- cs
- da
- de
- en
- hu
- nl
- pl
metrics:
- perplexity
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: fill-mask
library_name: transformers
license: mit
new_version: ChrisBridges/xlm-r-malach-v4
tags:
- holocaust
- speech
- history
- historical
- war
---

# XLM-RoBERTa-Malach-v2

Version 2 of XLM-RoBERTa-large with continued pretraining on speech transcriptions of the [Visual History Archive](https://vha.usc.edu/home). Part 1 of the training data consists of ASR transcriptions; Part 2 is machine translated from Part 1 using [MADLAD-400-3B-MT](https://huggingface.co/google/madlad400-3b-mt).

**Possibly overfitting. Use the [new version](https://huggingface.co/ChrisBridges/xlm-r-malach-v4) instead.**

## Training Data

- **ASR data:** cs, de, en, nl
- **MT data:** cs, da, de, en, hu, nl, pl
- **Total tokens:** 4.9B
- **Training tokens:** 4.4B
- **Test tokens:** 490M

Danish, Hungarian, and Polish ASR data are not yet available. The same documents are used in all 7 languages, but their proportions in number of tokens may differ.

A random 10% split is used as the test dataset, preserving the language proportions of the training data. The test set is masked with 15% probability. Data preprocessing (reading, tokenization, concatenation, splitting, and masking of the test dataset) takes around 2.5 hours per language using 8 CPUs.

## Training Details

Hyperparameters are mostly replicated from \[1\], Appendix B: AdamW with eps=1e-6, beta1=0.9, beta2=0.98, weight decay 0.01, and learning rate 1e-4 with a linear schedule and linear warmup over the first 6% of training steps. The model is trained with dynamic masking on 8 A100s with per-device batch size 8 and 32 gradient accumulation steps (effective batch size 2048), for 1 epoch (33,622 steps) on an MLM objective.

Main differences from XLM-RoBERTa-large: AdamW instead of Adam, effective batch size 2048 instead of 8192, and 34k steps instead of 500k due to the smaller dataset. The learning rate is also smaller, since 4e-4 led to overfitting and increased the perplexity to 233.4169. This roughly aligns with \[2\] and \[3\], which also continue pretraining on small datasets. Training takes around 9 hours. A minimal sketch of this configuration with the Hugging Face `Trainer` is included at the end of this card.

## Evaluation

Since the model sees translations of evaluation samples during training, an additional domain-specific dataset has been prepared for unbiased evaluation. For this dataset, sentences were extracted from a NER dataset based on the [EHRI Online Editions](https://www.ehri-project.eu/ehri-online-editions/) in 9 languages, not including Danish \[4\]. It is split into two evaluation datasets, EHRI-6 (714k tokens) and EHRI-9 (877k tokens), the latter including 3 languages unseen during training.

- **Perplexity (42M token test set):** 3.0177 -> 2.2252
- **Perplexity (490M token test set):** 2.5179 -> 1.7540
- **Perplexity (EHRI-6):** 3.2081 -> 3.2472
- **Perplexity (EHRI-9):** 3.2696 -> 10.7375

Each pair shows the change from the XLM-RoBERTa-large checkpoint to this model. The larger test set is split from the same dataset used to train this model.

## References

\[1\] [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692)

\[2\] [Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks](https://aclanthology.org/2020.acl-main.740/)

\[3\] [The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings](https://arxiv.org/abs/2309.09783)

\[4\] [Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools](https://aclanthology.org/2024.htres-1.3/)
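
Since the card's `pipeline_tag` is `fill-mask`, the checkpoint can be queried directly with the fill-mask pipeline. This is a minimal sketch; the repo id `ChrisBridges/xlm-r-malach-v2` is assumed from the naming of the newer checkpoint and should be adjusted if it differs.

```python
from transformers import pipeline

# Repo id assumed from the "new_version" naming scheme; adjust if it differs.
fill_mask = pipeline("fill-mask", model="ChrisBridges/xlm-r-malach-v2")

# XLM-RoBERTa uses <mask> as its mask token.
print(fill_mask("The interview was recorded in <mask>."))
```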
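
For reference, the following is a minimal sketch of the hyperparameters listed under Training Details, expressed with the Hugging Face `Trainer`. It is not the original training script: the placeholder corpus, output directory, and sequence length are assumptions, and the real data are the VHA transcriptions and their machine translations.

```python
import math

from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

# Placeholder corpus; the real data are VHA transcriptions and MADLAD translations.
dataset = Dataset.from_dict(
    {"text": ["An example transcription sentence.", "Another example sentence."]}
)
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking: 15% of tokens are re-masked on every pass over the data.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlm-r-malach-sketch",   # assumed output path
    num_train_epochs=1,
    per_device_train_batch_size=8,      # 8 GPUs x 8 x 32 accumulation = 2048 effective
    gradient_accumulation_steps=32,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,                  # linear warmup over the first 6% of steps
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=dataset,
    eval_dataset=dataset,
)
trainer.train()

# Perplexity is the exponential of the mean masked-LM loss on the held-out split.
print(math.exp(trainer.evaluate()["eval_loss"]))
```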