---
language:
- cs
- da
- de
- en
- hu
- nl
- pl
metrics:
- perplexity
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: fill-mask
library_name: transformers
license: mit
new_version: ChrisBridges/xlm-r-malach-v4
tags:
- holocaust
- speech
- history
- historical
- war
---

# XLM-RoBERTa-Malach-v2

Version 2 of XLM-RoBERTa-large with continued pretraining on speech transcriptions of the [Visual History Archive](https://vha.usc.edu/home). Part 1 of the training data consists of ASR transcriptions; Part 2 is machine translated from Part 1 using [MADLAD-400-3B-MT](https://huggingface.co/google/madlad400-3b-mt).

**Possibly overfitting. Use the [new version](https://huggingface.co/ChrisBridges/xlm-r-malach-v4) instead.**

## Training Data

- **ASR data:** cs, de, en, nl
- **MT data:** cs, da, de, en, hu, nl, pl
- **Total tokens:** 4.9B
- **Training tokens:** 4.4B
- **Test tokens:** 490M

Danish, Hungarian, and Polish ASR data are not yet available. The same documents are used in all 7 languages, but their proportions in number of tokens may differ.

A random 10% split is used as the test dataset, preserving the language proportions of the training data. The test set is masked with 15% probability. Data preprocessing (reading, tokenization, concatenation, splitting, and masking of the test dataset) takes around 2.5 hours per language using 8 CPUs.

## Training Details

Hyperparameters are mostly replicated from \[1\], Appendix B: AdamW with eps=1e-6, beta1=0.9, beta2=0.98, weight decay 0.01, and learning rate 1e-4 with a linear schedule and linear warmup over the first 6% of training steps. The model is trained with dynamic masking on 8 A100s with per-device batch size 8 and 32 gradient accumulation steps (effective batch size 2048), for 1 epoch (33,622 steps) on an MLM objective.

Main differences from XLM-RoBERTa-large: AdamW instead of Adam, effective batch size 2048 instead of 8192, and 34k steps instead of 500k due to the smaller dataset. The learning rate is also smaller, since 4e-4 led to overfitting and increased the perplexity to 233.4169. This roughly aligns with \[2\] and \[3\], which also continue pretraining on small datasets. Training takes around 9 hours. A minimal sketch of this configuration with the Hugging Face `Trainer` is included at the end of this card.

## Evaluation

Since the model sees translations of evaluation samples during training, an additional domain-specific dataset has been prepared for unbiased evaluation. For this dataset, sentences were extracted from a NER dataset based on the [EHRI Online Editions](https://www.ehri-project.eu/ehri-online-editions/) in 9 languages, not including Danish \[4\]. It is split into two evaluation datasets, EHRI-6 (714k tokens) and EHRI-9 (877k tokens), the latter including 3 languages unseen during training.

- **Perplexity (42M token test set):** 3.0177 -> 2.2252
- **Perplexity (490M token test set):** 2.5179 -> 1.7540
- **Perplexity (EHRI-6):** 3.2081 -> 3.2472
- **Perplexity (EHRI-9):** 3.2696 -> 10.7375

Each pair shows the change from the XLM-RoBERTa-large checkpoint to this model. The larger test set is split from the same dataset used to train this model.

## References

\[1\] [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692)

\[2\] [Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks](https://aclanthology.org/2020.acl-main.740/)

\[3\] [The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings](https://arxiv.org/abs/2309.09783)

\[4\] [Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools](https://aclanthology.org/2024.htres-1.3/)
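
Since the card's `pipeline_tag` is `fill-mask`, the checkpoint can be queried directly with the fill-mask pipeline. This is a minimal sketch; the repo id `ChrisBridges/xlm-r-malach-v2` is assumed from the naming of the newer checkpoint and should be adjusted if it differs.

```python
from transformers import pipeline

# Repo id assumed from the "new_version" naming scheme; adjust if it differs.
fill_mask = pipeline("fill-mask", model="ChrisBridges/xlm-r-malach-v2")

# XLM-RoBERTa uses <mask> as its mask token.
print(fill_mask("The interview was recorded in <mask>."))
```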
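
For reference, the following is a minimal sketch of the hyperparameters listed under Training Details, expressed with the Hugging Face `Trainer`. It is not the original training script: the placeholder corpus, output directory, and sequence length are assumptions, and the real data are the VHA transcriptions and their machine translations.

```python
import math

from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

# Placeholder corpus; the real data are VHA transcriptions and MADLAD translations.
dataset = Dataset.from_dict(
    {"text": ["An example transcription sentence.", "Another example sentence."]}
)
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking: 15% of tokens are re-masked on every pass over the data.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlm-r-malach-sketch",   # assumed output path
    num_train_epochs=1,
    per_device_train_batch_size=8,      # 8 GPUs x 8 x 32 accumulation = 2048 effective
    gradient_accumulation_steps=32,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,                  # linear warmup over the first 6% of steps
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=dataset,
    eval_dataset=dataset,
)
trainer.train()

# Perplexity is the exponential of the mean masked-LM loss on the held-out split.
print(math.exp(trainer.evaluate()["eval_loss"]))
```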