|
|
--- |
|
|
language: |
|
|
- ku |
|
|
- ckb |
|
|
inference: false |
|
|
tags: |
|
|
- T5 |
|
|
- t5 |
|
|
- HPLT |
|
|
- encoder-decoder |
|
|
- text2text-generation |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- HPLT/HPLT3.0 |
|
|
--- |
|
|
|
|
|
# HPLT v3.0 T5 for Central Kurdish (Kurdish, Central) |
|
|
|
|
|
<img src="https://hplt-project.org/_next/static/media/logo-hplt.d5e16ca5.svg" width=12.5%> |
|
|
|
|
|
This is one of the monolingual encoder-decoder language models trained for the third release of the [HPLT project](https://hplt-project.org/).
|
|
It is a text-to-text transformer trained with a denoising objective. Our |
|
|
models follow the setup of [NorT5](https://aclanthology.org/2023.nodalida-1.61/). |
|
|
|
|
|
We present monolingual T5 models for 57 of the 198 languages in the [HPLT v3.0 dataset](https://hplt-project.org/datasets/v3.0).
|
|
|
|
|
All the HPLT encoder-decoder models use the same hyper-parameters, roughly following the T5-base setup: |
|
|
- hidden size: 768 |
|
|
- attention heads: 12 |
|
|
- layers: 12 |
|
|
- vocabulary size: 32768 |
|
|
|
|
|
Every model uses its own tokenizer trained on language-specific HPLT data. |
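
To double-check these settings for a particular language, you can load the published configuration and tokenizer directly. The snippet below is a minimal sketch rather than part of the original card; it assumes the `HPLT/hplt_t5_base_3_0_ckb_Arab` repository used elsewhere on this page.

```python
from transformers import AutoConfig, AutoTokenizer

model_path = "HPLT/hplt_t5_base_3_0_ckb_Arab"

# trust_remote_code=True is required because the configuration class ships with the model.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

print(config)          # hidden size, attention heads, layers, vocabulary size, ...
print(len(tokenizer))  # size of the language-specific vocabulary (32768)
```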
|
|
|
|
|
The training code is available in the [HPLT-WP4 repository](https://github.com/hplt-project/HPLT-WP4).
|
|
|
|
|
## Example usage |
|
|
|
|
|
This model currently needs a custom wrapper from `modeling_nort5.py`, so you should load it with `trust_remote_code=True`. Start by installing the pinned version of `transformers`:
|
|
|
|
|
``` |
|
|
pip install transformers==4.46.1 |
|
|
``` |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = 'HPLT/hplt_t5_base_3_0_ckb_Arab'
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_path, trust_remote_code=True, use_safetensors=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# MASKED LANGUAGE MODELING
# The illustrative sentence below is Norwegian; with this Central Kurdish model,
# use a Central Kurdish sentence containing a [MASK_1] span instead.
sentence = "Ansiktsuttrykket duckface har [MASK_1] seg til et utbredt kulturelt fenomen."
encoding = tokenizer(sentence, return_tensors="pt")

# The decoder starts generating at [MASK_1] and stops once it emits [MASK_2],
# so the generated sequence is the predicted content of the masked span.
mask_1 = tokenizer.convert_tokens_to_ids("[MASK_1]")
mask_2 = tokenizer.convert_tokens_to_ids("[MASK_2]")
output_tensor = model.generate(
    encoding.input_ids,
    decoder_start_token_id=mask_1,
    eos_token_id=mask_2,
)
print(tokenizer.decode(output_tensor.squeeze(), skip_special_tokens=False))
# With the Norwegian sibling model ('HPLT/hplt_t5_base_3_0_nob_Latn'),
# this prints: '[MASK_1]utviklet[MASK_2]'
``` |
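
If you want the fully filled-in sentence rather than just the masked span, you can splice the generated text back into the input. This short sketch is not part of the original card; it continues from the variables defined in the example above and assumes a single `[MASK_1]` span.

```python
# Continues from the example above: `sentence`, `tokenizer` and `output_tensor` are already defined.
generated = tokenizer.decode(output_tensor.squeeze(), skip_special_tokens=False)

# The generated string has the form '[MASK_1]<predicted span>[MASK_2]';
# extract the span and substitute it for the sentinel in the input sentence.
span = generated.split("[MASK_1]")[-1].split("[MASK_2]")[0].strip()
print(sentence.replace("[MASK_1]", span))
```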
|
|
|
|
|
## Intermediate checkpoints |
|
|
|
|
|
We release 10 intermediate checkpoints for each model, saved every 3125 training steps and stored in separate branches. The naming convention is `stepXXX`, for example `step18750`.
|
|
|
|
|
You can load a specific model revision with `transformers` using the argument `revision`: |
|
|
```python |
|
|
model = AutoModelForSeq2SeqLM.from_pretrained("HPLT/hplt_t5_base_3_0_ckb_Arab", revision="step21875", trust_remote_code=True) |
|
|
``` |
|
|
|
|
|
You can list all available revisions of the model with the following code:
|
|
```python |
|
|
from huggingface_hub import list_repo_refs |
|
|
out = list_repo_refs("HPLT/hplt_t5_base_3_0_ckb_Arab") |
|
|
print([b.name for b in out.branches]) |
|
|
``` |
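
To track how the model evolves over training, the two snippets above can be combined into a loop over all intermediate checkpoints. The sketch below is an illustration rather than part of the original card; it relies on the `stepXXX` branch-naming convention described above, and the parameter count is only a placeholder for whatever per-checkpoint analysis you actually run.

```python
from huggingface_hub import list_repo_refs
from transformers import AutoModelForSeq2SeqLM

repo = "HPLT/hplt_t5_base_3_0_ckb_Arab"

# Collect the intermediate-checkpoint branches and sort them by training step.
branches = [b.name for b in list_repo_refs(repo).branches if b.name.startswith("step")]
branches.sort(key=lambda name: int(name[len("step"):]))

for revision in branches:
    model = AutoModelForSeq2SeqLM.from_pretrained(
        repo, revision=revision, trust_remote_code=True, use_safetensors=False,  # as in the loading example above
    )
    # ... evaluate or inspect this checkpoint here ...
    print(revision, sum(p.numel() for p in model.parameters()))
```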
|
|
|
|
|
## Cite us |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{samuel-etal-2023-norbench, |
|
|
title = "{N}or{B}ench {--} A Benchmark for {N}orwegian Language Models", |
|
|
author = "Samuel, David and |
|
|
Kutuzov, Andrey and |
|
|
Touileb, Samia and |
|
|
Velldal, Erik and |
|
|
{\O}vrelid, Lilja and |
|
|
R{\o}nningstad, Egil and |
|
|
Sigdel, Elina and |
|
|
Palatkina, Anna", |
|
|
editor = {Alum{\"a}e, Tanel and |
|
|
Fishel, Mark}, |
|
|
booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)", |
|
|
month = may, |
|
|
year = "2023", |
|
|
address = "T{\'o}rshavn, Faroe Islands", |
|
|
publisher = "University of Tartu Library", |
|
|
url = "https://aclanthology.org/2023.nodalida-1.61/", |
|
|
pages = "618--633", |
|
|
abstract = "We present NorBench: a streamlined suite of NLP tasks and probes for evaluating Norwegian language models (LMs) on standardized data splits and evaluation metrics. We also introduce a range of new Norwegian language models (both encoder and encoder-decoder based). Finally, we compare and analyze their performance, along with other existing LMs, across the different benchmark tests of NorBench." |
|
|
} |
|
|
``` |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{burchell-etal-2025-expanded, |
|
|
title = "An Expanded Massive Multilingual Dataset for High-Performance Language Technologies ({HPLT})", |
|
|
author = {Burchell, Laurie and |
|
|
de Gibert, Ona and |
|
|
Arefyev, Nikolay and |
|
|
Aulamo, Mikko and |
|
|
Ba{\~n}{\'o}n, Marta and |
|
|
Chen, Pinzhen and |
|
|
Fedorova, Mariia and |
|
|
Guillou, Liane and |
|
|
Haddow, Barry and |
|
|
Haji{\v{c}}, Jan and |
|
|
Helcl, Jind{\v{r}}ich and |
|
|
Henriksson, Erik and |
|
|
Klimaszewski, Mateusz and |
|
|
Komulainen, Ville and |
|
|
Kutuzov, Andrey and |
|
|
Kyt{\"o}niemi, Joona and |
|
|
Laippala, Veronika and |
|
|
M{\ae}hlum, Petter and |
|
|
Malik, Bhavitvya and |
|
|
Mehryary, Farrokh and |
|
|
Mikhailov, Vladislav and |
|
|
Moghe, Nikita and |
|
|
Myntti, Amanda and |
|
|
O{'}Brien, Dayy{\'a}n and |
|
|
Oepen, Stephan and |
|
|
Pal, Proyag and |
|
|
Piha, Jousia and |
|
|
Pyysalo, Sampo and |
|
|
Ram{\'i}rez-S{\'a}nchez, Gema and |
|
|
Samuel, David and |
|
|
Stepachev, Pavel and |
|
|
Tiedemann, J{\"o}rg and |
|
|
Vari{\v{s}}, Du{\v{s}}an and |
|
|
Vojt{\v{e}}chov{\'a}, Tereza and |
|
|
Zaragoza-Bernabeu, Jaume}, |
|
|
editor = "Che, Wanxiang and |
|
|
Nabende, Joyce and |
|
|
Shutova, Ekaterina and |
|
|
Pilehvar, Mohammad Taher", |
|
|
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", |
|
|
month = jul, |
|
|
year = "2025", |
|
|
address = "Vienna, Austria", |
|
|
publisher = "Association for Computational Linguistics", |
|
|
url = "https://aclanthology.org/2025.acl-long.854/", |
|
|
doi = "10.18653/v1/2025.acl-long.854", |
|
|
pages = "17452--17485", |
|
|
ISBN = "979-8-89176-251-0", |
|
|
abstract = "Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value." |
|
|
} |
|
|
``` |
|
|
|