Update README.md
---
license: mit

language:
- en

widget:
- text: "Let us translate some text from Livonian to Võro!"
---

# NMT for Finno-Ugric Languages

This is an NMT system for translating between Võro, Livonian, North Sami and South Sami, as well as Estonian, Finnish, Latvian and English. It was created by fine-tuning Facebook's m2m100-418M model on the liv4ever and smugri datasets.

## Tokenizer

Four language codes were added to the tokenizer: __liv__, __vro__, __sma__ and __sme__. The m2m100 tokenizer currently reads its supported languages from a hard-coded list, so the new codes have to be patched in after loading; see the usage example below.
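
For illustration, here is a minimal sketch (not part of the original card) of why the patch is needed: until the mappings are extended as in the usage example below, the tokenizer's language tables cover only the base m2m100 codes.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/m2m100_418M_smugri")

# Before patching, only the base m2m100 codes are present:
print("et" in tokenizer.lang_code_to_id)   # True  (Estonian is a base code)
print("liv" in tokenizer.lang_code_to_id)  # False (added code, not yet patched in)

# After applying the four patch lines from the usage example below,
# "liv", "vro", "sma" and "sme" resolve as well.
```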

## Usage example

Install the transformers and sentencepiece libraries: `pip install sentencepiece transformers`

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/m2m100_418M_smugri")

# Fix the language codes in the tokenizer: merge the added tokens
# (liv, vro, sma, sme) into the hard-coded language mappings
tokenizer.id_to_lang_token = dict(list(tokenizer.id_to_lang_token.items()) + list(tokenizer.added_tokens_decoder.items()))
tokenizer.lang_token_to_id = dict(list(tokenizer.lang_token_to_id.items()) + list(tokenizer.added_tokens_encoder.items()))
tokenizer.lang_code_to_token = {k.replace("_", ""): k for k in tokenizer.additional_special_tokens}
tokenizer.lang_code_to_id = {k.replace("_", ""): v for k, v in tokenizer.lang_token_to_id.items()}

model = AutoModelForSeq2SeqLM.from_pretrained("tartuNLP/m2m100_418M_smugri")

# Translate a Livonian sentence into North Sami
tokenizer.src_lang = 'liv'
encoded_src = tokenizer("Līvõ kēļ jelāb!", return_tensors="pt")
encoded_out = model.generate(**encoded_src, forced_bos_token_id=tokenizer.get_lang_id("sme"))
print(tokenizer.batch_decode(encoded_out, skip_special_tokens=True))
```

The output is `Livčča giella eallá.`
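
Since the model is many-to-many, the same objects can translate between any pair of the languages listed above. As a sketch (reusing `model` and `tokenizer` from the example above, with the language-code patch already applied), translating the widget sentence from English to Võro:

```python
# Sketch only: assumes `model` and `tokenizer` from the example above.
# "en" is a base m2m100 code; "vro" is one of the four added codes.
tokenizer.src_lang = "en"
encoded_src = tokenizer("Let us translate some text from Livonian to Võro!", return_tensors="pt")
encoded_out = model.generate(**encoded_src, forced_bos_token_id=tokenizer.get_lang_id("vro"))
print(tokenizer.batch_decode(encoded_out, skip_special_tokens=True))
```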