---
library_name: transformers
tags:
- nlp
- nmt
- nagamese
- low-resource
- creole
- facebook/nllb-200
language:
- nag
- eng
metrics:
- bleu
- chrf
---

# Nagamese-to-English NMT

This model is a fine-tuned version of **facebook/nllb-200-distilled-600M** specifically adapted for **Nagamese (Naga Creole)** to English translation. It is designed to be the second stage of a speech processing pipeline, converting ASR-generated Nagamese text into English.

## Model Details

### Model Description

Nagamese is a creole language primarily spoken in Nagaland, India. Because it lacks a standardized orthography and large-scale parallel datasets, standard translation models often fail. This model uses **Parameter-Efficient Fine-Tuning (LoRA)** to adapt the NLLB-200 architecture to the specific syntax and vocabulary of Nagamese using a highly curated, small-scale dataset.

- **Developed by:** Kenei
- **Model type:** Neural Machine Translation (Encoder-Decoder)
- **Language(s):** Nagamese (Source) to English (Target)
- **Finetuned from model:** facebook/nllb-200-distilled-600M

### Direct Use

The model is intended to translate Nagamese text, specifically outputs from an Automatic Speech Recognition (ASR) system, into English. It is optimized for daily conversational sentences and common phrases.

### Out-of-Scope Use

- Not intended for legal, medical, or official government documentation.
- May struggle with highly technical or scientific jargon not present in the 500-sentence training set.

## Bias, Risks, and Limitations

### Technical Limitations

- **Dataset Size:** Due to the "extreme low-resource" nature of the project (trained on ~500 unique parallel sentences), the model may over-rely on memorized phrases.
- **Orthography:** Since Nagamese is often written phonetically, variations in spelling (e.g., "bosti" vs "boosti") may affect translation quality.

### Recommendations

Users should pair this model with a robust ASR system.
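**Label Smoothing** is recommended to handle the inherent noise in Nagamese transcriptions. Note that label smoothing is a training-time regularizer (for example, the `label_smoothing_factor` argument of `Seq2SeqTrainingArguments`), so it applies when fine-tuning the model further, not at inference.

## How to Get Started with the Model

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-username/nagamese-nmt")
model = AutoModelForSeq2SeqLM.from_pretrained("your-username/nagamese-nmt")

text = "Ami bosti te jai ase."  # "I am going to the village."
inputs = tokenizer(text, return_tensors="pt")

# Force English as the first generated token. convert_tokens_to_ids is used
# because tokenizer.lang_code_to_id is not available in recent transformers releases.
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True))
```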
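
The orthography limitation noted earlier (for example "bosti" vs "boosti") can be partly mitigated by normalizing known spelling variants in the ASR output before translation. The following is a minimal sketch, not part of the released model: the `SPELLING_VARIANTS` table and the `normalize_nagamese` helper are hypothetical, and treating "bosti" as the canonical form is an assumption based only on the example sentence in this card.

```python
import re

# Hypothetical variant table: maps alternate spellings to one canonical form.
# Only the "boosti"/"bosti" pair comes from this model card; choosing "bosti"
# as canonical is an assumption. Extend the table with variants you observe.
SPELLING_VARIANTS = {
    "boosti": "bosti",
}

# Matches runs of letters only (no digits, underscores, or punctuation).
_WORD = re.compile(r"[^\W\d_]+")


def normalize_nagamese(text, variants=SPELLING_VARIANTS):
    """Replace known spelling variants word by word, keeping a leading capital."""

    def swap(match):
        word = match.group(0)
        canonical = variants.get(word.lower())
        if canonical is None:
            return word  # unknown word: leave untouched
        return canonical.capitalize() if word[0].isupper() else canonical

    return _WORD.sub(swap, text)


print(normalize_nagamese("Ami boosti te jai ase."))  # prints: Ami bosti te jai ase.
```

The normalized string can then be passed to the tokenizer in place of the raw ASR output. A lookup table only covers variants that have been seen; it does not handle unseen spellings.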