---
library_name: transformers
tags:
- nlp
- nmt
- nagamese
- low-resource
- creole
- facebook/nllb-200
language:
- nag
- eng
metrics:
- bleu
- chrf
---

# Nagamese-to-English NMT

This model is a fine-tuned version of **facebook/nllb-200-distilled-600M** specifically adapted for **Nagamese (Naga Creole)** to English translation. It is designed to be the second stage of a speech processing pipeline, converting ASR-generated Nagamese text into English.

## Model Details

### Model Description
Nagamese is a creole language primarily spoken in Nagaland, India. Because it lacks a standardized orthography and large-scale parallel datasets, general-purpose translation models often handle it poorly. This model uses **LoRA**, a Parameter-Efficient Fine-Tuning (PEFT) method, to adapt the NLLB-200 architecture to the syntax and vocabulary of Nagamese using a small, curated dataset of about 500 parallel sentences.

- **Developed by:** Kenei
- **Model type:** Neural Machine Translation (Encoder-Decoder)
- **Language(s):** Nagamese (Source) to English (Target)
- **Finetuned from model:** facebook/nllb-200-distilled-600M
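
To make the parameter-efficiency idea behind LoRA concrete, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning versus LoRA. The hidden size of 1024 and the rank of 16 are assumptions chosen for illustration; the rank and target modules actually used for this model are not stated in this card.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates all d_out * d_in weights. LoRA freezes them and
    trains two low-rank factors instead: B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Assumed values for illustration only: hidden size 1024, rank 16.
full, lora = lora_param_counts(d_out=1024, d_in=1024, rank=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Because only the low-rank factors are trained, the number of updated weights stays small, which helps when the training set has only a few hundred sentences.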

## Uses

### Direct Use
The model is intended to translate Nagamese text, specifically the output of an Automatic Speech Recognition (ASR) system, into English. It is optimized for everyday conversational sentences and common phrases.

### Out-of-Scope Use
- Not intended for legal, medical, or official government documentation.
- May struggle with technical or scientific vocabulary that does not appear in the roughly 500-sentence training set.

## Bias, Risks, and Limitations

### Technical Limitations
- **Dataset Size:** The model was trained on only ~500 unique parallel sentences, an extremely low-resource setting, so it may overfit and fall back on memorized phrases for inputs unlike its training data.
- **Orthography:** Since Nagamese is often written phonetically, variations in spelling (e.g., "bosti" vs "boosti") may affect translation quality.
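
One way to reduce the effect of spelling variation is to map known variants to a single canonical form before translation. The following is a minimal sketch, not part of the released model: the `SPELLING_VARIANTS` table and the `normalize_spelling` helper are hypothetical, and only the "boosti"/"bosti" pair comes from this card.

```python
import re

# Hypothetical variant table: only the "boosti" -> "bosti" pair comes from this
# card; a real table would be built from the ASR system's observed output.
SPELLING_VARIANTS = {
    "boosti": "bosti",
}


def normalize_spelling(text: str, variants: dict[str, str] = SPELLING_VARIANTS) -> str:
    """Replace known spelling variants with one canonical form, word by word."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        canonical = variants.get(word.lower())
        if canonical is None:
            return word  # unknown words pass through unchanged
        # Preserve a leading capital so sentence-initial words stay capitalized.
        return canonical.capitalize() if word[0].isupper() else canonical

    # Match runs of letters only, leaving punctuation and digits untouched.
    return re.sub(r"[^\W\d_]+", replace, text)


print(normalize_spelling("Ami boosti te jai ase."))  # -> Ami bosti te jai ase.
```

Which spelling counts as canonical is a design choice; it should match the spelling that appears most often in the training data.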

### Recommendations
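Users should pair this model with a robust ASR system. **Label Smoothing** is a training-time technique, so it applies when fine-tuning rather than at inference; using it during fine-tuning can make the model less sensitive to the inherent noise in Nagamese transcriptions.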
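
For reference, label smoothing acts on the training loss: it spreads a small share of the target probability mass uniformly across the vocabulary, so the model is penalized less for near-miss tokens. A minimal sketch in plain Python, assuming a smoothing factor of 0.1 (a common choice, not a value reported for this model):

```python
import math


def label_smoothed_nll(log_probs: list[float], target: int, epsilon: float = 0.1) -> float:
    """Label-smoothed negative log-likelihood for a single token.

    log_probs: log-probabilities over the vocabulary.
    target:    index of the reference token.
    epsilon:   share of probability mass spread uniformly over the vocabulary.
    """
    nll = -log_probs[target]                    # standard cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)  # cross-entropy against a uniform target
    return (1.0 - epsilon) * nll + epsilon * uniform


# Toy 3-token vocabulary; the model puts 80% of its mass on token 0.
log_probs = [math.log(p) for p in (0.8, 0.1, 0.1)]
plain = label_smoothed_nll(log_probs, target=0, epsilon=0.0)     # plain NLL
smoothed = label_smoothed_nll(log_probs, target=0, epsilon=0.1)  # smoothed loss
print(plain, smoothed)
```

When fine-tuning with the Hugging Face `Trainer`, the `label_smoothing_factor` training argument enables the same idea without custom loss code.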

## How to Get Started with the Model

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Replace "your-username/nagamese-nmt" with the actual repository ID.
tokenizer = AutoTokenizer.from_pretrained("your-username/nagamese-nmt")
model = AutoModelForSeq2SeqLM.from_pretrained("your-username/nagamese-nmt")

text = "Ami bosti te jai ase."  # "I am going to the village."
inputs = tokenizer(text, return_tensors="pt")

# Force English as the target language. `lang_code_to_id` is deprecated or
# removed in recent transformers releases, so look the token up directly.
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True))
```