Fine-Tuned MuRIL Model On Indic Languages (RESPIN)

1. Abstract

This document details the evaluation and performance metrics of a fine-tuned MuRIL (Multilingual Representations for Indian Languages) model. The primary objective was to adapt the pre-trained MuRIL architecture to the Indic text domain using the RESPIN (REcognizing SPeech in INdian languages) corpus.

Performance was benchmarked against the original pre-trained google/muril-base-cased model, the multilingual industry standard xlm-roberta-base, and specialized monolingual architectures (e.g., L3Cube series). The evaluation demonstrates that the fine-tuned model establishes a new state-of-the-art for low-resource languages such as Chhattisgarhi and Magahi.

2. Methodology

2.1 Training Corpus

The model was fine-tuned utilizing the RESPIN dataset. This corpus comprises text data acquired through extensive web crawling, alongside content derived from newspapers and books processed via Optical Character Recognition (OCR). The heterogeneous nature of these sources necessitated domain-specific adaptation to minimize perplexity (PPL) and enhance model robustness across varied linguistic contexts.

2.2 Model Architecture

  • Base Architecture: MuRIL (BERT-based encoder).
  • Modification: Fine-tuned on the RESPIN corpus using Masked Language Modeling (MLM) objectives to align the vector space with the target distribution.

2.3 Evaluation Baselines

To provide a rigorous assessment of the model's efficacy, the following comparative baselines were utilized:

  • Base MuRIL (google/muril-base-cased): Used to quantify the net improvement (gain) achieved strictly through the fine-tuning process.
  • XLM-RoBERTa (xlm-roberta-base): Selected as the high-capacity multilingual baseline to evaluate zero-shot performance on low-resource languages.
  • Specialist Models (l3cube-pune): Monolingual models trained specifically on a single language. These represent the theoretical upper bound for performance in major languages.

3. Evaluation Datasets

Testing was conducted on a diverse suite of held-out evaluation sets to ensure validity. The evaluation suite includes:

  • Benchmark Testing (Legal): A high-complexity dataset comprising legal documents and bail applications (Available for HI, BN, TE, MR).
  • Samanantar: A general-domain parallel corpus utilized for translation benchmarks (Available for HI, BN, KN, TE, MR).
  • RESPIN (Held-Out): A specific split from the transcription corpus to test retention of the training distribution (Available for all languages).
  • Rural Women: A dialect-rich dataset utilized to test robustness in Bhojpuri (BH).
  • NanoBEIR: A retrieval benchmark dataset, cleaned for non-Devanagari artifacts, used for Maithili (MT) and Magahi (MAG).
  • Chhattisgarh TTS: A transcription dataset used for Chhattisgarhi (CH).
  • IISc-MILE: A speech transcription corpus used for Kannada (KN).

4. Empirical Results

The following table presents the Average Perplexity (PPL) scores across the test files for each language. Perplexity is defined as the exponential of the cross-entropy loss; lower values indicate superior predictive performance.

Table 1: Comparative Average Perplexity (Lower is Better)

Language Code Language Base MuRIL Fine-Tuned MuRIL (Ours) XLM-RoBERTa Specialist Model (L3Cube)
CH Chhattisgarhi 764.36 21.73 169.96 N/A
MAG Magahi 279.29 34.35 67.77 N/A
BH Bhojpuri 444.35 116.41 122.09 N/A
MT Maithili 598.79 141.09 90.83 N/A
HI Hindi 27.58 15.86 10.29 N/A
TE Telugu 62.98 15.41 12.07 7.43
MR Marathi 85.71 25.20 17.71 19.26
KN Kannada 172.59 28.87 15.85 13.23
BN Bengali 92.62 207.80 15.96 33.43

5. Analysis and Conclusion

5.1 Performance on Low-Resource Languages

The most significant observation is the model's performance on extremely low-resource languages (Chhattisgarhi, Magahi, Bhojpuri), where standard multilingual models typically fail.

  • Chhattisgarhi (CH): The Base MuRIL model exhibited a perplexity of 764.36, indicating a lack of comprehension. The Fine-Tuned model reduced this drastically to 21.73, outperforming the much larger XLM-R (169.96) by an order of magnitude.
  • Magahi (MAG): The Fine-Tuned model achieved a perplexity of 34.35, surpassing both the Base MuRIL (279.29) and the XLM-R baseline (67.77).
  • Bhojpuri (BH): The model demonstrated superior robustness, achieving a score of 116.41, edging out XLM-R (122.09) and vastly improving upon the Base MuRIL (444.35).

This validates the efficacy of the RESPIN dataset for adapting encoders to under-represented Indic dialects.

5.2 Performance on Major Languages

For widely spoken languages (Hindi, Telugu, Marathi), the fine-tuning process yielded substantial improvements over the Base MuRIL architecture.

  • Telugu (TE) & Marathi (MR): The Fine-Tuned model reduced perplexity by approximately 75% and 70% respectively compared to the Base model. While the specialized monolingual L3Cube models remain the leader in this category (as expected for single-language specialists), the Fine-Tuned MuRIL is a competitive multilingual alternative.
  • Bengali (BN): The model exhibited regression in the average perplexity score (207.80). Detailed analysis reveals that while the model performed well on the respin_bn split (68.14), it struggled to generalize to the benchmark_testing and spring_inx datasets.

5.3 Summary

The Fine-Tuned MuRIL model establishes a new benchmark for Chhattisgarhi and Magahi text encoding, significantly surpassing existing open-source multilingual alternatives.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support