A collection of state-of-the-art multilingual base encoder language models (270M, 1B, 4B) for Indic languages.
AI4Bharat
non-profit
Verified
A comprehensive dataset collection for Indic language information retrieval.
Collection of Parler-TTS models adapted to Indian languages.
ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams.
A collection of ASR models for 22 scheduled languages of India
- ai4bharat/indic-conformer-600m-multilingual • Updated • 29.6k • 39
- ai4bharat/indicconformer_stt_as_hybrid_ctc_rnnt_large • Automatic Speech Recognition • Updated • 38 • 4
- ai4bharat/indicconformer_stt_bn_hybrid_ctc_rnnt_large • Automatic Speech Recognition • Updated • 73 • 1
- ai4bharat/indicconformer_stt_brx_hybrid_ctc_rnnt_large • Automatic Speech Recognition • Updated • 5
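The per-language repository IDs above appear to embed a language code (`as`, `bn`, `brx`) between `stt_` and `_hybrid`. A minimal sketch of extracting that code follows; the pattern is inferred only from the three IDs listed here, so it is an assumption that other languages in the collection follow the same scheme, and `language_code` is a hypothetical helper name.

```python
import re
from typing import Optional

# Pattern inferred from the three per-language IDs listed above.
# Assumption: other languages in the collection use the same naming scheme.
_REPO_PATTERN = re.compile(
    r"^ai4bharat/indicconformer_stt_(?P<lang>[a-z]+)_hybrid_ctc_rnnt_large$"
)


def language_code(repo_id: str) -> Optional[str]:
    """Return the language code embedded in a per-language repo ID, or None."""
    match = _REPO_PATTERN.match(repo_id)
    return match.group("lang") if match else None


listed_ids = [
    "ai4bharat/indic-conformer-600m-multilingual",
    "ai4bharat/indicconformer_stt_as_hybrid_ctc_rnnt_large",
    "ai4bharat/indicconformer_stt_bn_hybrid_ctc_rnnt_large",
    "ai4bharat/indicconformer_stt_brx_hybrid_ctc_rnnt_large",
]

# Map each listed ID to its language code; the multilingual model has none.
codes = {repo_id: language_code(repo_id) for repo_id in listed_ids}
```

The multilingual model does not match the per-language pattern, so it maps to `None` rather than raising an error.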
A collection of benchmarks used to evaluate Airavata, a Hindi instruction-tuned model built on top of Sarvam's OpenHathi base model.
IndicXTREME is a human-supervised benchmark of 9 diverse NLU tasks across 20 languages, featuring 105 evaluation sets in total.
IndicNLG Benchmark is a dataset collection designed for benchmarking Natural Language Generation (NLG) across 11 Indic languages.
This dataset includes ASR data from rural women speaking Hindi and Bhojpuri, supporting inclusive voice recognition.
RomanSetu is a collection of models that address the challenge of extending Large Language Models (LLMs) to non-English languages using non-Latin scripts.
A Speech Translation Dataset for 13 Indian Languages
Hercule series of Evaluation models
The largest collection of pretraining and instruction fine-tuning datasets for 22 Indic languages.
Models (En-Indic, Indic-En, Indic-Indic) in two variants (base and dist) and benchmarks (IN22-Gen and IN22-Conv) released as part of IndicTrans2.
- ai4bharat/indictrans2-en-indic-1B • Translation • 1B • Updated • 3.66k • 42
- ai4bharat/indictrans2-en-indic-dist-200M • Translation • 0.3B • Updated • 3.83k • 20
- ai4bharat/indictrans2-indic-en-1B • Translation • 1B • Updated • 3.96k • 28
- ai4bharat/indictrans2-indic-en-dist-200M • Translation • 0.2B • Updated • 3.59k • 6
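The repository IDs above seem to encode the translation direction, the variant (base or dist), and a nominal size. The sketch below splits an ID into those parts. It is based only on the four IDs listed here: `parse_repo_id` and `IndicTrans2Id` are hypothetical names, and it is an assumption that Indic-Indic checkpoints follow the same pattern.

```python
import re
from typing import NamedTuple, Optional


class IndicTrans2Id(NamedTuple):
    direction: str   # "en-indic", "indic-en", or "indic-indic"
    distilled: bool  # True for the "dist" variant, False for base
    size: str        # nominal size taken from the name, e.g. "1B" or "200M"


# Pattern inferred from the four IDs listed above.
# Assumption: Indic-Indic checkpoints use the same naming scheme.
_PATTERN = re.compile(
    r"^ai4bharat/indictrans2-"
    r"(?P<direction>en-indic|indic-en|indic-indic)-"
    r"(?P<dist>dist-)?"
    r"(?P<size>\d+[MB])$"
)


def parse_repo_id(repo_id: str) -> Optional[IndicTrans2Id]:
    """Split an IndicTrans2 repo ID into direction, variant, and size."""
    match = _PATTERN.match(repo_id)
    if match is None:
        return None
    return IndicTrans2Id(
        direction=match.group("direction"),
        distilled=match.group("dist") is not None,
        size=match.group("size"),
    )


parsed = parse_repo_id("ai4bharat/indictrans2-en-indic-dist-200M")
```

Note that the size in the name is nominal: the listing reports 0.3B and 0.2B parameters for the two `dist-200M` checkpoints, so the parsed `size` should not be read as an exact parameter count.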
IndicBERT v2 is a multilingual BERT model pretrained on IndicCorp v2, an Indic monolingual corpus of 20.9 billion tokens, covering 24 constitutionally