---
library_name: transformers
license: apache-2.0
datasets:
- nis12ram/entity_type_hi_pilener_constraint
- nis12ram/HindiNER-golden-dataset-constraint1
- nis12ram/HindiNER-golden-dataset-constraint2
- nis12ram/HindiNER-golden-dataset-constraint3
- nis12ram/HindiNER-golden-dataset-constraint4
- nis12ram/HindiNER-golden-dataset-constraint5
- nis12ram/HindiNER-golden-dataset2
- nis12ram/entity_type_hi_pilener_constraint-neg-corr
- nis12ram/HindiNER-golden-dataset-constraint-neg-corr
language:
- hi
- en
metrics:
- f1
base_model:
- nis12ram/Nemotron-4-Mini-Hindi-4B-Instruct
---

# Model Card for HindiNER-4B-v1.0

**[HindiNER-4B-v1.0](https://huggingface.co/nis12ram/HindiNER-4B-v1.0)** - **a general and constraint Hindi NER model**

## Model Details

### Model Description

**[HindiNER-4B-v1.0](https://huggingface.co/nis12ram/HindiNER-4B-v1.0)** is a 4B general Hindi NER model, built on top of [Nemotron-4-Mini-Hindi-4B-Instruct](https://huggingface.co/nis12ram/Nemotron-4-Mini-Hindi-4B-Instruct) with a data-efficient LoRA training strategy, and it supports a context window of 4096 tokens.

- **Developed by:** [nis12ram](https://huggingface.co/nis12ram)
- **Model type:** Autoregressive model
- **Language(s) (NLP):** Hindi, English
- **License:** Apache License 2.0
- **Finetuned from model:** [Nemotron-4-Mini-Hindi-4B-Instruct](https://huggingface.co/nis12ram/Nemotron-4-Mini-Hindi-4B-Instruct)

### Source Text Language Support:

- Hindi (written in Devanagari script) [**primary support**]
- English
- Hinglish or Romanized Hindi
- A mix of all of the above

---

**NOTE**

The model was not explicitly trained for NER on Hinglish data. However, since [Nemotron-4-Mini-Hindi-4B-Instruct](https://huggingface.co/nis12ram/Nemotron-4-Mini-Hindi-4B-Instruct) was also pretrained on Romanized Hindi, the fine-tuned model generalizes well to Hinglish.

**From the [Nemotron-4-Mini-Hindi-4B-Instruct](https://huggingface.co/nis12ram/Nemotron-4-Mini-Hindi-4B-Instruct) paper**

> The translated Hindi data comprises approximately 60 billion tokens. We then combine this synthetic data with around 40 billion real tokens (web-scraped data) to create a dataset totaling 100 billion Hindi tokens. Additionally, this entire Hindi text is transliterated into Roman script, expanding the total dataset to 220 billion tokens. The transliterated tokens are included to enable the model to support Hinglish queries.

---

### Entity Type Language Support:

- Only Hindi (written in Devanagari script)

### Model's Prompt & Desired Output

- prompt:

````python
prompt = '''System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
User
{input}
Assistant
I have read the text.
User
{entity_type}
Assistant
'''
````

- desired output structure: The model's output is a string representation of a JSON array (list), which can be parsed directly using tools such as `json.loads()` in Python or equivalent functions in other languages.

---

## How to Get Started with the Model

Check out the [Colab Notebook](https://colab.research.google.com/drive/1pQAuCV7NVE7A6OD42ci2CdKiQZZEDwG6?usp=sharing)
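For a quick, self-contained start outside the notebook, here is a minimal sketch that combines the prompt template above with plain `transformers` generation and `json.loads()` parsing. The generation arguments and loading options are illustrative assumptions; the Colab notebook remains the reference implementation.

````python
# Minimal sketch (assumptions: plain transformers generation; see the Colab notebook for the reference code).
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nis12ram/HindiNER-4B-v1.0"

PROMPT_TEMPLATE = '''System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
User
{input}
Assistant
I have read the text.
User
{entity_type}
Assistant
'''

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

def extract_entities(text: str, entity_type: str) -> list[str]:
    """Query the model for one entity type and parse the JSON-array reply."""
    prompt = PROMPT_TEMPLATE.format(input=text, entity_type=entity_type)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy decoding
    completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # The model replies with a JSON array such as ["नरेंद्र मोदी"], so it can be parsed directly.
    return json.loads(completion.strip())

print(extract_entities(
    "2024 में, प्रधानमंत्री नरेंद्र मोदी ने वाराणसी में एक नई AI रिसर्च लैब का उद्घाटन किया।",
    "प्रधानमंत्री",
))
````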
## Training Details

### Training Data

To build a high-quality training dataset, a two-step approach is implemented:

1. **Data Curation**
2. **Data Augmentation**

#### Data Curation

##### ***1st Dataset***

There are a few good traditional Hindi NER datasets available, but as far as I know, there is no publicly available general-purpose Hindi NER dataset like [Pile-NER-type](https://huggingface.co/datasets/Universal-NER/Pile-NER-type) for English. So I decided to manually collect and annotate a general Hindi NER dataset that is small but rich, diverse, and aligned with the Indian context.

To know more, please refer to **[HindiNER-golden-dataset](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset)** *(952 datapoints)*

##### ***2nd Dataset***

[HindiNER-golden-dataset](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset) *(952 datapoints)* alone is not enough to teach the model how to do NER properly. Taking inspiration from the well-known fact that

> Intermediate tuning on similar tasks can enhance performance on low-resource downstream tasks.

After some experimentation, the most similar dataset turned out to be [Pile-NER-type](https://huggingface.co/datasets/Universal-NER/Pile-NER-type). To further align [Pile-NER-type](https://huggingface.co/datasets/Universal-NER/Pile-NER-type) with the final model's objective of using only Hindi **entity types**, a translation phase is conducted in which all English **entity types** are translated into Hindi.

To know more, please refer to **[entity_type_hi_pilener](https://huggingface.co/datasets/nis12ram/entity_type_hi_pilener)** *(37,859 datapoints)*

#### Data Augmentation

*Pre-defined augmentation variables:* **max_entity_type_value_pairs** = 32, **negative_entities_percent** = 50

##### ***Augmenting HindiNER-golden-dataset***

Due to manual collection and annotation, the size of [HindiNER-golden-dataset](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset) is limited to just 952 datapoints. To facilitate effective learning from this limited-size dataset, an oversampling strategy is used (a rough sketch follows the dataset links below).

*Oversampling strategy*

- *Step 1*: Create 5 copies of [HindiNER-golden-dataset](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset).
- *Step 2*: Randomly shuffle entity type-value pairs in each copy.
- *Step 3*: Randomly drop entity type-value pairs from each copy if the total number of entity type-value pairs in a datapoint exceeds **max_entity_type_value_pairs**/2.
- *Step 4*: Create a list of all rare entity types, **rare_entity_type_lst**.
- *Step 5*: Add **negative_entities_percent**% negative entity type-value pairs to each datapoint in all copies by randomly selecting from **rare_entity_type_lst**.
- *Step 6*: Randomly shuffle entity type-value pairs in each copy.

To know more, please refer to [HindiNER-golden-dataset-constraint1](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset-constraint1), [HindiNER-golden-dataset-constraint2](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset-constraint2), [HindiNER-golden-dataset-constraint3](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset-constraint3), [HindiNER-golden-dataset-constraint4](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset-constraint4), [HindiNER-golden-dataset-constraint5](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset-constraint5), [HindiNER-golden-dataset-constraint-neg-corr](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset-constraint-neg-corr) (**Not Documented**)
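The following is a rough Python sketch of the oversampling steps above. The datapoint schema (an `entity_type_value_pairs` field holding entity type-value pairs) and the helper names are assumptions for illustration, not the exact code used to build the released constraint datasets.

````python
# Illustrative sketch only; field names and helpers are assumptions, not the exact dataset-building code.
import random

MAX_ENTITY_TYPE_VALUE_PAIRS = 32
NEGATIVE_ENTITIES_PERCENT = 50

def augment_copy(datapoints, rare_entity_type_lst, seed=0):
    """Apply Steps 2-6 to one copy; rare_entity_type_lst is the Step 4 list of rare entity types."""
    rng = random.Random(seed)
    augmented = []
    for dp in datapoints:
        pairs = list(dp["entity_type_value_pairs"])   # assumed schema: [(entity_type, [values, ...]), ...]
        rng.shuffle(pairs)                             # Step 2: shuffle pairs
        limit = MAX_ENTITY_TYPE_VALUE_PAIRS // 2
        pairs = pairs[:limit]                          # Step 3: drop pairs beyond max_entity_type_value_pairs/2
        n_negative = round(len(pairs) * NEGATIVE_ENTITIES_PERCENT / 100)
        present = {etype for etype, _ in pairs}
        candidates = [e for e in rare_entity_type_lst if e not in present]
        for etype in rng.sample(candidates, min(n_negative, len(candidates))):
            pairs.append((etype, []))                  # Step 5: negative entity types map to empty value lists
        rng.shuffle(pairs)                             # Step 6: final shuffle
        augmented.append({**dp, "entity_type_value_pairs": pairs})
    return augmented

# Step 1: five independently augmented copies of the golden dataset.
# copies = [augment_copy(golden_dataset, rare_entity_type_lst, seed=i) for i in range(5)]
````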
##### ***Augmenting entity_type_hi_pilener***

Here, a similar augmentation strategy is used.

*Augmentation strategy*

- *Step 1*: Randomly shuffle entity type-value pairs in entity_type_hi_pilener.
- *Step 2*: Randomly drop entity type-value pairs if the total number of entity type-value pairs in a datapoint exceeds **max_entity_type_value_pairs**/2.
- *Step 3*: Create a list of all rare entity types, **rare_entity_type_lst**.
- *Step 4*: Add **negative_entities_percent**% negative entity type-value pairs to each datapoint by randomly selecting from **rare_entity_type_lst**.
- *Step 5*: Randomly shuffle entity type-value pairs in entity_type_hi_pilener.

To know more, please refer to [entity_type_hi_pilener_constraint](https://huggingface.co/datasets/nis12ram/entity_type_hi_pilener_constraint), [entity_type_hi_pilener_constraint-neg-corr](https://huggingface.co/datasets/nis12ram/entity_type_hi_pilener_constraint-neg-corr) (**Not Documented**)

#### Dataset Size

- 4,760 (952 × 5) -> HindiNER-golden-dataset-constraint1, ..., HindiNER-golden-dataset-constraint5
- 37,855 -> entity_type_hi_pilener_constraint
- 42,615 -> total

### Training Procedure

To build a high-quality model, a two-step approach is implemented:

1. **Core**
2. **Polish**

#### Core

***The model is trained in a multi-turn conversation fashion, with a single entity type per turn.***

- Example:

````python
sample_conversation = '''System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
User
2024 में, प्रधानमंत्री नरेंद्र मोदी ने वाराणसी में एक नई AI रिसर्च लैब का उद्घाटन किया। इस कार्यक्रम में गूगल इंडिया, IIT दिल्ली और नीति आयोग के प्रतिनिधि मौजूद थे। उद्घाटन समारोह 15 अगस्त 2024 को हुआ, जिसमें रतन टाटा और सचिन तेंदुलकर भी विशेष अतिथि के रूप में शामिल हुए।
Assistant
I have read the text.
User
प्रधानमंत्री
Assistant
["नरेंद्र मोदी"]
User
वर्ष
Assistant
["2024"]
User
आपातकालीन घटना
Assistant
[]
User
संगठन
Assistant
["गूगल इंडिया", "IIT दिल्ली", "नीति आयोग"]
'''
````

Training Details:

- Training technique = LoRA
- Dataset = [nis12ram/HindiNER-golden-dataset-constraint1, nis12ram/HindiNER-golden-dataset-constraint2, nis12ram/HindiNER-golden-dataset-constraint3, nis12ram/HindiNER-golden-dataset-constraint4, nis12ram/HindiNER-golden-dataset-constraint5, nis12ram/entity_type_hi_pilener_constraint]
- Dataset Format = Multi-turn conversation
- LoRA rank = 512
- LoRA alpha = 512
- LoRA target modules = ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]
- Batch size = 16
- Gradient accumulation = 1
- Warmup ratio = 0.03
- Epochs = 1
- Learning rate = 5e-5
- Optimizer = adamw_8bit
- Learning rate scheduler = linear
- Weight decay = 0.01
- Max seq length = 4000

Check out the [Colab Notebook](https://colab.research.google.com/drive/1Qo90Mk6UU81Bi05YmTHsdOFg-HfoJEWH?usp=sharing) for the training code.

Check out the [Model](https://huggingface.co/nis12ram/Nemotron-4-Mini-Hindi-4B-constraint-phase1-exp2) obtained from **Core** training.

Q. *What did **Core** training achieve?*

Ans. The **Core** training produces a model that is highly effective at Named Entity Recognition (NER) for Hindi, English, and Hinglish, while preserving the desired output structure.

Q. *What did **Core** training lack?*

Ans. The **Core** training produces a model that works superbly at extracting entity values for positive entity types, but lacks the ability to understand negative entity types and often hallucinates when handling them.
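As a rough illustration of the Core hyperparameters listed above, here is a minimal sketch using `peft` and `trl`. The actual training notebook may use a different stack (for example Unsloth), and values such as `output_dir` are assumptions, so the linked Colab remains the reference.

````python
# Sketch of the Core-phase LoRA configuration using peft/trl (the linked Colab is the reference setup).
from peft import LoraConfig
from trl import SFTConfig

core_lora_config = LoraConfig(
    r=512,
    lora_alpha=512,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)

core_training_args = SFTConfig(
    output_dir="hindiner-core",        # assumed output path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    warmup_ratio=0.03,
    num_train_epochs=1,
    learning_rate=5e-5,
    optim="adamw_8bit",
    lr_scheduler_type="linear",
    weight_decay=0.01,
    max_seq_length=4000,               # kwarg name may differ across trl versions
)
````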
#### Polish

The **Polish** training was not predetermined in terms of dataset, dataset format, or training hyperparameters. All of these training components were decided based on the limitations of the model produced by the **Core** training.

***The model is trained in a single-turn conversation fashion.***

- Example:

````python
sample_conversation = '''System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
User
2024 में, प्रधानमंत्री नरेंद्र मोदी ने वाराणसी में एक नई AI रिसर्च लैब का उद्घाटन किया। इस कार्यक्रम में गूगल इंडिया, IIT दिल्ली और नीति आयोग के प्रतिनिधि मौजूद थे। उद्घाटन समारोह 15 अगस्त 2024 को हुआ, जिसमें रतन टाटा और सचिन तेंदुलकर भी विशेष अतिथि के रूप में शामिल हुए।
Assistant
I have read the text.
User
प्रधानमंत्री
Assistant
["नरेंद्र मोदी"]
'''
````

*A manual dataset is collected with the objective of mitigating the limitations of **Core** training.*

*To know more about the dataset, please check out [HindiNER-golden-dataset2](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset2/blob/main/HindiNER-golden-dataset2.json).*

***To avoid catastrophic forgetting, 2 additional datasets were included in the training mix.***

1. [entity_type_hi_pilener_constraint-neg-corr](https://huggingface.co/datasets/nis12ram/entity_type_hi_pilener_constraint-neg-corr)
2. [HindiNER-golden-dataset-constraint-neg-corr](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset-constraint-neg-corr)

Training Details:

- Training technique = LoRA
- Dataset = [nis12ram/HindiNER-golden-dataset2, nis12ram/entity_type_hi_pilener_constraint-neg-corr, nis12ram/HindiNER-golden-dataset-constraint-neg-corr]
- Dataset Format = Single-turn conversation
- LoRA rank = 8
- LoRA alpha = 8
- LoRA target modules = ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]
- Batch size = 4
- Gradient accumulation = 2
- Warmup ratio = 0.00
- Epochs = 1
- Learning rate = 2e-4
- Optimizer = adamw_8bit
- Learning rate scheduler = linear
- Weight decay = 0.01
- Max seq length = 2048

Check out the [Colab Notebook](https://colab.research.google.com/drive/1RSng4lJ_BZAVrXNPLvGA-xK8tcv7S2mc?usp=sharing) for the training code.

Q. *What did **Polish** training achieve?*

Ans. The **Polish** training produces a model that is highly effective at handling both positive and negative entity types.

Q. *What did **Polish** training lack?*

Ans. The final model is capable of handling source texts of different formats, styles, and languages, but there can still be cases where the model hallucinates.
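The helper below is a hypothetical sketch of how a single (text, entity type, entities) example could be rendered into the single-turn conversation string shown above. The field names `text`, `entity_type`, and `entities` are assumptions about the dataset schema, not its documented format.

````python
# Hypothetical formatter for the Polish-phase single-turn format (field names are assumptions).
import json

SYSTEM_PROMPT = (
    "You are a text‐reader and entity extractor. When given a text, read it and reply "
    "“I have read the text.” Then, when the user provides an entity type in Hindi, "
    "extract and return a list of all matching entities from the text."
)

def to_single_turn(datapoint: dict) -> str:
    """Render one example as a single-turn training string matching sample_conversation above."""
    entities_json = json.dumps(datapoint["entities"], ensure_ascii=False)  # keep Devanagari readable
    return (
        f"System\n{SYSTEM_PROMPT}\n"
        f"User\n{datapoint['text']}\n"
        "Assistant\nI have read the text.\n"
        f"User\n{datapoint['entity_type']}\n"
        f"Assistant\n{entities_json}\n"
    )

example = {
    "text": "2024 में, प्रधानमंत्री नरेंद्र मोदी ने वाराणसी में एक नई AI रिसर्च लैब का उद्घाटन किया।",
    "entity_type": "प्रधानमंत्री",
    "entities": ["नरेंद्र मोदी"],
}
print(to_single_turn(example))
````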
### Experiments Tried but Didn't Work

1. Training [Nemotron-4-Mini-Hindi-4B-Instruct](https://huggingface.co/nis12ram/Nemotron-4-Mini-Hindi-4B-Instruct) by passing all or multiple entity types in a single turn results in a model with lower accuracy and poorer structure-following capabilities than [HindiNER-4B-v1.0](https://huggingface.co/nis12ram/HindiNER-4B-v1.0), which is trained by passing a single entity type per turn. A model trained by passing multiple entity types in a single turn does not produce good results, even when only a single entity type is passed in a turn.

   **Possible Reason**

   **From the [UniversalNER: TARGETED DISTILLATION FROM LARGE LANGUAGE MODELS FOR OPEN NAMED ENTITY RECOGNITION](https://arxiv.org/pdf/2308.03279) paper**

   > When the model is required to handle multiple entity types within a single query, it might disperse its attention across these varied types, possibly resulting in less accurate identification for each individual type. Conversely, by decomposing the task into several simpler ones, each focusing on one entity type at a time, the model might be better equipped to handle the complexity, thus yielding more accurate results

2. Simple duplication of the [HindiNER-golden-dataset](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset) as an oversampling strategy improves model performance but is inferior to the oversampling strategy mentioned above.

3. A multi-turn conversation-based dataset format in **Polish** training produces a model that is not robust enough to handle most negative entity type cases.

   **Possible Reason**

   > The Polish training phase is mainly about very fine-grained updates to the model's behaviour, and learning such fine-grained behaviour works better when only a single entity type is passed per source text.

4. **Polish** training using only the [HindiNER-golden-dataset2](https://huggingface.co/datasets/nis12ram/HindiNER-golden-dataset2/blob/main/HindiNER-golden-dataset2.json) learns to handle negative entity types but forgets much of what was learned during **Core** training.

### Evaluation

***Evaluation is done using a twofold process:***

1. Automatic Evaluation
2. Human Evaluation

#### Automatic Evaluation

Please refer to this [LinkedIn article](https://www.linkedin.com/pulse/lessons-learned-while-building-evaluation-pipeline-ner-choudhary-quykc) to see how the automated evaluation is performed.

**Result**

---

##### **🌐 Language: `hi`**

| Category | F1 Score |
|----------------|----------|
| 📰 News | 0.9295 |
| 💻 Coding | 0.8712 |
| 📄 Long Article | 0.9176 |
| 🏥 Medical | 0.9167 |
| ➕ Math | 0.9143 |
| 📦 Other | 0.9613 |
| 💬 Conversation | 0.9497 |
| ⚗️ Chemistry | 0.7500 |

**🔹 Language-level F1:** `0.9013`

---

##### **🌐 Language: `en`**

| Category | F1 Score |
|----------------|----------|
| 💻 Coding | 0.8589 |
| 🏥 Medical | 0.7667 |
| ➕ Math | 0.0000 |
| 📦 Other | 0.9658 |
| 💬 Conversation | 0.9490 |

**🔹 Language-level F1:** `0.7081`

---

##### **🌐 Language: `hing`**

| Category | F1 Score |
|----------|----------|
| 💬 Chat | 0.7201 |
| 📦 Other | 0.9903 |

**🔹 Language-level F1:** `0.8552`

---

🏁 **Final Overall F1 Score:** `0.8215`

Check out the [Colab Notebook](https://colab.research.google.com/drive/1tFGkEPcCiU1J2IOENDqq6VZlanPVRxzX?usp=sharing) for the evaluation code.

Check out the [Evaluation dataset](https://huggingface.co/datasets/nis12ram/HindiNER-golden-eval-dataset/blob/main/benchmark_data.json)

#### Human Evaluation

Human evaluation is done by probing the model with diverse source texts and checking its predictions across different scenarios.

### Usage Details

1. The stop token should be set to ``.
2. Greedy sampling should be preferred.
3. vLLM should be preferred for fast and optimal batch inferencing.
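Following the usage notes above, here is a minimal sketch of batch inference with vLLM using greedy sampling. The stop string is left as a placeholder because the exact token is not spelled out in this card, and the generation length is an assumption.

````python
# Sketch of batch inference with vLLM and greedy sampling (stop string left as a placeholder).
from vllm import LLM, SamplingParams

PROMPT_TEMPLATE = '''System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
User
{input}
Assistant
I have read the text.
User
{entity_type}
Assistant
'''

STOP_TOKEN = "<stop-token-from-usage-details>"  # placeholder: set this to the stop token noted above

llm = LLM(model="nis12ram/HindiNER-4B-v1.0")
params = SamplingParams(temperature=0.0, max_tokens=128, stop=[STOP_TOKEN])  # temperature=0.0 => greedy

prompts = [PROMPT_TEMPLATE.format(
    input="2024 में, प्रधानमंत्री नरेंद्र मोदी ने वाराणसी में एक नई AI रिसर्च लैब का उद्घाटन किया।",
    entity_type="प्रधानमंत्री",
)]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())  # expected: a JSON array string such as ["नरेंद्र मोदी"]
````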