
Teacher Model: Vision-Language Model for Transliteration of Modi Script to Devanagari

Introduction

This repository hosts the official teacher model weights as described in the paper:

Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari
Paper Link (Springer LNCS)
Accepted at ICDAR 2025

Our model is a novel Vision-Language framework, built on the gemma-3-12b-it base, that automatically transliterates the historic Modi script into modern Devanagari, supporting research on and digital preservation of rare manuscripts.


Model Description

  • Architecture: Vision-Language Model (VLM) based on gemma-3-12b-it.
  • Task: End-to-end transliteration of scanned Modi script images into Devanagari text.
  • Teacher Model: This release contains the weights of the teacher model used for training and evaluation in the referenced paper.
  • Dataset: Fine-tuned and evaluated on the MoDeTrans and SynthMoDe datasets introduced in the paper.

Installation

pip3 install pillow
pip3 install torch torchvision
pip3 install transformers peft accelerate
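
Gemma 3 requires a fairly recent transformers release; if loading fails with an unrecognized-architecture error, upgrading usually resolves it. The version floor below is an assumption (support landed around v4.50), so check the release notes for your setup:

pip3 install --upgrade "transformers>=4.50"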

How to Use

from transformers import AutoProcessor, AutoModelForImageTextToText, AutoConfig
from PIL import Image
import torch
import torch.nn.functional as F
from peft import PeftModel 

device = "cuda:0"
model_id = "google/gemma-3-12b-it"
peft_model_path = "historyHulk/ModiTrans-12B-Gemma-Teacher"

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map=device
)

model = PeftModel.from_pretrained(
    model,
    peft_model_path,
    device_map=device,
    torch_dtype=torch.bfloat16
)
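
Optionally, for repeated inference you can fold the LoRA adapter into the base model with PEFT's merge_and_unload, which removes the adapter indirection. This is an optional step, not part of the original recipe:

# Optional: merge the adapter weights into the base model for faster inference.
model = model.merge_and_unload()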

# Supply a Modi script image preprocessed as in the dataset; inputs are resized to 1024x512.
image = Image.open("<Modi Script Image Preprocessed as in Dataset>").convert("RGB").resize((1024, 512))

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {
                "type": "text",
                # Prompt spelling kept verbatim, since it presumably matches the fine-tuning prompt.
                "text": "Translitrate the following Modi script to Devnagri script.",
            },
        ],
    },
    # The trailing assistant turn with an empty text item is omitted:
    # add_generation_prompt=True below already appends the assistant prefix,
    # and a text item without a "text" field can break the chat template.
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)  # the dtype cast only affects floating-point tensors

input_len = inputs["input_ids"].shape[-1]  # prompt length, used later to strip the prompt
pixel_values = inputs["pixel_values"].to(dtype=model.dtype, device=model.device)

model.eval()
with torch.no_grad():
    input_ids = inputs["input_ids"]
    attention_masks = inputs["attention_mask"]
    # Token-by-token sampling loop (no KV cache, so each step re-runs the full prefix).
    while True:
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_masks,
            pixel_values=pixel_values,
        )

        # Sample the next token from the distribution over the last position.
        logits = outputs.logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        attention_masks = torch.cat([attention_masks, torch.ones_like(next_token)], dim=-1)
        # Stop at EOS or once the sequence (prompt + generation) reaches 350 tokens.
        if next_token.item() == processor.tokenizer.eos_token_id or input_ids.shape[1] >= 350:
            break

# Strip the prompt and decode only the newly generated tokens.
generation = input_ids[:, input_len:][0]
generated_text = processor.decode(generation, skip_special_tokens=True)
print(generated_text)
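
The manual loop above mirrors the original snippet; for routine use, the same multinomial sampling can be done with model.generate, which adds KV caching and is considerably faster. A minimal equivalent sketch (max_new_tokens here is an assumed budget, not a value from the paper):

# Equivalent sampling with KV caching via generate.
with torch.no_grad():
    output_ids = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(processor.decode(output_ids[0, input_len:], skip_special_tokens=True))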

Citation

If you use this model in your research or publications, please cite the following paper:

@InProceedings{10.1007/978-3-032-04630-7_3,
author="Kausadikar, Harshal
and Kale, Tanvi
and Susladkar, Onkar
and Mittal, Sparsh",
editor="Yin, Xu-Cheng
and Karatzas, Dimosthenis
and Lopresti, Daniel",
title="Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari",
booktitle="Document Analysis and Recognition --  ICDAR 2025",
year="2026",
publisher="Springer Nature Switzerland",
address="Cham",
pages="40--61",
abstract="In medieval India, the Marathi language was written using the Modi script. The texts written in Modi script include extensive knowledge about medieval sciences, medicines, land records and authentic evidence about Indian history. Around 40 million documents are in poor condition and have not yet been transliterated. Furthermore, only a few experts in this domain can transliterate this script into English or Devanagari. Most of the past research predominantly focuses on individual character recognition. A system that can transliterate Modi script documents to Devanagari script is needed. We propose the MoDeTrans dataset, comprising 2,043 images of Modi script documents accompanied by their corresponding textual transliterations in Devanagari. We further introduce MoScNet (Modi Script Network), a novel Vision-Language Model (VLM) framework for transliterating Modi script images into Devanagari text. MoScNet leverages Knowledge Distillation, where a student model learns from a teacher model to enhance transliteration performance. The final student model of MoScNet has better performance than the teacher model while having 163{\$}{\$}{\backslash}times {\$}{\$}{\texttimes}lower parameters. Our work is the first to perform direct transliteration from the handwritten Modi script to the Devanagari script. MoScNet also shows competitive results on the optical character recognition (OCR) task.",
isbn="978-3-032-04630-7"
}