Teacher Model: Vision-Language Model for Transliteration of Modi Script to Devanagari
Introduction
This repository hosts the official teacher model weights as described in the paper:
Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari
Paper Link (Springer LNCS)
Accepted at ICDAR 2025
The model implements a novel Vision-Language framework, built on the gemma-3-12b-it base, that automatically transliterates the historic Modi script into modern Devanagari, supporting research on rare manuscripts and their digital preservation.
Model Description
- Architecture: Vision-Language Model (VLM) based on gemma-3-12b-it
- Task: End-to-end transliteration of scanned Modi script images into Devanagari text.
- Teacher Model: This release contains the weights of the teacher model used for training and evaluation in the referenced paper.
- Dataset: Fine-tuned and evaluated on the MoDeTrans and SynthMoDe datasets, introduced in the paper.
Installation
pip3 install pillow
pip3 install torch torchvision
pip3 install transformers peft accelerate
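Optional sanity check: the usage example below assumes a CUDA GPU with bfloat16 support, which you can verify with a short snippet before downloading the 12B weights:

import torch

# The inference code below uses device "cuda:0" and torch.bfloat16,
# so check that both are available first.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("bfloat16 supported:", torch.cuda.is_bf16_supported())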
How to Use
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch
import torch.nn.functional as F
from peft import PeftModel
device = "cuda:0"
model_id = "google/gemma-3-12b-it"
peft_model_path = "historyHulk/ModiTrans-12B-Gemma-Teacher"
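# The repository above hosts a PEFT adapter; it is applied on top of the
# gemma-3-12b-it base weights via PeftModel below.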
model = AutoModelForImageTextToText.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map=device
)
model = PeftModel.from_pretrained(
model,
peft_model_path,
device_map=device,
torch_dtype=torch.bfloat16
)
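# Optional: if the adapter is LoRA-style, it can be merged into the base
# weights for slightly faster inference.
# model = model.merge_and_unload()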
# Replace the placeholder with the path to a Modi script image, preprocessed as in the dataset.
image = Image.open("<Modi Script Image Preprocessed as in Dataset>").convert("RGB").resize((1024, 512))
processor = AutoProcessor.from_pretrained(model_id)
# The prompt text is kept verbatim (including its spelling), as it presumably
# matches the prompt used during fine-tuning.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {
                "type": "text",
                "text": "Translitrate the following Modi script to Devnagri script.",
            },
        ],
    },
]
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)
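# `inputs` is a BatchFeature holding input_ids, attention_mask and pixel_values;
# the dtype argument above only casts the floating-point tensors (pixel values).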
input_len = inputs["input_ids"].shape[-1]
pixel_values = inputs["pixel_values"].to(dtype=model.dtype, device=model.device)
model.eval()
with torch.no_grad():
    input_ids = inputs["input_ids"]
    attention_masks = inputs["attention_mask"]
    # Token-by-token sampling loop; the full sequence is re-encoded on every
    # step (no KV cache), which is simple but slow.
    while True:
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_masks,
            pixel_values=pixel_values,
        )
        logits = outputs.logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        attention_masks = torch.cat([attention_masks, torch.ones_like(next_token)], dim=-1)
        # Stop at EOS or after at most 350 total tokens.
        if next_token.item() == processor.tokenizer.eos_token_id or input_ids.shape[1] >= 350:
            break
generation = input_ids[:, input_len:][0]
generated_text = processor.decode(generation, skip_special_tokens=True)
print(generated_text)
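As a simpler alternative to the manual decoding loop, the standard model.generate API should give equivalent results. A minimal sketch, assuming the same inputs prepared above (do_sample=True mirrors the multinomial sampling in the loop; set it to False for deterministic greedy decoding):

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=350,  # same cap as the manual loop
        do_sample=True,      # matches the multinomial sampling above
    )
print(processor.decode(output_ids[0][input_len:], skip_special_tokens=True))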
Citation
If you use this model in your research or publications, please cite the following paper:
@InProceedings{10.1007/978-3-032-04630-7_3,
author="Kausadikar, Harshal
and Kale, Tanvi
and Susladkar, Onkar
and Mittal, Sparsh",
editor="Yin, Xu-Cheng
and Karatzas, Dimosthenis
and Lopresti, Daniel",
title="Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari",
booktitle="Document Analysis and Recognition -- ICDAR 2025",
year="2026",
publisher="Springer Nature Switzerland",
address="Cham",
pages="40--61",
abstract="In medieval India, the Marathi language was written using the Modi script. The texts written in Modi script include extensive knowledge about medieval sciences, medicines, land records and authentic evidence about Indian history. Around 40 million documents are in poor condition and have not yet been transliterated. Furthermore, only a few experts in this domain can transliterate this script into English or Devanagari. Most of the past research predominantly focuses on individual character recognition. A system that can transliterate Modi script documents to Devanagari script is needed. We propose the MoDeTrans dataset, comprising 2,043 images of Modi script documents accompanied by their corresponding textual transliterations in Devanagari. We further introduce MoScNet (Modi Script Network), a novel Vision-Language Model (VLM) framework for transliterating Modi script images into Devanagari text. MoScNet leverages Knowledge Distillation, where a student model learns from a teacher model to enhance transliteration performance. The final student model of MoScNet has better performance than the teacher model while having 163{\texttimes} lower parameters. Our work is the first to perform direct transliteration from the handwritten Modi script to the Devanagari script. MoScNet also shows competitive results on the optical character recognition (OCR) task.",
isbn="978-3-032-04630-7"
}