Lapa Ukrainian Handwriting OCR — LoRA Adapter

LoRA adapter on top of lapa-llm/lapa-v0.1.2-instruct (a Gemma-3-12B Ukrainian vision-language model) for Ukrainian handwritten-text recognition (HTR / OCR) on document crops.

The base Lapa model, applied zero-shot to handwriting crops, tends to paraphrase rather than transcribe literally. This adapter fine-tunes the text decoder with LoRA (the vision tower stays frozen) so that it emits a literal transcription of the text in the image. It was developed as an OCR component for a Ukrainian HTR pipeline (handwritten + printed regions, math formulas).

Results (internal validation)

Metric	Base Lapa (bf16)	+ this LoRA
Handwritten CER	3.28	0.113
Handwritten exact-match	1.3%	47.7%
Printed CER	1.08	0.187

CER > 1 on the base reflects heavy paraphrasing (output far longer than ground truth). The adapter removes that behavior: handwritten CER drops to 0.113 and 47.7% of handwritten crops are transcribed exactly.
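To make the CER > 1 reading concrete, here is a minimal sketch of the metric, assuming the standard definition (character-level Levenshtein distance divided by the ground-truth length). The card does not state which normalization or tooling produced the numbers above, so the function names and example strings are illustrative only.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,          # delete a reference character
                curr[j - 1] + 1,      # insert a hypothesis character
                prev[j - 1] + cost,   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def exact_match(ref: str, hyp: str) -> bool:
    """Assumed definition: equality after trimming surrounding whitespace."""
    return ref.strip() == hyp.strip()


reference = "кіт"
paraphrase = "На зображенні написано слово кіт"
print(cer(reference, "кит"))       # one substitution out of three characters
print(cer(reference, paraphrase))  # well above 1: every extra character is an insertion
```

Because the denominator is the reference length, a long paraphrase of a short reference pushes the ratio far above 1, which is the pattern the base-model column shows.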

Intended use

Transcribing Ukrainian handwritten / printed text crops (region-level images, not full pages) into plain text.
As a cross-vote / ensemble OCR partner alongside other VLMs.

Not tuned for: full-page layout, non-Ukrainian scripts, or marginal / very low-quality regions (CER rises to ~0.55 on hard, low-confidence regions).
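The cross-vote use listed above could be sketched as follows. The card does not describe the voting scheme, so this is only one plausible choice: pick the candidate transcription that agrees most with the others, using the standard-library difflib similarity ratio. The names normalize and pick_consensus are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse runs of whitespace so spacing differences do not affect the vote."""
    return " ".join(text.split())


def pick_consensus(candidates: list[str]) -> str:
    """Return the candidate with the highest total similarity to all the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    cleaned = [normalize(c) for c in candidates]

    def agreement(i: int) -> float:
        return sum(
            SequenceMatcher(None, cleaned[i], other).ratio()
            for j, other in enumerate(cleaned)
            if j != i
        )

    # max() keeps the first index on ties, so the result is deterministic.
    best = max(range(len(cleaned)), key=agreement)
    return cleaned[best]


# Hypothetical outputs from three OCR models for the same crop.
outputs = ["привіт світ", "привіт  світ", "привет свет"]
print(pick_consensus(outputs))
```

With two models agreeing and one dissenting, the agreed reading wins; with only one candidate the function returns it unchanged.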

How to use

import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

BASE = "lapa-llm/lapa-v0.1.2-instruct"
ADAPTER = "lapa-llm/lapa-ocr-lora"  # this repo

base = AutoModelForImageTextToText.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)
model = PeftModel.from_pretrained(base, ADAPTER).eval()
processor = AutoProcessor.from_pretrained(BASE)

PROMPT = "Transcribe Ukrainian text literally. Output only the text, no preamble."
img = Image.open("crop.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": PROMPT},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt", padding=True,
).to(model.device, dtype=torch.bfloat16)

with torch.inference_mode():
    gen = model.generate(**inputs, max_new_tokens=256, do_sample=False, num_beams=1)
text = processor.batch_decode(
    gen[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0].strip()
print(text)

Run on a single 24 GB GPU (A10G / RTX 3090 / A5000 / L4)

The base is a 12B model: the bf16 weights alone are ~24 GB, which leaves no room on a 24 GB card for activations and the KV cache, so you must quantize. Naive 4/8-bit loading often produces empty output or a repeated pad token (id 0). That symptom is almost always an environment issue, not a model or adapter defect. Two requirements are easy to miss:

torch >= 2.6 is mandatory. With transformers 4.57, Gemma 3 builds its bidirectional image-attention mask with or_mask_function, which raises ValueError: Using or_mask_function ... require torch>=2.6 on torch 2.5.x — for both eager and sdpa. On older torch every forward pass dies and you get empty / garbage output.
A C compiler must be present. bitsandbytes ≥ 0.49 pulls triton, which JIT-compiles a CUDA helper at import. Without gcc you get RuntimeError: Failed to find C compiler, surfaced confusingly as ModuleNotFoundError: validate_bnb_backend_availability.
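Both requirements can be checked before loading any weights. The sketch below uses only the standard library; the helper names are hypothetical, and the list of compiler names it searches for (gcc, clang, cc) is an assumption about what the toolchain will accept, not something this card specifies.

```python
import re
import shutil
from importlib import metadata
from typing import Optional


def parse_major_minor(version: str) -> tuple[int, int]:
    """Turn '2.6.0+cu124' into (2, 6)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def installed_version(package: str) -> Optional[str]:
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_compiler() -> Optional[str]:
    """Path of the first C compiler found on PATH (assumed candidate names)."""
    for name in ("gcc", "clang", "cc"):
        path = shutil.which(name)
        if path:
            return path
    return None


def preflight(torch_version: Optional[str], compiler: Optional[str]) -> list[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    if torch_version is None:
        problems.append("torch is not installed")
    elif parse_major_minor(torch_version) < (2, 6):
        problems.append(f"torch {torch_version} is older than 2.6")
    if compiler is None:
        problems.append("no C compiler found on PATH")
    return problems


for problem in preflight(installed_version("torch"), find_compiler()):
    print("preflight:", problem)
```

Running this first turns the confusing downstream errors into a direct message about what is missing.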

Environment:

apt-get update && apt-get install -y build-essential          # gcc, for triton's JIT
pip install -U "torch>=2.6" torchvision --index-url https://download.pytorch.org/whl/cu124
pip install -U "transformers>=4.57" "peft>=0.19" "accelerate>=1.0" \
               "bitsandbytes>=0.49" pillow sentencepiece

Load quantized — keep the vision tower, projector and embeddings out of quantization, use bfloat16 compute (never fp16 — Gemma 3 overflows), and reinstate the stop tokens:

import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

BASE = "lapa-llm/lapa-v0.1.2-instruct"
ADAPTER = "lapa-llm/lapa-ocr-lora"  # this repo

# 4-bit NF4 (~9 GB). For 8-bit (~15 GB), replace load_in_4bit and the bnb_4bit_* options with load_in_8bit=True.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # bf16, NOT fp16
    llm_int8_skip_modules=["vision_tower", "multi_modal_projector", "lm_head", "embed_tokens"],
)

model = AutoModelForImageTextToText.from_pretrained(
    BASE, quantization_config=bnb, torch_dtype=torch.bfloat16,
    device_map="auto", attn_implementation="eager",
)
model = PeftModel.from_pretrained(model, ADAPTER).eval()

# Some load paths drop the generation config (runtime eos_token_id=None); reinstate it.
model.generation_config.eos_token_id = [1, 106]   # <eos>, <end_of_turn>
model.generation_config.pad_token_id = 0

processor = AutoProcessor.from_pretrained(BASE)
PROMPT = "Transcribe Ukrainian text literally. Output only the text, no preamble."
img = Image.open("crop.png").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": img}, {"type": "text", "text": PROMPT}]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    gen = model.generate(**inputs, max_new_tokens=256, do_sample=False,
                         eos_token_id=[1, 106], pad_token_id=0)
print(processor.decode(gen[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip())
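The failure symptom described earlier in this section (empty output or a run of pad tokens) is easy to detect programmatically before the text reaches downstream code. This is a small sketch; the helper name is hypothetical, and pad id 0 is taken from the snippet above.

```python
def is_degenerate(new_token_ids, pad_token_id: int = 0) -> bool:
    """True if generation produced nothing, or nothing but the pad token."""
    ids = list(new_token_ids)
    return all(token == pad_token_id for token in ids)  # also True for an empty list


# With the tensors from the snippet above, the call would look like:
#   new_ids = gen[0, inputs["input_ids"].shape[1]:].tolist()
#   if is_degenerate(new_ids):
#       raise RuntimeError("only pad tokens generated; check torch version and compiler")
print(is_degenerate([0, 0, 0]))
print(is_degenerate([2048, 5123, 106]))
```

Raising on this condition surfaces an environment problem immediately instead of silently writing empty transcriptions.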

Verified on an RTX 3090 24 GB (an A10G analog), 50 handwritten crops from UkrainianCatholicUniversity/rukopys:

Mode	GPU memory	Handwritten CER	Exact-match
bf16 (reference)	~24 GB (does not fit 24 GB)	0.113	47.7%
8-bit (LLM.int8)	~15 GB	0.199	46.0%
4-bit NF4	~9 GB	0.186	42.0%

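Use 8-bit when exact-match rate matters most (46.0%, closest to the 47.7% of bf16) and 4-bit when VRAM is tight. On this 50-crop sample 4-bit has the slightly lower CER (0.186 vs 0.199), so the CER gap between the two modes is likely within noise. The bf16 row repeats the internal-validation figures from Results rather than a measurement on these 50 crops (47.7% is not attainable on 50 items), so treat the comparison as indicative.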
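The memory column can be sanity-checked with back-of-the-envelope arithmetic. The sketch assumes a nominal 12e9 parameters (implied by "12B"; the exact count differs) and counts weights only.

```python
PARAMS = 12e9  # nominal parameter count, an assumption based on the "12B" name


def weight_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Weight storage in GB (1e9 bytes) if every parameter used this many bits."""
    return params * bits_per_param / 8 / 1e9


for name, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_gb(bits):.0f} GB of weights")
# bf16: ~24 GB of weights
# 8-bit: ~12 GB of weights
# 4-bit: ~6 GB of weights
```

The measured ~15 GB and ~9 GB sit a few GB above these weight-only estimates, presumably because the modules excluded from quantization stay in bf16 and because activations, the KV cache and quantization metadata add runtime overhead.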

Training

Base: lapa-llm/lapa-v0.1.2-instruct (vision tower frozen; text decoder adapted)
Method: LoRA (PEFT) — r=64, alpha=128, dropout=0.05, bias=none
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Task type: CAUSAL_LM
Epochs: 5 · LR: 1e-4 · batch: 2 × grad-accum 4 · max_seq_len: 1024
Precision: bf16 · Hardware: 1× H100 80GB
Data: Ukrainian handwritten / printed text crops with literal transcriptions.
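For readers who want to reproduce a similar setup, the hyperparameters above map onto a PEFT LoraConfig roughly as follows. This is a reconstruction from the list, not the author's training script; in particular, how the vision tower was kept out of the target modules is not stated here and is left out of the sketch.

```python
# Keyword arguments for peft.LoraConfig, copied from the hyperparameter list above.
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",
}

PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4

# Sequences contributing to each optimizer step on the single GPU listed above.
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS

# Standard LoRA scales the low-rank update by lora_alpha / r.
LORA_SCALE = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]


def build_lora_config():
    """Build the config; peft is imported lazily so the constants work without it."""
    from peft import LoraConfig
    return LoraConfig(**LORA_KWARGS)


print(f"effective batch = {EFFECTIVE_BATCH}, LoRA scale = {LORA_SCALE}")
```

With alpha set to twice the rank, the adapter update is scaled by 2.0, and gradient accumulation gives an effective batch of 8 sequences per optimizer step.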

Framework versions

PEFT 0.19.1
Transformers (Gemma-3 support: ≥ 4.50; for 24 GB quantized inference use ≥ 4.57 with torch ≥ 2.6)

Model tree for VmF0x/lapa-ocr-lora

google/gemma-3-12b-pt (base model) → lapa-llm/lapa-12b-pt (finetuned) → lapa-llm/lapa-v0.1.2-instruct (finetuned) → this model (adapter)