Lapa Ukrainian Handwriting OCR — LoRA Adapter
LoRA adapter on top of lapa-llm/lapa-v0.1.2-instruct
(a Gemma-3-12B Ukrainian vision-language model) for Ukrainian handwritten-text
recognition (HTR / OCR) on document crops.
The base Lapa model, applied zero-shot to handwriting crops, tends to paraphrase rather than transcribe literally. This adapter retrains the text decoder to emit a literal transcription of the text in the image. It was developed as an OCR component for a Ukrainian HTR pipeline (handwritten + printed regions, math formulas).
Results (internal validation)
| Metric | Base Lapa (bf16) | + this LoRA |
|---|---|---|
| Handwritten CER | 3.28 | 0.113 |
| Handwritten exact-match | 1.3% | 47.7% |
| Printed CER | 1.08 | 0.187 |
CER > 1 on the base reflects heavy paraphrasing (output far longer than ground truth). The adapter removes that behavior and produces faithful transcriptions.
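To see how a CER above 1 can arise, here is a minimal sketch of the metric. It assumes the usual definition (character-level edit distance divided by reference length); the card does not state which normalization was used, and the example strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # drop ca from a
                curr[j - 1] + 1,           # add cb to a
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Illustrative strings, not taken from the validation set.
reference = "Київ"
faithful = "Київ"
paraphrase = "На зображенні написано слово Київ"

print(cer(reference, faithful))    # literal transcription
print(cer(reference, paraphrase))  # wordy output: many extra characters per reference character
```

Because the denominator is the reference length, every surplus character in the output adds to the numerator, so a long paraphrase of a short crop can push CER well past 1.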
Intended use
- Transcribing Ukrainian handwritten / printed text crops (region-level images, not full pages) into plain text.
- As a cross-vote / ensemble OCR partner alongside other VLMs.
Not tuned for: full-page layout, non-Ukrainian scripts, or marginal / very low-quality regions (CER rises to ~0.55 on hard, low-confidence regions).
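The card does not describe how the cross-vote is carried out. One simple option, sketched here purely as an assumption, is to collect one transcription per model and keep the one that agrees most with the others. The helper name `pick_consensus` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def pick_consensus(candidates: list[str]) -> str:
    """Return the candidate with the highest mean similarity to the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def mean_similarity(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(
            SequenceMatcher(None, candidates[i], other).ratio() for other in others
        ) / len(others)

    # max() keeps the first index on ties, so earlier models win ties.
    best = max(range(len(candidates)), key=mean_similarity)
    return candidates[best]


# Hypothetical outputs from three OCR models for the same crop.
votes = ["Добрий день", "Добрий день", "Добрий пень"]
print(pick_consensus(votes))
```

This picks a whole transcription rather than merging characters across models, which keeps the sketch short; a character-level alignment vote would be a different design.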
How to use
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor
BASE = "lapa-llm/lapa-v0.1.2-instruct"
ADAPTER = "VmF0x/lapa-ocr-lora"  # this repo
base = AutoModelForImageTextToText.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)
model = PeftModel.from_pretrained(base, ADAPTER).eval()
processor = AutoProcessor.from_pretrained(BASE)
PROMPT = "Transcribe Ukrainian text literally. Output only the text, no preamble."
img = Image.open("crop.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": PROMPT},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt", padding=True,
).to(model.device, dtype=torch.bfloat16)
with torch.inference_mode():
    gen = model.generate(**inputs, max_new_tokens=256, do_sample=False, num_beams=1)
text = processor.batch_decode(
    gen[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0].strip()
print(text)
Run on a single 24 GB GPU (A10G / RTX 3090 / A5000 / L4)
The base is a 12B model — bf16 weights are ~24 GB and will not fit a 24 GB card alongside the KV cache, so you must quantize. Naive 4/8-bit loading often produces empty output / a repeated pad token (id 0). That symptom is almost always an environment issue, not a model or adapter defect. Two requirements are easy to miss:
- `torch >= 2.6` is mandatory. With `transformers` 4.57, Gemma 3 builds its bidirectional image-attention mask with `or_mask_function`, which raises `ValueError: Using or_mask_function ... require torch>=2.6` on torch 2.5.x, for both `eager` and `sdpa`. On older torch every forward pass dies and you get empty / garbage output.
- A C compiler must be present. `bitsandbytes` ≥ 0.49 pulls `triton`, which JIT-compiles a CUDA helper at import. Without `gcc` you get `RuntimeError: Failed to find C compiler`, surfaced confusingly as `ModuleNotFoundError: validate_bnb_backend_availability`.
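Both requirements can be checked before loading any weights. The following is a rough preflight sketch using only the standard library; the function names are hypothetical, and probing `gcc`, `cc` and `clang` on `PATH` is a heuristic assumption rather than a reproduction of triton's own compiler lookup.

```python
import re
import shutil
from importlib import metadata


def parse_major_minor(version: str) -> tuple[int, int]:
    """Turn a version string such as '2.6.0+cu124' into (2, 6)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def preflight(min_torch: tuple[int, int] = (2, 6)) -> list[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    try:
        # Reads package metadata only, so torch itself is not imported.
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("torch is not installed")
    else:
        if parse_major_minor(torch_version) < min_torch:
            problems.append(f"torch {torch_version} is older than {min_torch[0]}.{min_torch[1]}")
    # Heuristic: any of these on PATH is taken to mean a C compiler is available.
    if not any(shutil.which(cc) for cc in ("gcc", "cc", "clang")):
        problems.append("no C compiler found on PATH (e.g. install build-essential)")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("PROBLEM:", problem)
```

Running it in the target environment and seeing no output suggests the two pitfalls above are not the cause of an empty generation.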
Environment:
apt-get update && apt-get install -y build-essential # gcc, for triton's JIT
pip install -U "torch>=2.6" torchvision --index-url https://download.pytorch.org/whl/cu124
pip install -U "transformers>=4.57" "peft>=0.19" "accelerate>=1.0" \
"bitsandbytes>=0.49" pillow sentencepiece
Load quantized — keep the vision tower, projector and embeddings out of quantization, use bfloat16 compute (never fp16 — Gemma 3 overflows), and reinstate the stop tokens:
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
BASE = "lapa-llm/lapa-v0.1.2-instruct"
ADAPTER = "VmF0x/lapa-ocr-lora"  # this repo
# 4-bit NF4 (~9 GB). For ~bf16 fidelity use load_in_8bit=True instead (~15 GB).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16, NOT fp16
    llm_int8_skip_modules=["vision_tower", "multi_modal_projector", "lm_head", "embed_tokens"],
)
model = AutoModelForImageTextToText.from_pretrained(
    BASE, quantization_config=bnb, torch_dtype=torch.bfloat16,
    device_map="auto", attn_implementation="eager",
)
model = PeftModel.from_pretrained(model, ADAPTER).eval()
# Some load paths drop the generation config (runtime eos_token_id=None); reinstate it.
model.generation_config.eos_token_id = [1, 106]  # <eos>, <end_of_turn>
model.generation_config.pad_token_id = 0
processor = AutoProcessor.from_pretrained(BASE)
PROMPT = "Transcribe Ukrainian text literally. Output only the text, no preamble."
img = Image.open("crop.png").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": img}, {"type": "text", "text": PROMPT}]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
with torch.inference_mode():
    gen = model.generate(**inputs, max_new_tokens=256, do_sample=False,
                         eos_token_id=[1, 106], pad_token_id=0)
print(processor.decode(gen[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip())
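The failure symptom described earlier (empty output or a run of pad token id 0) is easy to detect programmatically before trusting a transcription. A small sketch follows; the helper name `looks_degenerate` is hypothetical and it works on a plain list of newly generated token ids.

```python
def looks_degenerate(new_token_ids, pad_token_id: int = 0) -> bool:
    """True if generation produced nothing, or nothing but the pad token."""
    ids = list(new_token_ids)
    return len(ids) == 0 or all(token == pad_token_id for token in ids)


# With the objects from the snippet above, the new tokens would be obtained as:
#   new_ids = gen[0, inputs["input_ids"].shape[1]:].tolist()
print(looks_degenerate([0, 0, 0, 0]))   # pad-only output
print(looks_degenerate([2063, 15, 106]))  # ordinary output ending in <end_of_turn>
```

A positive result points toward the environment checks (torch version, C compiler) rather than toward the adapter.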
Verified on an RTX 3090 24 GB (comparable to an A10G) with 50 handwritten crops from
UkrainianCatholicUniversity/rukopys:
| Mode | GPU memory | Handwritten CER | Exact-match |
|---|---|---|---|
| bf16 (reference) | ~24 GB (does not fit 24 GB) | 0.113 | 47.7% |
| 8-bit (LLM.int8) | ~15 GB | 0.199 | 46.0% |
| 4-bit NF4 | ~9 GB | 0.186 | 42.0% |
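Use 8-bit for exact-match accuracy closest to bf16 (46.0% vs. 47.7%), and 4-bit when VRAM is tight. On this 50-crop sample the two quantized modes are close in CER (0.199 vs. 0.186), so the difference between them may be within sampling noise.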
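The memory figures can be sanity-checked with back-of-the-envelope arithmetic on the weights alone. This sketch assumes a nominal 12e9 parameters (taken from the "12B" label, not an exact count) and ignores activations, KV cache and runtime overhead.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 12e9  # nominal count implied by "12B"; an assumption, not a measurement

for name, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_gb(N_PARAMS, bits):.0f} GB of weights")
```

The measured ~15 GB and ~9 GB sit above the 12 GB and 6 GB this estimate gives, which is plausibly explained by the modules kept out of quantization (vision tower, projector, embeddings, `lm_head`) plus activations and the KV cache.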
Training
- Base: `lapa-llm/lapa-v0.1.2-instruct` (vision tower frozen; text decoder adapted)
- Method: LoRA (PEFT), r=64, alpha=128, dropout=0.05, bias=none
- Target modules: `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`
- Task type: `CAUSAL_LM`
- Epochs: 5 · LR: 1e-4 · batch: 2 × grad-accum 4 · max_seq_len: 1024
- Precision: bf16 · Hardware: 1× H100 80GB
- Data: Ukrainian handwritten / printed text crops with literal transcriptions.
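For readers who want to reproduce a similar setup, the hyperparameters listed above map onto PEFT's `LoraConfig` roughly as follows. This is a sketch assembled from the list, not the author's training script; the variable names are hypothetical, and the `peft` import is optional so the arithmetic runs without it.

```python
# Values copied from the Training list; everything else is illustrative.
LORA_KWARGS = dict(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# Standard LoRA scales the low-rank update by alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]

# Per-device batch 2 with 4 gradient-accumulation steps on a single GPU.
effective_batch = 2 * 4

try:
    from peft import LoraConfig
    lora_config = LoraConfig(**LORA_KWARGS)
except ImportError:
    lora_config = None  # peft not installed; the kwargs above still document the setup

print(f"LoRA scaling: {scaling}, effective batch size: {effective_batch}")
```

How the vision tower was frozen is not specified in the card, so that step is left out here.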
Framework versions
- PEFT 0.19.1
- Transformers (Gemma-3 support: ≥ 4.50; for 24 GB quantized inference use ≥ 4.57 with torch ≥ 2.6)