TinyDoc-VLM-256M

Smallest document AI that actually works. 256M params. Runs on a MacBook. Apache 2.0.


What is this?

A 256M-parameter vision-language model that reads documents: invoices, receipts, forms, tables, charts. It extracts structured data, answers questions, and parses tables, all from a single model that runs on CPU.

Why does this exist? Most document AI models have 7B+ parameters and need expensive GPUs. TinyDoc-VLM fits in <1GB of VRAM and runs on a MacBook Air, a Raspberry Pi 5, or any CPU via ONNX.

Quick Start

pip install tinydoc

from PIL import Image
from tinydoc import TinyDocExtractor

extractor = TinyDocExtractor(device="cpu")

# Ask questions
img = Image.open("invoice.png")
result = extractor.ask(img, "What is the total?")
print(result.answer)  # "$1,234.56"

# Extract structured JSON
result = extractor.extract(img, output_format="json")
print(result.fields)  # {"total": "$1,234.56", "date": "2024-01-15", ...}

# Extract tables
result = extractor.extract_table(img)
print(result.markdown)
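
The extracted `fields` values in the example above are display strings such as "$1,234.56". A minimal sketch of normalizing them for downstream use, assuming the dictionary shape shown in the Quick Start comment. `parse_amount` and `normalize_fields` are hypothetical helpers for illustration and are not part of the tinydoc package:

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Convert a display string like "$1,234.56" to Decimal, or None if it does not parse."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def normalize_fields(fields):
    """Return a copy of `fields` with "total" as Decimal and "date" as datetime.date.

    Assumes the keys shown in the Quick Start comment ("total", "date");
    all other keys are passed through unchanged.
    """
    out = dict(fields)
    if "total" in out:
        out["total"] = parse_amount(out["total"])
    if "date" in out:
        try:
            out["date"] = date.fromisoformat(out["date"])
        except ValueError:
            out["date"] = None  # not an ISO date; leave it for manual review
    return out


if __name__ == "__main__":
    print(normalize_fields({"total": "$1,234.56", "date": "2024-01-15"}))
```

Using `Decimal` rather than `float` avoids rounding surprises when amounts are summed or compared later.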

Direct Model Access

from tinydoc_vlm import TinyDocVLMForConditionalGeneration, TinyDocVLMProcessor

model = TinyDocVLMForConditionalGeneration.from_pretrained("eulogik/TinyDoc-VLM-256M")
processor = TinyDocVLMProcessor()

Architecture

Image (384×384)
    ↓
SigLIP Vision Encoder (93M)          ← 576 patches × 768 dim
    ↓
Pixel-Shuffle Compressor (scale=3)   ← 9× compression → 64 tokens
    ↓
Visual Position Embeddings
    ↓
SmolLM2 Decoder (135M)               ← 30 layers, GQA (9:3 heads), 8192 ctx
    ↓
Multi-Task Output Heads
    ↓
JSON / KV Extraction / Table / OCR / QA

Total: 256M params | Vision: 93M | Compressor: 3M | Decoder: 135M | Heads: 25M
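
The token counts in the diagram follow from simple arithmetic, sketched below. The patch size of 16 is an assumption (it is the value that gives 576 patches for a 384×384 input), and `visual_token_budget` is a hypothetical helper, not a function from this repository. The diagram does not say whether the compressor projects the merged channels back down, so the channel count shown is the width before any projection:

```python
def visual_token_budget(image_size=384, patch_size=16, shuffle_scale=3, hidden_dim=768):
    """Return (patches, tokens, channels_per_token) for a square input image.

    patch_size=16 is an assumption, chosen because it reproduces the
    576 patches quoted in the architecture diagram.
    """
    side = image_size // patch_size        # patches per side: 384 / 16 = 24
    patches = side * side                  # 24 * 24 = 576
    if side % shuffle_scale:
        raise ValueError("patch grid must be divisible by the shuffle scale")
    factor = shuffle_scale ** 2            # scale=3 merges each 3x3 block of patches
    tokens = patches // factor             # 576 / 9 = 64
    channels = hidden_dim * factor         # 768 * 9 = 6912 before any projection
    return patches, tokens, channels


# Parameter budget from the "Total" line, in millions.
PARAMS_M = {"vision": 93, "compressor": 3, "decoder": 135, "heads": 25}

if __name__ == "__main__":
    print(visual_token_budget())       # (576, 64, 6912)
    print(sum(PARAMS_M.values()))      # 256
```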

LoRA Fine-tuning

Train on your own documents with LoRA: only 2.7M params (0.93%) are trainable.
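
For intuition on why so few parameters are trainable, here is a rough sketch of how a LoRA parameter count is computed. The rank, layer width, and choice of adapted projections below are hypothetical and are not the project's actual LoRA settings; only the layer count of 30 comes from the architecture section:

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(trainable, total):
    """Share of parameters that receive gradients, as a percentage."""
    return 100.0 * trainable / total


if __name__ == "__main__":
    # Hypothetical configuration for illustration only:
    # rank 16 on four 576x576 projections in each of 30 decoder layers.
    per_layer = 4 * lora_params(576, 576, rank=16)
    total_lora = 30 * per_layer
    print(total_lora)  # 2211840
    print(round(trainable_fraction(total_lora, 256_000_000), 2))
```

Because the adapter cost grows with `rank * (d_in + d_out)` rather than `d_in * d_out`, the trainable share stays around one percent even when every decoder layer is adapted.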

# Generate synthetic docs
python data/synthetic/generator.py --num-docs 1000 --output-dir data/synthetic/output

# Train on M4 Mac (~4.6 hours for 5K steps)
python training/fast_train.py --steps 5000 --device mps

# Train on GPU (~1 hour for 5K steps)
python training/fast_train.py --steps 5000 --device cuda

Colab notebook: training/colab_train.ipynb

Training Results

| Metric | Value |
|---|---|
| Best checkpoint | Step 14,000 (loss: 15.0) |
| Training data | 3,000 synthetic docs (6,815 QA pairs) |
| Training time | 15.1 hours on M4 Mac |
| LoRA adapter | eulogik/TinyDoc-VLM-LoRA |

Deployment

ONNX (Recommended for Production)

python export/export_onnx.py --model-path eulogik/TinyDoc-VLM-256M --output model.onnx

ONNX files on HF Hub:

  • tinydoc-vlm-vision.onnx: Vision encoder (33KB)
  • tinydoc-vlm-compressor.onnx: Token compressor (31KB)
  • tinydoc-vlm-decoder.onnx: Language decoder (59MB)
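
The input and output tensor names of these files are not documented here, so a reasonable first step is to inspect them. A minimal sketch using ONNX Runtime, assuming the three files have already been downloaded to the working directory; `load_session` and `describe_io` are hypothetical helper names:

```python
import os


def load_session(path):
    """Open an ONNX file on CPU. Requires `pip install onnxruntime`."""
    import onnxruntime as ort  # imported lazily so describe_io has no hard dependency

    return ort.InferenceSession(path, providers=["CPUExecutionProvider"])


def describe_io(session):
    """Return {"inputs": [...], "outputs": [...]} as (name, shape, type) tuples."""
    def as_tuple(node):
        return (node.name, node.shape, node.type)

    return {
        "inputs": [as_tuple(n) for n in session.get_inputs()],
        "outputs": [as_tuple(n) for n in session.get_outputs()],
    }


if __name__ == "__main__":
    # File names as listed above; their local location is an assumption.
    names = (
        "tinydoc-vlm-vision.onnx",
        "tinydoc-vlm-compressor.onnx",
        "tinydoc-vlm-decoder.onnx",
    )
    for name in names:
        if os.path.exists(name):
            print(name, describe_io(load_session(name)))
```

The printed names and shapes show what each stage expects, which is needed before chaining the vision encoder, the compressor, and the decoder by hand.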

HuggingFace Spaces

Live demo: huggingface.co/spaces/eulogik/TinyDoc-VLM

Benchmarks

| Benchmark | Status | Target |
|---|---|---|
| OCRBench | In progress | >75% |
| DocVQA | Pending | >85% |
| FUNSD | Pending | >95% |

What can it do?

  • Invoice processing: Extract line items, totals, dates, vendor info
  • Receipt scanning: Parse store receipts, extract amounts
  • Form understanding: Read forms, extract field-value pairs
  • Table extraction: Convert tables to structured data
  • Document Q&A: Ask questions about any document
  • OCR: Read printed text from images
  • Chart understanding: Extract data from charts and graphs

Citation

@software{eulogik_tinydoc_vlm_2026,
  author = {eulogik},
  title = {TinyDoc-VLM: 256M-Param Document-Specialist Vision-Language Model},
  year = {2026},
  url = {https://github.com/eulogik/TinyDoc-VLM}
}

License

Apache 2.0. Free for commercial use.


Built by eulogik: AI infrastructure for document intelligence.
