TinyDoc-VLM-256M

A small document AI model built for everyday hardware. 256M params. Runs on a MacBook. Apache 2.0.

What is this?

A 256M-parameter vision-language model that reads documents: invoices, receipts, forms, tables, charts. It extracts structured data, answers questions, and parses tables — all from a single model that runs on CPU.

Why does this exist? Many document AI models have 7B+ params and need expensive GPUs. TinyDoc-VLM fits in <1GB VRAM and runs on a MacBook Air, Raspberry Pi 5, or any CPU that can run ONNX Runtime.
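As a rough sanity check on the <1GB figure, the sketch below estimates the memory taken by the weights alone at a few common precisions. It uses only the 256M parameter count stated above. The precision list is an illustration, not a statement of which formats are shipped, and activations, the KV cache, and runtime overhead are not counted.

```python
# Weights-only memory estimate for a 256M-parameter model.
# Activations, the KV cache, and runtime overhead are NOT included.
PARAMS = 256_000_000  # parameter count stated in this README

# Illustrative precisions; which ones are actually published is not assumed here.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_footprint_gib(num_params: int, dtype: str) -> float:
    """Return the size of the weights alone, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_footprint_gib(PARAMS, dtype):.2f} GiB")
```

At float32 the weights come to roughly 0.95 GiB, so lower-precision weights leave more headroom on small devices.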

Quick Start

pip install tinydoc

from PIL import Image
from tinydoc import TinyDocExtractor

extractor = TinyDocExtractor(device="cpu")

# Ask questions
img = Image.open("invoice.png")
result = extractor.ask(img, "What is the total?")
print(result.answer)  # "$1,234.56"

# Extract structured JSON
result = extractor.extract(img, output_format="json")
print(result.fields)  # {"total": "$1,234.56", "date": "2024-01-15", ...}

# Extract tables
result = extractor.extract_table(img)
print(result.markdown)
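The answers in the examples above are strings such as "$1,234.56", while downstream code usually wants a number. The helper below is a minimal post-processing sketch and is not part of tinydoc: parse_amount is a hypothetical name, and it assumes US-style formatting (comma as thousands separator, dot as decimal point) and unsigned amounts.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_amount(text: str) -> Optional[Decimal]:
    """Convert a string such as "$1,234.56" to Decimal("1234.56").

    Hypothetical helper, not part of tinydoc. Reads the first unsigned
    number in the text and returns None when there is none.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Strip thousands separators before handing the digits to Decimal.
    return Decimal(match.group(0).replace(",", ""))


print(parse_amount("$1,234.56"))  # 1234.56
```

Decimal is used instead of float so that currency values are not subject to binary rounding.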

Direct Model Access

from tinydoc_vlm import TinyDocVLMForConditionalGeneration, TinyDocVLMProcessor

model = TinyDocVLMForConditionalGeneration.from_pretrained("eulogik/TinyDoc-VLM-256M")
processor = TinyDocVLMProcessor()

Architecture

Image (384×384)
    ↓
SigLIP Vision Encoder (93M)          ← 576 patches × 768 dim
    ↓
Pixel-Shuffle Compressor (scale=3)   ← 9× compression → 64 tokens
    ↓
Visual Position Embeddings
    ↓
SmolLM2 Decoder (135M)               ← 30 layers, GQA (9:3 heads), 8192 ctx
    ↓
Multi-Task Output Heads
    ↓
JSON / KV Extraction / Table / OCR / QA

Total: 256M params | Vision: 93M | Compressor: 3M | Decoder: 135M | Heads: 25M
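The token counts in the diagram follow from simple arithmetic: a 384×384 image cut into 16×16 patches gives a 24×24 grid (576 patches), and merging every 3×3 block of neighbouring patches leaves 8×8 = 64 tokens. The NumPy sketch below shows one common way to do that merge (space-to-depth). It is an illustration only: the channel ordering and any learned projection inside the real compressor are not documented here, and pixel_shuffle_compress is a hypothetical name.

```python
import numpy as np


def pixel_shuffle_compress(tokens: np.ndarray, scale: int) -> np.ndarray:
    """Merge each scale x scale block of neighbouring patch tokens into one token.

    tokens: (num_patches, dim) array laid out row-major on a square grid.
    Returns an array of shape (num_patches / scale**2, dim * scale**2).
    """
    num_patches, dim = tokens.shape
    side = int(round(num_patches ** 0.5))
    if side * side != num_patches or side % scale != 0:
        raise ValueError("patch grid must be square and divisible by scale")
    out_side = side // scale
    # Split rows and columns into (block index, offset inside block).
    grid = tokens.reshape(out_side, scale, out_side, scale, dim)
    # Move both in-block offsets next to the channel axis, then flatten them.
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(out_side * out_side, scale * scale * dim)


image_size, patch_size = 384, 16   # patch size 16 is taken from the SigLIP-B/16 name
side = image_size // patch_size    # 24 patches per side
patches = np.zeros((side * side, 768), dtype=np.float32)
compressed = pixel_shuffle_compress(patches, scale=3)
print(patches.shape, "->", compressed.shape)  # (576, 768) -> (64, 6912)
```

Each compressed token carries 9× the channels of a single patch, which is why a projection back to the decoder width would normally follow. That step is assumed here rather than shown.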

LoRA Fine-tuning

Train on your own documents with LoRA — only 2.7M params (0.93%) are trainable.
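For readers new to LoRA, the sketch below shows why so few parameters need training: the original weight stays frozen and only two small low-rank matrices are learned. It is a generic NumPy illustration, not the code in training/fast_train.py. The layer size, rank, and alpha are assumptions chosen for the example, and LoRALinear is a hypothetical name.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float, seed: int = 0):
        d_out, d_in = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                               # frozen, shape (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable, zero-initialised
        self.scaling = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scaling * update

    def trainable_params(self) -> int:
        return self.A.size + self.B.size


# Hypothetical layer: a 576 x 576 projection with rank 8 (both are assumptions).
frozen_weight = np.random.default_rng(1).normal(size=(576, 576))
layer = LoRALinear(frozen_weight, rank=8, alpha=16)
print(layer.trainable_params(), "trainable vs", layer.weight.size, "frozen")
```

Because B starts at zero, the adapted layer initially produces exactly the same output as the frozen one, and training only moves it away from that starting point.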

# Generate synthetic docs
python data/synthetic/generator.py --num-docs 1000 --output-dir data/synthetic/output

# Train on M4 Mac (~4.6 hours for 5K steps)
python training/fast_train.py --steps 5000 --device mps

# Train on GPU (~1 hour for 5K steps)
python training/fast_train.py --steps 5000 --device cuda

Colab notebook: training/colab_train.ipynb

Training Results

Metric	Value
Best checkpoint	Step 14,000 (loss: 15.0)
Training data	3,000 synthetic docs (6,815 QA pairs)
Training time	15.1 hours on M4 Mac
LoRA adapter	eulogik/TinyDoc-VLM-LoRA

Deployment

ONNX (Recommended for Production)

python export/export_onnx.py --model-path eulogik/TinyDoc-VLM-256M --output model.onnx

ONNX files on HF Hub:

tinydoc-vlm-vision.onnx — Vision encoder (33KB)
tinydoc-vlm-compressor.onnx — Token compressor (31KB)
tinydoc-vlm-decoder.onnx — Language decoder (59MB)
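The input and output tensor names of these files are not listed here, so a reasonable first step is to inspect them with ONNX Runtime before wiring up a pipeline. The sketch below assumes onnxruntime is installed and the files have been downloaded locally. describe_onnx_io is a hypothetical helper name.

```python
from typing import List, Tuple


def describe_onnx_io(path: str) -> List[Tuple[str, str, list, str]]:
    """List (kind, name, shape, dtype) for every input and output of an ONNX graph."""
    # Imported lazily so this module loads even where onnxruntime is not installed.
    import onnxruntime as ort

    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    rows = [("input", i.name, i.shape, i.type) for i in session.get_inputs()]
    rows += [("output", o.name, o.shape, o.type) for o in session.get_outputs()]
    return rows
```

Calling describe_onnx_io("tinydoc-vlm-decoder.onnx") on a downloaded file returns one row per tensor, which shows the names and shapes each of the three stages expects.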

HuggingFace Spaces

Live demo: huggingface.co/spaces/eulogik/TinyDoc-VLM

Benchmarks

Benchmark	Status	Target
OCRBench	In progress	>75%
DocVQA	Pending	>85%
FUNSD	Pending	>95%
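DocVQA results are conventionally reported as ANLS (Average Normalized Levenshtein Similarity), which gives partial credit for near-miss answers. Whether the DocVQA target above refers to ANLS is an assumption. The sketch below implements the usual definition with a 0.5 threshold so local predictions can be scored the same way.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions: List[str], references: List[List[str]], threshold: float = 0.5) -> float:
    """Average over questions of the best similarity against any accepted answer."""
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            distance = levenshtein(p, g) / longest if longest else 0.0
            # Answers at or beyond the threshold distance score zero.
            best = max(best, 1.0 - distance if distance < threshold else 0.0)
        total += best
    return total / len(predictions)


print(anls(["$1,234.50"], [["$1,234.56"]]))
```

A single wrong character in a nine-character answer still earns most of the credit, while an unrelated answer scores zero.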

What can it do?

Invoice processing — Extract line items, totals, dates, vendor info
Receipt scanning — Parse store receipts, extract amounts
Form understanding — Read forms, extract field-value pairs
Table extraction — Convert tables to structured data
Document Q&A — Ask questions about any document
OCR — Read printed text from images
Chart understanding — Extract data from charts and graphs

Related Models

eulogik/TinyDoc-VLM-LoRA — LoRA fine-tuned adapter
SmolLM2-135M — Base decoder
SigLIP-B/16 — Base vision encoder

Links

Resource	URL
GitHub	github.com/eulogik/TinyDoc-VLM
PyPI	pypi.org/project/tinydoc
LoRA Checkpoint	huggingface.co/eulogik/TinyDoc-VLM-LoRA
Live Demo	huggingface.co/spaces/eulogik/TinyDoc-VLM

Citation

@software{eulogik_tinydoc_vlm_2026,
  author = {eulogik},
  title = {TinyDoc-VLM: 256M-Param Document-Specialist Vision-Language Model},
  year = {2026},
  url = {https://github.com/eulogik/TinyDoc-VLM}
}

License

Apache 2.0. Free for commercial use.

Built by eulogik — AI infrastructure for document intelligence.
