---
license: other
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- document-ai
- table-extraction
- layouts
- markdown
- html-markdown
- document-retrieval
- visual-grounding
- pdf-ocr
- layout-analysis
---
 
![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/3BlRVBsY8SFY34bBdwICO.png)

# **epsilon-ocr-d.markdown-post3.0.m**

> **epsilon-ocr-d.markdown-post3.0.m** is an experimental multimodal document AI model fine-tuned from **Qwen2.5-VL-3B-Instruct** and optimized for OCR-driven document reconstruction and dynamic Markdown generation. It converts documents into structured **Markdown**, **HTML-Markdown**, and hybrid technical documentation formats with inline code adaptation. At 3B parameters, it targets strong document-parsing performance with reduced compute requirements.

# Key Enhancements

* **Dynamic Markdown and Layout Reconstruction**
  Converts multi-page and complex-layout documents into structured Markdown or HTML-Markdown, preserving hierarchy, formatting, headings, and semantic reading order.

* **Inline Programming Language Support**
  Automatically embeds LaTeX, Python, JavaScript, and shell code blocks within reconstructed documentation for research and technical writing.

* **High-Accuracy OCR and Visual Parsing**
  Extracts text from structured, semi-structured, and unstructured formats. Supports multi-page input and contextual alignment.

* **Complex Structure Understanding**
  Parses tables, forms, graphs, diagrams, multi-column layouts, and mathematical expressions while preserving their structure.

* **Document Retrieval and Semantic Linking**
  Performs cross page reasoning and content referencing for enterprise document workflows.

* **Multimodal Long Document Reasoning**
  Supports long content comprehension for slides, scanned books, handwritten pages, and research papers.
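
Multi-page input, as described in the list above, can be sketched as a single user turn that carries one image entry per page. This is a minimal sketch, not the author's pipeline: the helper name `build_multipage_messages` and the page file names are hypothetical, and it assumes the pages have already been rendered to images. The returned list has the same shape as the `messages` variable in the Quick Start section.

```python
from typing import Dict, List


def build_multipage_messages(page_images: List[str], prompt: str) -> List[Dict]:
    """Build one user turn holding every page image followed by the text prompt."""
    if not page_images:
        raise ValueError("page_images must contain at least one image path or URL")
    # Pages are listed in reading order so the model sees them in sequence.
    content = [{"type": "image", "image": page} for page in page_images]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Hypothetical page files; replace with real paths or URLs.
messages = build_multipage_messages(
    ["page_1.png", "page_2.png"],
    "Convert these pages to Markdown, preserving reading order.",
)
```

Each extra page adds visual tokens, so long documents may need to be processed in batches of a few pages.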

---

> 👉 This is a stage-progression model and may currently contain artifacts.

---

# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/epsilon-ocr-d.markdown-post3.0.m", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/epsilon-ocr-d.markdown-post3.0.m")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
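# Move inputs to the device the model was placed on (device_map="auto" may not select "cuda").
inputs = inputs.to(model.device)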

generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
# batch_decode returns a list with one string per input; print the single result.
print(output_text[0])
```
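
The decoded result is a list of strings, so a small post-processing step is usually needed before saving it as a `.md` file. The sketch below assumes the model may sometimes wrap its whole answer in an outer `markdown` code fence; that is an assumption rather than documented behavior. The helper names `clean_markdown` and `save_markdown` are hypothetical.

```python
import re
from pathlib import Path
from typing import List

# Matches text that is entirely wrapped in one outer ```markdown (or ```md) fence.
_OUTER_FENCE = re.compile(r"^```(?:markdown|md)[ \t]*\n(.*?)\n```\s*$", re.DOTALL)


def clean_markdown(decoded: List[str]) -> str:
    """Join decoded strings and drop a single outer Markdown fence, if present."""
    text = "\n\n".join(part.strip() for part in decoded)
    match = _OUTER_FENCE.match(text)
    return match.group(1) if match else text


def save_markdown(decoded: List[str], path: str) -> Path:
    """Write the cleaned Markdown to disk as UTF-8 and return the path."""
    target = Path(path)
    target.write_text(clean_markdown(decoded) + "\n", encoding="utf-8")
    return target
```

With the Quick Start variables in scope, `save_markdown(output_text, "page.md")` would write the result to a hypothetical `page.md`. Code fences inside the document body are left untouched; only a fence that wraps the entire output is removed.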

# Intended Use

* OCR to Markdown or HTML-Markdown conversion
* Document reconstruction for manuals, books, and research materials
* Table extraction and structural transformation
* Multi-page document retrieval and question answering
* Mathematical OCR and LaTeX generation
* Form extraction and structured entity mapping
* Documentation rebuilding for enterprise knowledge systems
* Automation of digitization and archival systems
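
For the table extraction use case, the generated Markdown can be turned into rows for downstream processing. The following is a minimal sketch that assumes the table was emitted in pipe syntax (not as an HTML `<table>`) and that cells contain no escaped `|` characters. `parse_markdown_table` is a hypothetical helper, not part of any library.

```python
from typing import List


def parse_markdown_table(markdown: str) -> List[List[str]]:
    """Return the first pipe table in `markdown` as rows of cell strings."""
    rows: List[List[str]] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|"):
            if rows:
                break  # the first table has ended
            continue
        # Remove exactly one leading pipe and, if present, one trailing pipe.
        inner = stripped[1:-1] if stripped.endswith("|") else stripped[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the header separator row, e.g. |------|----:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows
```

The first returned row is the header, so `rows[0]` can serve as column names when loading the remaining rows into a CSV writer or a data frame.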

# Limitations

* Accuracy may drop on heavily damaged or extremely low-resolution images
* Reasoning over very long documents is weaker than with larger VL models
* Language coverage varies for low-resource scripts
* Very complex forms may require secondary refinement

# References

* Qwen2.5 VL
  [https://huggingface.co/papers/2502.13923](https://huggingface.co/papers/2502.13923)

* DocVLM Efficient Reader
  [https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)

* YaRN Efficient Context Window Extension
  [https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)

* Qwen2 VL High Resolution Perception
  [https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)

* Qwen VL Vision Language and OCR
  [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)

* OCR Benchmark for Multimodal Models
  [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)