---
license: other
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- document-ai
- table-extraction
- layouts
- markdown
- html-markdown
- document-retrieval
- visual-grounding
- pdf-ocr
- layout-analysis
---

# **epsilon-ocr-d.markdown-post3.0.m**
> **epsilon-ocr-d.markdown-post3.0.m** is an experimental multimodal document AI model fine-tuned from **Qwen2.5-VL-3B-Instruct** and optimized for OCR-driven document reconstruction and dynamic Markdown generation. It converts documents into structured **Markdown**, **HTML-Markdown**, and hybrid technical documentation formats with inline code adaptation. As a 3B-parameter model, it aims for strong performance with lower compute requirements than larger vision-language models.
# Key Enhancements
* **Dynamic Markdown and Layout Reconstruction**
Converts multi-page and complex-layout documents into structured Markdown or HTML-Markdown, preserving hierarchy, formatting, headings, and semantic reading order.
* **Inline Programming Language Support**
Automatically embeds LaTeX, Python, JavaScript, and shell code blocks within reconstructed documentation for research and technical writing.
* **High-Accuracy OCR and Visual Parsing**
Extracts text from structured, semi-structured, and unstructured formats. Supports multi-page input and contextual alignment.
* **Complex Structure Understanding**
Parses tables, forms, graphs, diagrams, multi-column layouts, and mathematical expressions while preserving their structure.
* **Document Retrieval and Semantic Linking**
Performs cross-page reasoning and content referencing for enterprise document workflows.
* **Multimodal Long Document Reasoning**
Supports long content comprehension for slides, scanned books, handwritten pages, and research papers.
---
> 👉 This model is an intermediate checkpoint in a staged training progression, and its output may currently contain artifacts.
---
# Quick Start with Transformers
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and its processor; device_map="auto" places the weights automatically.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/epsilon-ocr-d.markdown-post3.0.m", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/epsilon-ocr-d.markdown-post3.0.m")

# One user turn: the document image followed by the instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

# Render the chat prompt and collect the vision inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the model's device instead of hard-coding "cuda".
inputs = inputs.to(model.device)

# Generate, then drop the prompt tokens from the front of each output sequence.
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
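For multi-page documents, one plausible approach is to place one image entry per page in a single user turn, ahead of the instruction. The sketch below only builds that `messages` list. The helper name `build_page_messages` and its default prompt are illustrative assumptions, not part of `transformers` or `qwen_vl_utils`, and how well this checkpoint handles many pages in one turn has not been verified here.

```python
def build_page_messages(page_paths, prompt="Convert to Markdown."):
    """Build a single-turn message list with one image entry per page.

    `page_paths` is an ordered list of local image paths or URLs, one per
    page. This helper is hypothetical; it only mirrors the message layout
    used in the quick start example.
    """
    if not page_paths:
        raise ValueError("at least one page image is required")

    # Pages go first, in reading order, so the instruction applies to all of them.
    content = [{"type": "image", "image": str(path)} for path in page_paths]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_page_messages(["page_1.png", "page_2.png"])
```

The resulting `messages` can be passed through the same `apply_chat_template`, `process_vision_info`, and `generate` steps as the single-image example. Each page adds visual tokens, so long documents may need to be split into batches of pages.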
# Intended Use
* OCR to Markdown or HTML-Markdown conversion
* Document reconstruction for manuals, books, and research materials
* Table extraction and structural transformation
* Multi-page document retrieval and question answering
* Mathematical OCR and LaTeX generation
* Form extraction and structured entity mapping
* Documentation rebuilding for enterprise knowledge systems
* Automation of digitization and archival systems
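For the table extraction use case, the generated Markdown usually needs to be turned into rows before it can feed a downstream system. The following is a minimal sketch that assumes the model emitted a GitHub-style pipe table; `parse_markdown_table` is a hypothetical helper, it does not handle escaped pipes inside cells, and HTML tables from HTML-Markdown output would need a different parser.

```python
def parse_markdown_table(markdown):
    """Return the first pipe table found in `markdown` as a list of row dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|") and len(line) > 1:
            # Drop the outer pipes, then split the remaining text into cells.
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        elif rows:
            break  # the first table has ended

    if len(rows) < 2:
        return []

    header, body = rows[0], rows[1:]
    # Skip the separator row, whose cells look like --- or :---:.
    body = [r for r in body if not all(c and set(c) <= set("-:") for c in r)]
    return [dict(zip(header, r)) for r in body]


sample = """| Item | Qty |
| --- | ---: |
| Pen | 2 |
| Ink | 10 |"""
records = parse_markdown_table(sample)
```

All cell values come back as strings; converting numeric columns is left to the caller.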
# Limitations
* Accuracy may drop on heavily damaged or extremely low-resolution images
* Weaker than larger VL models at reasoning over very large documents
* Language coverage varies for low-resource scripts
* Very complex forms may require secondary refinement
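One knob that may help with low-resolution inputs and with compute limits is the image pixel budget. The base Qwen2.5-VL usage notes describe `min_pixels` and `max_pixels` processor arguments, with each visual token covering a 28x28 pixel patch. The sketch below only does the arithmetic; `pixel_budget` is a hypothetical helper, and whether a given budget improves this fine-tuned checkpoint is an assumption that has not been tested here.

```python
PATCH_PIXELS = 28 * 28  # pixels per visual token, per the Qwen2.5-VL usage notes


def pixel_budget(min_tokens=256, max_tokens=1280):
    """Translate a visual-token range into min_pixels / max_pixels values."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return {
        "min_pixels": min_tokens * PATCH_PIXELS,
        "max_pixels": max_tokens * PATCH_PIXELS,
    }


budget = pixel_budget(min_tokens=256, max_tokens=1280)
```

The result can be unpacked when loading the processor, for example `AutoProcessor.from_pretrained(model_id, **budget)`. A higher `min_tokens` means small scans are resized up before encoding, while a lower `max_tokens` caps memory use on large pages at the cost of fine detail.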
# References
* Qwen2.5 VL
[https://huggingface.co/papers/2502.13923](https://huggingface.co/papers/2502.13923)
* DocVLM Efficient Reader
[https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)
* YaRN Efficient Context Window Extension
[https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)
* Qwen2 VL High Resolution Perception
[https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)
* Qwen VL Vision Language and OCR
[https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)
* OCR Benchmark for Multimodal Models
[https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)