---
license: other
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- document-ai
- table-extraction
- layouts
- markdown
- html-markdown
- document-retrieval
- visual-grounding
- pdf-ocr
- layout-analysis
---
 
![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/3BlRVBsY8SFY34bBdwICO.png)

# **epsilon-ocr-d.markdown-post3.0.m**

> **epsilon-ocr-d.markdown-post3.0.m** is an experimental multimodal document AI model fine-tuned from **Qwen2.5-VL-3B-Instruct** and optimized for OCR-driven document reconstruction and dynamic Markdown generation. It converts documents into structured **Markdown**, **HTML-Markdown**, and hybrid technical documentation formats with inline code adaptation. At 3B parameters, it aims for strong performance with reduced compute requirements.

# Key Enhancements

* **Dynamic Markdown and Layout Reconstruction**
  Converts multi-page and complex-layout documents into structured Markdown or HTML-Markdown with preserved hierarchy, formatting, headings, and semantic reading order.

* **Inline Programming Language Support**
  Automatically embeds LaTeX, Python, JavaScript, and shell code blocks within reconstructed documentation for research and technical writing.

* **High Accuracy OCR and Visual Parsing**
  Extracts text from structured, semi-structured, and unstructured formats. Supports multi-page input and contextual alignment.

* **Complex Structure Understanding**
  Parses tables, forms, graphs, diagrams, multi-column layouts, and mathematical expressions while preserving their structure.

* **Document Retrieval and Semantic Linking**
  Performs cross-page reasoning and content referencing for enterprise document workflows.

* **Multimodal Long Document Reasoning**
  Supports long content comprehension for slides, scanned books, handwritten pages, and research papers.
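
As a rough illustration of multi-page input, the sketch below builds a chat message list in the same layout the Quick Start section of this card uses, with one image entry per page followed by the text prompt. The helper name `build_page_messages` and the file names are hypothetical, and how many pages fit in a single prompt depends on image resolution and available memory, which this card does not specify.

```python
from typing import Dict, List


def build_page_messages(pages: List[str], prompt: str = "Convert to Markdown.") -> List[Dict]:
    """Return a single user turn: one image entry per page, then the text prompt.

    Hypothetical helper; the dict layout mirrors the Quick Start example.
    """
    if not pages:
        raise ValueError("at least one page image is required")
    # Pages are listed in reading order so the model sees them in sequence.
    content = [{"type": "image", "image": page} for page in pages]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Placeholder page images; replace with real paths or URLs.
messages = build_page_messages(["page-1.png", "page-2.png"])
print(len(messages[0]["content"]))  # 3 entries: two images plus the prompt
```

The resulting `messages` list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the Quick Start.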

---

> 👉 This is a stage-progression model, and it may currently contain artifacts.

---

# Quick Start with Transformers

```python
# Requires: pip install transformers qwen-vl-utils
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "prithivMLmods/epsilon-ocr-d.markdown-post3.0.m"

# Load the weights; device_map="auto" places them on the available device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One user turn: the document image followed by the instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

# Render the chat template and collect the image/video inputs it refers to.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the model's device instead of assuming CUDA is present.
inputs = inputs.to(model.device)

# Generate, then drop the prompt tokens from each output sequence.
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
# batch_decode returns one string per input; print the single result.
print(output_text[0])
```
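
Because the output may contain artifacts, a light cleanup pass can help before the text is saved. The card does not say which artifacts occur, so the sketch below assumes just one possibility: the whole reply arrives wrapped in a single Markdown code fence. The helper name `strip_outer_fence` is hypothetical, and the function deliberately leaves replies alone when the outer fence carries a programming-language tag.

```python
import re

# Matches a reply that is one fenced block from first line to last.
_FENCE = re.compile(r"^```([\w-]*)[ \t]*\n(.*?)\n?```[ \t]*$", re.DOTALL)
# Only these info strings are treated as a wrapper (assumption).
_WRAPPER_TAGS = {"", "markdown", "md"}


def strip_outer_fence(text: str) -> str:
    """Remove one wrapping ```markdown fence, if present; otherwise return the text trimmed."""
    stripped = text.strip()
    match = _FENCE.match(stripped)
    if match and match.group(1).lower() in _WRAPPER_TAGS:
        return match.group(2).strip()
    return stripped


# Illustrative reply, not real model output.
sample = "```markdown\n# Title\n\nBody text.\n```\n"
print(strip_outer_fence(sample))
```

With the Quick Start variables, this could be applied as `[strip_outer_fence(t) for t in output_text]`.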

# Intended Use

* OCR-to-Markdown or HTML-Markdown conversion
* Document reconstruction for manuals, books, and research materials
* Table extraction and structural transformation
* Multi-page document retrieval and question answering
* Mathematical OCR and LaTeX generation
* Form extraction and structured entity mapping
* Documentation rebuilding for enterprise knowledge systems
* Automation of digitization and archival systems
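
For the table extraction use case, downstream code usually needs rows rather than Markdown text. The following is a minimal sketch that assumes the model emitted a pipe-style Markdown table with leading pipes on each row; HTML tables, escaped pipes inside cells, and tables without leading pipes are not handled. The function name `parse_markdown_table` and the sample text are illustrative only.

```python
from typing import List


def parse_markdown_table(markdown: str) -> List[List[str]]:
    """Return the cells of the first pipe-style table found, header row first."""
    rows: List[List[str]] = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if rows:
                break  # first non-table line after the table ends it
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the delimiter row, e.g. |---|:---:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


# Illustrative text, not real model output.
sample = """Parts list

| Item | Qty |
|------|----:|
| Bolt | 4 |
| Nut | 12 |

End of page."""
print(parse_markdown_table(sample))
```

The returned list of lists can be handed to `csv.writer` or a DataFrame constructor for further structural transformation.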

# Limitations

* Accuracy may drop on heavily damaged or extremely low-resolution images
* Weaker than larger VL models at reasoning over very long documents
* Language coverage varies for low-resource scripts
* Very complex forms may require secondary refinement

# References

* Qwen2.5 VL
  [https://huggingface.co/papers/2502.13923](https://huggingface.co/papers/2502.13923)

* DocVLM Efficient Reader
  [https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)

* YaRN Efficient Context Window Extension
  [https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)

* Qwen2 VL High Resolution Perception
  [https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)

* Qwen VL Vision Language and OCR
  [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)

* OCR Benchmark for Multimodal Models
  [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)