---
license: other
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- document-ai
- table-extraction
- layouts
- markdown
- html-markdown
- document-retrieval
- visual-grounding
- pdf-ocr
- layout-analysis
---

![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/3BlRVBsY8SFY34bBdwICO.png)

# **epsilon-ocr-d.markdown-post3.0.m**

> **epsilon-ocr-d.markdown-post3.0.m** is an experimental multimodal document AI model fine-tuned on top of **Qwen2.5-VL-3B-Instruct** and optimized for OCR-driven document reconstruction and dynamic Markdown generation. It converts documents into structured **Markdown**, **HTML-Markdown**, and hybrid technical documentation formats with inline code adaptation. Built on a 3B-parameter base, it targets strong performance with reduced compute requirements.

# Key Enhancements

* **Dynamic Markdown and Layout Reconstruction**
  Converts multi-page and complex-layout documents into structured Markdown or HTML-Markdown with preserved hierarchy, formatting, headings, and semantic reading order.

* **Inline Programming Language Support**
  Automatically embeds LaTeX, Python, JavaScript, and shell code blocks within reconstructed documentation for research and technical writing.

* **High Accuracy OCR and Visual Parsing**
  Extracts text from structured, semi-structured, and unstructured formats. Supports multi-page input and contextual alignment.

* **Complex Structure Understanding**
  Parses tables, forms, graphs, diagrams, multi-column layouts, and mathematical expressions while preserving their structure.

* **Document Retrieval and Semantic Linking**
  Performs cross-page reasoning and content referencing for enterprise document workflows.

* **Multimodal Long Document Reasoning**
  Supports long-content comprehension for slides, scanned books, handwritten pages, and research papers.

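
For the multi-page input mentioned above, one plausible approach is to pass several image entries in a single user message, reusing the message format from the Quick Start section. The sketch below only builds that message list; the helper name `build_page_messages` and the file paths are hypothetical, and how well this checkpoint handles many pages in one prompt is an assumption that should be verified on your own documents.

```python
from typing import Dict, List


def build_page_messages(
    page_images: List[str], prompt: str = "Convert to Markdown."
) -> List[Dict]:
    """Build one user turn holding every page image followed by the instruction."""
    if not page_images:
        raise ValueError("page_images must contain at least one page")
    # One image entry per page, kept in reading order.
    content = [{"type": "image", "image": page} for page in page_images]
    # The text instruction goes last, as in the single-image example.
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Hypothetical local paths, used only for illustration.
messages = build_page_messages(["page_1.png", "page_2.png"])
```

The resulting `messages` list can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as the single-image message in the Quick Start.
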

---

> 👉 This model is a stage progression model, and it may currently contain artifacts.

---

# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and its processor; device_map="auto" picks the device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/epsilon-ocr-d.markdown-post3.0.m",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("prithivMLmods/epsilon-ocr-d.markdown-post3.0.m")

# A single user turn: one image plus the conversion instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

# Render the chat prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Follow the model's device instead of hard-coding "cuda".
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
# Drop the prompt tokens so only the newly generated text is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

# Intended Use

* OCR to Markdown or HTML-Markdown conversion
* Document reconstruction for manuals, books, and research materials
* Table extraction and structural transformation
* Multi-page document retrieval and question answering
* Mathematical OCR and LaTeX generation
* Form extraction and structured entity mapping
* Documentation rebuilding for enterprise knowledge systems
* Automation of digitization and archival systems

# Limitations

* Accuracy may drop on highly damaged or extremely low-resolution images
* Weaker than larger VL models at reasoning over very long documents
* Language coverage varies for low-resource scripts
* Very complex forms may
require secondary refinement

## References

* Qwen2.5 VL [https://huggingface.co/papers/2502.13923](https://huggingface.co/papers/2502.13923)
* DocVLM Efficient Reader [https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)
* YaRN Efficient Context Window Extension [https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)
* Qwen2 VL High Resolution Perception [https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)
* Qwen VL Vision Language and OCR [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)
* OCR Benchmark for Multimodal Models [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)