Improve model card: update library_name, add paper, code, project links and citation

#40
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +67 -51
README.md CHANGED
@@ -1,6 +1,10 @@
  ---
  license: mit
- library_name: dots_ocr
  pipeline_tag: image-text-to-text
  tags:
  - image-to-text
@@ -11,10 +15,6 @@ tags:
  - formula
  - transformers
  - custom_code
- language:
- - en
- - zh
- - multilingual
  ---

  <div align="center">
@@ -24,23 +24,25 @@ language:
  <p>

  <h1 align="center">
- dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
  </h1>

- [![Blog](https://img.shields.io/badge/Blog-View_on_GitHub-333.svg?logo=github)](https://github.com/rednote-hilab/dots.ocr/blob/master/assets/blog.md)
- [![HuggingFace](https://img.shields.io/badge/HuggingFace%20Weights-black.svg?logo=HuggingFace)](https://huggingface.co/rednote-hilab/dots.ocr)


  <div align="center">
- <a href="https://dotsocr.xiaohongshu.com" target="_blank" rel="noopener noreferrer"><strong>🖥️ Live Demo</strong></a> |
- <a href="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/wechat.jpg" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> |
- <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a>
  </div>

  </div>


-
  ## Introduction

  **dots.ocr** is a powerful, multilingual document parser that unifies layout detection and content recognition within a single vision-language model while maintaining good reading order. Despite its compact 1.7B-parameter LLM foundation, it achieves state-of-the-art (SOTA) performance.
@@ -104,8 +106,8 @@ messages = [

  # Preparation for inference
  text = processor.apply_chat_template(
- messages,
- tokenize=False,
  add_generation_prompt=True
  )
  image_inputs, video_inputs = process_vision_info(messages)
@@ -133,11 +135,12 @@ print(output_text)
  ### Performance Comparison: dots.ocr vs. Competing Models
  <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/chart.png" border="0" />

- > **Notes:**
  > - The EN, ZH metrics are the end2end evaluation results of [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and the Multilingual metric is the end2end evaluation result of dots.ocr-bench.


- ## News
  * ```2025.07.30 ``` 🚀 We release [dots.ocr](https://github.com/rednote-hilab/dots.ocr), a multilingual document parsing model based on a 1.7B LLM, with SOTA performance.

@@ -433,7 +436,6 @@ print(output_text)
  <td>0.100</td>
  <td>0.185</td>
  </tr>
- <tr>

  <td rowspan="5"><strong>General<br>VLMs</strong></td>
  <td>GPT4o</td>
@@ -504,7 +506,7 @@ print(output_text)
  <td>0.295</td>
  <td><strong>0.384</strong></td>
  <td>83.3</td>
- <td><strong>89.3</strong></td>
  <td>0.165</td>
  <td><strong>0.085</strong></td>
  <td>0.058</td>
@@ -728,7 +730,7 @@ print(output_text)
  </tbody>
  </table>

- > **Notes:**
  > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR), [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and our own internal evaluations.
  > - We delete the Page-header and Page-footer cells in the result markdown.
  > - We use tikz_preprocess pipeline to upsample the images to dpi 200.
@@ -801,7 +803,7 @@ This is an inhouse benchmark which contain 1493 pdf images with 100 languages.
  </tbody>
  </table>

- > **Notes:**
  > - We use the same metric calculation pipeline of [OmniDocBench](https://github.com/opendatalab/OmniDocBench).
  > - We delete the Page-header and Page-footer cells in the result markdown.

@@ -873,7 +875,7 @@ This is an inhouse benchmark which contain 1493 pdf images with 100 languages.
  </tbody>
  </table>

- > **Notes:**
  > - prompt_layout_all_en for **parse all**, prompt_layout_only_en for **detection only**, please refer to [prompts](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)

@@ -1080,7 +1082,7 @@ This is an inhouse benchmark which contain 1493 pdf images with 100 languages.


  > **Note:**
- > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
  [olmocr](https://github.com/allenai/olmocr), and our own internal evaluations.
  > - We delete the Page-header and Page-footer cells in the result markdown.

@@ -1113,28 +1115,23 @@ pip install -e .
  > 💡**Note:** Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path. This is a temporary workaround pending our integration with Transformers.
  ```shell
  python3 tools/download_model.py
  ```


  ## 2. Deployment
  ### vLLM inference
- We highly recommend using vllm for deployment and inference. All of our evaluations results are based on vllm version 0.9.1.
- The [Docker Image](https://hub.docker.com/r/rednotehilab/dots.ocr) is based on the official vllm image. You can also follow [Dockerfile](https://github.com/rednote-hilab/dots.ocr/blob/master/docker/Dockerfile) to build the deployment environment by yourself.

  ```shell
- # You need to register model to vllm at first
- python3 tools/download_model.py
- export hf_model_path=./weights/DotsOCR # Path to your downloaded model weights, Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path. This is a temporary workaround pending our integration with Transformers.
- export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH
- sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
- from DotsOCR import modeling_dots_ocr_vllm' `which vllm` # If you downloaded model weights by yourself, please replace `DotsOCR` by your model saved directory name, and remember to use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`)
-
- # launch vllm server
- CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --chat-template-content-format string --served-model-name model --trust-remote-code
-
- # If you get a ModuleNotFoundError: No module named 'DotsOCR', please check the note above on the saved model directory name.

- # vllm api demo
  python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en
  ```

@@ -1197,8 +1194,8 @@ messages = [

  # Preparation for inference
  text = processor.apply_chat_template(
- messages,
- tokenize=False,
  add_generation_prompt=True
  )
  image_inputs, video_inputs = process_vision_info(messages)
@@ -1226,6 +1223,10 @@ print(output_text)

  </details>

  ## 3. Document Parse
  **Based on the vLLM server**, you can parse an image or a PDF file using the following commands:
  ```bash
@@ -1234,7 +1235,7 @@ print(output_text)
  # Parse a single image
  python3 dots_ocr/parser.py demo/demo_image1.jpg
  # Parse a single PDF
- python3 dots_ocr/parser.py demo/demo_pdf1.pdf --num_threads 64 # try bigger num_threads for pdf with a large number of pages

  # Layout detection only
  python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en
@@ -1246,6 +1247,9 @@ python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr
  python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_grounding_ocr --bbox 163 241 1536 705

  ```

  <details>
  <summary><b>Output Results</b></summary>
@@ -1294,22 +1298,34 @@ python demo/demo_gradio_annotion.py


  ## Acknowledgments
- We would like to thank [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [aimv2](https://github.com/apple/ml-aim), [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
- [OmniDocBench](https://github.com/opendatalab/OmniDocBench), [PyMuPDF](https://github.com/pymupdf/PyMuPDF), for providing code and models.

- We also thank [DocLayNet](https://github.com/DS4SD/DocLayNet), [M6Doc](https://github.com/HCIILAB/M6Doc), [CDLA](https://github.com/buptlihang/CDLA), [D4LA](https://github.com/AlibabaResearch/AdvancedLiterateMachinery) for providing valuable datasets.

  ## Limitation & Future Work

- - **Complex Document Elements:**
- - **Table&Formula**: dots.ocr is not yet perfect for high-complexity tables and formula extraction.
- - **Picture**: Pictures in documents are currently not parsed.

- - **Parsing Failures:** The model may fail to parse under certain conditions:
- - When the character-to-pixel ratio is excessively high. Try enlarging the image or increasing the PDF parsing DPI (a setting of 200 is recommended). However, please note that the model performs optimally on images with a resolution under 11289600 pixels.
- - Continuous special characters, such as ellipses (`...`) and underscores (`_`), may cause the prediction output to repeat endlessly. In such scenarios, consider using alternative prompts like `prompt_layout_only_en`, `prompt_ocr`, or `prompt_grounding_ocr` ([details here](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)).
-
- - **Performance Bottleneck:** Despite its 1.7B parameter LLM foundation, **dots.ocr** is not yet optimized for high-throughput processing of large PDF volumes.

  We are committed to achieving more accurate table and formula parsing, as well as enhancing the model's OCR capabilities for broader generalization, all while aiming for **a more powerful, more efficient model**. Furthermore, we are actively considering the development of **a more general-purpose perception model** based on Vision-Language Models (VLMs), which would integrate general detection, image captioning, and OCR tasks into a unified framework. **Parsing the content of the pictures in the documents** is also a key priority for our future work.
- We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of document intelligence and are interested in contributing to these future endeavors, we would love to hear from you. Please reach out to us via email at: [[email protected]].

  ---
+ language:
+ - en
+ - zh
+ - multilingual
+ library_name: transformers
  license: mit
  pipeline_tag: image-text-to-text
  tags:
  - image-to-text
  - formula
  - transformers
  - custom_code
  ---

  <div align="center">
  <p>

  <h1 align="center">
+ <a href="https://huggingface.co/papers/2512.02498">dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model</a>
  </h1>

+ <p align="center">
+ <a href="https://huggingface.co/papers/2512.02498"><img src="https://img.shields.io/badge/Paper-HF_Link-b31b1b.svg" alt="Hugging Face Paper"></a>
+ <a href="https://github.com/rednote-hilab/dots.ocr"><img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github" alt="GitHub Code"></a>
+ <a href="https://dotsocr.xiaohongshu.com"><img src="https://img.shields.io/badge/Project_Page-Live_Demo-green.svg" alt="Live Demo"></a>
+ </p>


  <div align="center">
+ <a href="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/wechat.jpg" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> |
+ <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a> |
+ <a href="https://x.com/rednotehilab" target="_blank" rel="noopener noreferrer"><strong>🐦 X</strong></a>
  </div>

  </div>


  ## Introduction

  **dots.ocr** is a powerful, multilingual document parser that unifies layout detection and content recognition within a single vision-language model while maintaining good reading order. Despite its compact 1.7B-parameter LLM foundation, it achieves state-of-the-art (SOTA) performance.
 

  # Preparation for inference
  text = processor.apply_chat_template(
+ messages,
+ tokenize=False,
  add_generation_prompt=True
  )
  image_inputs, video_inputs = process_vision_info(messages)
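For readers landing on this fragment of the diff, here is a minimal end-to-end Transformers inference sketch around the two changed lines above. It assumes the Qwen2-VL-style processor API that the surrounding README code implies (AutoProcessor plus `qwen_vl_utils.process_vision_info`) and `trust_remote_code=True` for the custom model code; the model id and prompt text are placeholders, with the real prompts kept in dots_ocr/utils/prompts.py.

```python
# Hedged sketch: minimal Transformers inference for dots.ocr.
# The prompt below is a placeholder; real prompts live in dots_ocr/utils/prompts.py.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "rednote-hilab/dots.ocr"  # assumed model id
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "demo/demo_image1.jpg"},
            {"type": "text", "text": "Extract the text content from this image."},
        ],
    }
]

# Same preparation steps as in the diff above.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=2048)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```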
 
  ### Performance Comparison: dots.ocr vs. Competing Models
  <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/chart.png" border="0" />

+ > **Notes:**
  > - The EN, ZH metrics are the end2end evaluation results of [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and the Multilingual metric is the end2end evaluation result of dots.ocr-bench.


+ ## News
+ * ```2025.10.31 ``` 🚀 We release [dots.ocr.base](https://huggingface.co/rednote-hilab/dots.ocr.base), a foundation VLM focused on OCR tasks and the base model of [dots.ocr](https://github.com/rednote-hilab/dots.ocr). Try it out!
  * ```2025.07.30 ``` 🚀 We release [dots.ocr](https://github.com/rednote-hilab/dots.ocr), a multilingual document parsing model based on a 1.7B LLM, with SOTA performance.

 
  <td>0.100</td>
  <td>0.185</td>
  </tr>

  <td rowspan="5"><strong>General<br>VLMs</strong></td>
  <td>GPT4o</td>
 
  <td>0.295</td>
  <td><strong>0.384</strong></td>
  <td>83.3</td>
+ <td><strong>89.3</strong></td>
  <td>0.165</td>
  <td><strong>0.085</strong></td>
  <td>0.058</td>
 
  </tbody>
  </table>

+ > **Notes:**
  > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR), [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and our own internal evaluations.
  > - We delete the Page-header and Page-footer cells in the result markdown.
  > - We use tikz_preprocess pipeline to upsample the images to dpi 200.
 
  </tbody>
  </table>

+ > **Notes:**
  > - We use the same metric calculation pipeline of [OmniDocBench](https://github.com/opendatalab/OmniDocBench).
  > - We delete the Page-header and Page-footer cells in the result markdown.

 
  </tbody>
  </table>

+ > **Notes:**
  > - prompt_layout_all_en for **parse all**, prompt_layout_only_en for **detection only**, please refer to [prompts](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)

 


  > **Note:**
+ > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
  [olmocr](https://github.com/allenai/olmocr), and our own internal evaluations.
  > - We delete the Page-header and Page-footer cells in the result markdown.

 
  > 💡**Note:** Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path. This is a temporary workaround pending our integration with Transformers.
  ```shell
  python3 tools/download_model.py
+
+ # with modelscope
+ python3 tools/download_model.py --type modelscope
  ```

  ## 2. Deployment
  ### vLLM inference
+ We highly recommend using vLLM for deployment and inference. All of our evaluation results are based on vLLM 0.9.1 via out-of-tree model registration. **Since vLLM version 0.11.0, dots.ocr has been officially integrated into vLLM with verified performance**, so you can use the vLLM Docker image directly (e.g., `vllm/vllm-openai:v0.11.0`) to deploy the model server.

  ```shell
+ # Launch vLLM model server
+ vllm serve rednote-hilab/dots.ocr --trust-remote-code --async-scheduling --gpu-memory-utilization 0.95

+ # vLLM API Demo
+ # See dots_ocr/model/inference.py for details on parameter and prompt settings
+ # that help achieve the best output quality.
  python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en
  ```
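As an illustrative aside, here is a minimal client sketch for the server launched above, using vLLM's OpenAI-compatible API. It assumes the default port 8000 and that the served model name equals the path passed to `vllm serve`; the prompt string is a placeholder (the real prompt templates live in dots_ocr/utils/prompts.py).

```python
# Hedged sketch: query the vLLM OpenAI-compatible endpoint launched above.
# Assumptions: default port 8000, served model name equal to the serve path,
# and a placeholder prompt (use the prompts from dots_ocr/utils/prompts.py).
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("demo/demo_image1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="rednote-hilab/dots.ocr",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "Extract the text content from this image."},
            ],
        }
    ],
    max_tokens=2048,
    temperature=0.0,
)
print(response.choices[0].message.content)
```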

  # Preparation for inference
  text = processor.apply_chat_template(
+ messages,
+ tokenize=False,
  add_generation_prompt=True
  )
  image_inputs, video_inputs = process_vision_info(messages)
 

  </details>

+ ### Hugging Face inference with CPU
+ Please refer to [CPU inference](https://github.com/rednote-hilab/dots.ocr/issues/1#issuecomment-3148962536).
+
+
  ## 3. Document Parse
  **Based on the vLLM server**, you can parse an image or a PDF file using the following commands:
  ```bash

  # Parse a single image
  python3 dots_ocr/parser.py demo/demo_image1.jpg
  # Parse a single PDF
+ python3 dots_ocr/parser.py demo/demo_pdf1.pdf --num_thread 64 # try a bigger --num_thread for PDFs with a large number of pages

  # Layout detection only
  python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en

  python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_grounding_ocr --bbox 163 241 1536 705

  ```
+ **Based on Transformers**, you can parse an image or a PDF file using the same commands as above; just add `--use_hf true`.
+
+ > Note: Transformers inference is slower than vLLM. If you want to use the demos under `demo/*` with Transformers, pass `use_hf=True` when constructing `DotsOCRParser(.., use_hf=True)`.
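As an illustrative aside, a small batch-processing sketch built only on the CLI invocation shown above. The folder path and prompt mode are placeholders, and where the parser writes its results is left to `dots_ocr/parser.py`'s defaults, which are not specified in this section.

```python
# Hedged sketch: batch-parse a folder of page images with the CLI shown above.
# Only the documented invocation is assumed; output handling follows
# dots_ocr/parser.py's own defaults.
import pathlib
import subprocess

image_dir = pathlib.Path("demo")          # placeholder input folder
prompt_mode = "prompt_layout_all_en"      # see dots_ocr/utils/prompts.py

for image_path in sorted(image_dir.glob("*.jpg")):
    subprocess.run(
        ["python3", "dots_ocr/parser.py", str(image_path), "--prompt", prompt_mode],
        check=True,
    )
```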

  <details>
  <summary><b>Output Results</b></summary>
 


  ## Acknowledgments
+ We would like to thank [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [aimv2](https://github.com/apple/ml-aim), [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
+ [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and [PyMuPDF](https://github.com/pymupdf/PyMuPDF) for providing code and models.

+ We also thank [DocLayNet](https://github.com/DS4SD/DocLayNet), [M6Doc](https://github.com/HCIILAB/M6Doc), [CDLA](https://github.com/buptlihang/CDLA), and [D4LA](https://github.com/AlibabaResearch/AdvancedLiterateMachinery) for providing valuable datasets.

  ## Limitation & Future Work

+ - **Complex Document Elements:**
+   - **Table & Formula**: dots.ocr is not yet perfect for high-complexity table and formula extraction.
+   - **Picture**: Pictures in documents are currently not parsed.
+
+ - **Parsing Failures:** The model may fail to parse under certain conditions:
+   - When the character-to-pixel ratio is excessively high. Try enlarging the image or increasing the PDF parsing DPI (a setting of 200 is recommended; see the sketch after this list). However, please note that the model performs optimally on images with a resolution under 11289600 pixels.
+   - Continuous special characters, such as ellipses (`...`) and underscores (`_`), may cause the prediction output to repeat endlessly. In such scenarios, consider using alternative prompts like `prompt_layout_only_en`, `prompt_ocr`, or `prompt_grounding_ocr` ([details here](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)).

+ - **Performance Bottleneck:** Despite its 1.7B-parameter LLM foundation, **dots.ocr** is not yet optimized for high-throughput processing of large PDF volumes.
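As a sketch of the DPI advice in the parsing-failures item above: render a PDF page at DPI 200 with PyMuPDF (already credited in the Acknowledgments) and downscale if the result exceeds the 11289600-pixel ceiling mentioned there. The function name and defaults are placeholders, not part of the dots.ocr API.

```python
# Hedged sketch: render a PDF page at DPI 200 and cap the pixel count,
# following the advice in the limitation list above.
import fitz  # PyMuPDF
from PIL import Image

MAX_PIXELS = 11289600  # resolution ceiling mentioned above

def render_page(pdf_path: str, page_index: int = 0, dpi: int = 200) -> Image.Image:
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_index)
    pix = page.get_pixmap(dpi=dpi)
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    if img.width * img.height > MAX_PIXELS:
        scale = (MAX_PIXELS / (img.width * img.height)) ** 0.5
        img = img.resize((int(img.width * scale), int(img.height * scale)), Image.LANCZOS)
    doc.close()
    return img

# Example: render_page("demo/demo_pdf1.pdf").save("page0.png")
```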

  We are committed to achieving more accurate table and formula parsing, as well as enhancing the model's OCR capabilities for broader generalization, all while aiming for **a more powerful, more efficient model**. Furthermore, we are actively considering the development of **a more general-purpose perception model** based on Vision-Language Models (VLMs), which would integrate general detection, image captioning, and OCR tasks into a unified framework. **Parsing the content of the pictures in the documents** is also a key priority for our future work.
+ We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of document intelligence and are interested in contributing to these future endeavors, we would love to hear from you. Please reach out to us via email at: [[email protected]].
+
+ ## Citation
+ If you find our work helpful or inspiring, please feel free to cite it.
+ ```bibtex
+ @inproceedings{dots.ocr,
+   title={dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model},
+   author={[Anonymous Authors]},
+   booktitle={CVPR},
+   year={2025},
+   url={https://huggingface.co/papers/2512.02498}
+ }
+ ```