Improve model card: update library_name, add paper, code, project links and citation

#40
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +67 -51
README.md CHANGED
@@ -1,6 +1,10 @@
  ---
  license: mit
- library_name: dots_ocr
  pipeline_tag: image-text-to-text
  tags:
  - image-to-text
@@ -11,10 +15,6 @@ tags:
  - formula
  - transformers
  - custom_code
- language:
- - en
- - zh
- - multilingual
  ---

  <div align="center">
@@ -24,23 +24,25 @@ language:
  <p>

  <h1 align="center">
- dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
  </h1>

- [![Blog](https://img.shields.io/badge/Blog-View_on_GitHub-333.svg?logo=github)](https://github.com/rednote-hilab/dots.ocr/blob/master/assets/blog.md)
- [![HuggingFace](https://img.shields.io/badge/HuggingFace%20Weights-black.svg?logo=HuggingFace)](https://huggingface.co/rednote-hilab/dots.ocr)


  <div align="center">
- <a href="https://dotsocr.xiaohongshu.com" target="_blank" rel="noopener noreferrer"><strong>🖥️ Live Demo</strong></a> |
- <a href="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/wechat.jpg" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> |
- <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a>
  </div>

  </div>


-
  ## Introduction

  **dots.ocr** is a powerful, multilingual document parser that unifies layout detection and content recognition within a single vision-language model while maintaining good reading order. Despite its compact 1.7B-parameter LLM foundation, it achieves state-of-the-art (SOTA) performance.
@@ -104,8 +106,8 @@ messages = [

  # Preparation for inference
  text = processor.apply_chat_template(
- messages,
- tokenize=False,
  add_generation_prompt=True
  )
  image_inputs, video_inputs = process_vision_info(messages)
@@ -133,11 +135,12 @@ print(output_text)
  ### Performance Comparison: dots.ocr vs. Competing Models
  <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/chart.png" border="0" />

- > **Notes:**
  > - The EN, ZH metrics are the end2end evaluation results of [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and the Multilingual metric is the end2end evaluation result of dots.ocr-bench.


- ## News
  * ```2025.07.30 ``` 🚀 We release [dots.ocr](https://github.com/rednote-hilab/dots.ocr), a multilingual document parsing model based on a 1.7B LLM, with SOTA performance.

@@ -433,7 +436,6 @@ print(output_text)
  <td>0.100</td>
  <td>0.185</td>
  </tr>
- <tr>

  <td rowspan="5"><strong>General<br>VLMs</strong></td>
  <td>GPT4o</td>
@@ -504,7 +506,7 @@ print(output_text)
  <td>0.295</td>
  <td><strong>0.384</strong></td>
  <td>83.3</td>
- <td><strong>89.3</strong></td>
  <td>0.165</td>
  <td><strong>0.085</strong></td>
  <td>0.058</td>
@@ -728,7 +730,7 @@ print(output_text)
  </tbody>
  </table>

- > **Notes:**
  > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR), [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and our own internal evaluations.
  > - We delete the Page-header and Page-footer cells in the result markdown.
  > - We use tikz_preprocess pipeline to upsample the images to dpi 200.
@@ -801,7 +803,7 @@ This is an inhouse benchmark which contain 1493 pdf images with 100 languages.
  </tbody>
  </table>

- > **Notes:**
  > - We use the same metric calculation pipeline of [OmniDocBench](https://github.com/opendatalab/OmniDocBench).
  > - We delete the Page-header and Page-footer cells in the result markdown.

@@ -873,7 +875,7 @@ This is an inhouse benchmark which contain 1493 pdf images with 100 languages.
  </tbody>
  </table>

- > **Notes:**
  > - prompt_layout_all_en for **parse all**, prompt_layout_only_en for **detection only**, please refer to [prompts](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)

@@ -1080,7 +1082,7 @@ This is an inhouse benchmark which contain 1493 pdf images with 100 languages.


  > **Note:**
- > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
  [olmocr](https://github.com/allenai/olmocr), and our own internal evaluations.
  > - We delete the Page-header and Page-footer cells in the result markdown.

@@ -1113,28 +1115,23 @@ pip install -e .
  > 💡**Note:** Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path. This is a temporary workaround pending our integration with Transformers.
  ```shell
  python3 tools/download_model.py
  ```


  ## 2. Deployment
  ### vLLM inference
- We highly recommend using vllm for deployment and inference. All of our evaluations results are based on vllm version 0.9.1.
- The [Docker Image](https://hub.docker.com/r/rednotehilab/dots.ocr) is based on the official vllm image. You can also follow [Dockerfile](https://github.com/rednote-hilab/dots.ocr/blob/master/docker/Dockerfile) to build the deployment environment by yourself.

  ```shell
- # You need to register model to vllm at first
- python3 tools/download_model.py
- export hf_model_path=./weights/DotsOCR # Path to your downloaded model weights, Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path. This is a temporary workaround pending our integration with Transformers.
- export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH
- sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
- from DotsOCR import modeling_dots_ocr_vllm' `which vllm` # If you downloaded model weights by yourself, please replace `DotsOCR` by your model saved directory name, and remember to use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`)
-
- # launch vllm server
- CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --chat-template-content-format string --served-model-name model --trust-remote-code
-
- # If you get a ModuleNotFoundError: No module named 'DotsOCR', please check the note above on the saved model directory name.

- # vllm api demo
  python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en
  ```

@@ -1197,8 +1194,8 @@ messages = [

  # Preparation for inference
  text = processor.apply_chat_template(
- messages,
- tokenize=False,
  add_generation_prompt=True
  )
  image_inputs, video_inputs = process_vision_info(messages)
@@ -1226,6 +1223,10 @@ print(output_text)

  </details>

  ## 3. Document Parse
  **Based on the vLLM server**, you can parse an image or a PDF file using the following commands:
  ```bash
@@ -1234,7 +1235,7 @@ print(output_text)
  # Parse a single image
  python3 dots_ocr/parser.py demo/demo_image1.jpg
  # Parse a single PDF
- python3 dots_ocr/parser.py demo/demo_pdf1.pdf --num_threads 64 # try bigger num_threads for pdf with a large number of pages

  # Layout detection only
  python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en
@@ -1246,6 +1247,9 @@ python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr
  python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_grounding_ocr --bbox 163 241 1536 705

  ```

  <details>
  <summary><b>Output Results</b></summary>
@@ -1294,22 +1298,34 @@ python demo/demo_gradio_annotion.py


  ## Acknowledgments
- We would like to thank [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [aimv2](https://github.com/apple/ml-aim), [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
- [OmniDocBench](https://github.com/opendatalab/OmniDocBench), [PyMuPDF](https://github.com/pymupdf/PyMuPDF), for providing code and models.

- We also thank [DocLayNet](https://github.com/DS4SD/DocLayNet), [M6Doc](https://github.com/HCIILAB/M6Doc), [CDLA](https://github.com/buptlihang/CDLA), [D4LA](https://github.com/AlibabaResearch/AdvancedLiterateMachinery) for providing valuable datasets.

  ## Limitation & Future Work

- - **Complex Document Elements:**
- - **Table&Formula**: dots.ocr is not yet perfect for high-complexity tables and formula extraction.
- - **Picture**: Pictures in documents are currently not parsed.

- - **Parsing Failures:** The model may fail to parse under certain conditions:
- - When the character-to-pixel ratio is excessively high. Try enlarging the image or increasing the PDF parsing DPI (a setting of 200 is recommended). However, please note that the model performs optimally on images with a resolution under 11289600 pixels.
- - Continuous special characters, such as ellipses (`...`) and underscores (`_`), may cause the prediction output to repeat endlessly. In such scenarios, consider using alternative prompts like `prompt_layout_only_en`, `prompt_ocr`, or `prompt_grounding_ocr` ([details here](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)).
-
- - **Performance Bottleneck:** Despite its 1.7B parameter LLM foundation, **dots.ocr** is not yet optimized for high-throughput processing of large PDF volumes.

  We are committed to achieving more accurate table and formula parsing, as well as enhancing the model's OCR capabilities for broader generalization, all while aiming for **a more powerful, more efficient model**. Furthermore, we are actively considering the development of **a more general-purpose perception model** based on Vision-Language Models (VLMs), which would integrate general detection, image captioning, and OCR tasks into a unified framework. **Parsing the content of the pictures in the documents** is also a key priority for our future work.
- We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of document intelligence and are interested in contributing to these future endeavors, we would love to hear from you. Please reach out to us via email at: [[email protected]].

  ---
+ language:
+ - en
+ - zh
+ - multilingual
+ library_name: transformers
  license: mit
  pipeline_tag: image-text-to-text
  tags:
  - image-to-text
  - formula
  - transformers
  - custom_code
  ---

  <div align="center">
  <p>

  <h1 align="center">
+ <a href="https://huggingface.co/papers/2512.02498">dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model</a>
  </h1>

+ <p align="center">
+ <a href="https://huggingface.co/papers/2512.02498"><img src="https://img.shields.io/badge/Paper-HF_Link-b31b1b.svg" alt="Hugging Face Paper"></a>
+ <a href="https://github.com/rednote-hilab/dots.ocr"><img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github" alt="GitHub Code"></a>
+ <a href="https://dotsocr.xiaohongshu.com"><img src="https://img.shields.io/badge/Project_Page-Live_Demo-green.svg" alt="Live Demo"></a>
+ </p>


  <div align="center">
+ <a href="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/wechat.jpg" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> |
+ <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a> |
+ <a href="https://x.com/rednotehilab" target="_blank" rel="noopener noreferrer"><strong>🐦 X</strong></a>
  </div>

  </div>


  ## Introduction

  **dots.ocr** is a powerful, multilingual document parser that unifies layout detection and content recognition within a single vision-language model while maintaining good reading order. Despite its compact 1.7B-parameter LLM foundation, it achieves state-of-the-art (SOTA) performance.
 

  # Preparation for inference
  text = processor.apply_chat_template(
+ messages,
+ tokenize=False,
  add_generation_prompt=True
  )
  image_inputs, video_inputs = process_vision_info(messages)
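For readers landing on this fragment of the diff, here is a minimal end-to-end Transformers inference sketch around the two changed lines above. It assumes the Qwen2-VL-style processor API that the surrounding README code implies (AutoProcessor plus `qwen_vl_utils.process_vision_info`) and `trust_remote_code=True` for the custom model code; the model id and prompt text are placeholders, with the real prompts kept in dots_ocr/utils/prompts.py.

```python
# Hedged sketch: minimal Transformers inference for dots.ocr.
# The prompt below is a placeholder; real prompts live in dots_ocr/utils/prompts.py.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "rednote-hilab/dots.ocr"  # assumed model id
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "demo/demo_image1.jpg"},
            {"type": "text", "text": "Extract the text content from this image."},
        ],
    }
]

# Same preparation steps as in the diff above.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=2048)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```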
 
  ### Performance Comparison: dots.ocr vs. Competing Models
  <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/chart.png" border="0" />

+ > **Notes:**
  > - The EN, ZH metrics are the end2end evaluation results of [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and the Multilingual metric is the end2end evaluation result of dots.ocr-bench.


+ ## News
+ * ```2025.10.31 ``` 🚀 We release [dots.ocr.base](https://huggingface.co/rednote-hilab/dots.ocr.base), a foundation VLM focused on OCR tasks and the base model of [dots.ocr](https://github.com/rednote-hilab/dots.ocr). Try it out!
  * ```2025.07.30 ``` 🚀 We release [dots.ocr](https://github.com/rednote-hilab/dots.ocr), a multilingual document parsing model based on a 1.7B LLM, with SOTA performance.

 
  <td>0.100</td>
  <td>0.185</td>
  </tr>

  <td rowspan="5"><strong>General<br>VLMs</strong></td>
  <td>GPT4o</td>
 
  <td>0.295</td>
  <td><strong>0.384</strong></td>
  <td>83.3</td>
+ <td><strong>89.3</strong></td>
  <td>0.165</td>
  <td><strong>0.085</strong></td>
  <td>0.058</td>
 
  </tbody>
  </table>

+ > **Notes:**
  > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR), [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and our own internal evaluations.
  > - We delete the Page-header and Page-footer cells in the result markdown.
  > - We use tikz_preprocess pipeline to upsample the images to dpi 200.
 
  </tbody>
  </table>

+ > **Notes:**
  > - We use the same metric calculation pipeline of [OmniDocBench](https://github.com/opendatalab/OmniDocBench).
  > - We delete the Page-header and Page-footer cells in the result markdown.

 
  </tbody>
  </table>

+ > **Notes:**
  > - prompt_layout_all_en for **parse all**, prompt_layout_only_en for **detection only**, please refer to [prompts](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)

 


  > **Note:**
+ > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
  [olmocr](https://github.com/allenai/olmocr), and our own internal evaluations.
  > - We delete the Page-header and Page-footer cells in the result markdown.

 
  > 💡**Note:** Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path. This is a temporary workaround pending our integration with Transformers.
  ```shell
  python3 tools/download_model.py
+
+ # with modelscope
+ python3 tools/download_model.py --type modelscope
  ```

  ## 2. Deployment
  ### vLLM inference
+ We highly recommend using vLLM for deployment and inference. All of our evaluation results are based on vLLM 0.9.1 via out-of-tree model registration. **Since vLLM version 0.11.0, dots.ocr has been officially integrated into vLLM with verified performance**, so you can use the vLLM Docker image directly (e.g., `vllm/vllm-openai:v0.11.0`) to deploy the model server.

  ```shell
+ # Launch vLLM model server
+ vllm serve rednote-hilab/dots.ocr --trust-remote-code --async-scheduling --gpu-memory-utilization 0.95

+ # vLLM API Demo
+ # See dots_ocr/model/inference.py for details on parameter and prompt settings
+ # that help achieve the best output quality.
  python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en
  ```
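As an illustrative aside, here is a minimal client sketch for the server launched above, using vLLM's OpenAI-compatible API. It assumes the default port 8000 and that the served model name equals the path passed to `vllm serve`; the prompt string is a placeholder (the real prompt templates live in dots_ocr/utils/prompts.py).

```python
# Hedged sketch: query the vLLM OpenAI-compatible endpoint launched above.
# Assumptions: default port 8000, served model name equal to the serve path,
# and a placeholder prompt (use the prompts from dots_ocr/utils/prompts.py).
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("demo/demo_image1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="rednote-hilab/dots.ocr",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "Extract the text content from this image."},
            ],
        }
    ],
    max_tokens=2048,
    temperature=0.0,
)
print(response.choices[0].message.content)
```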

  # Preparation for inference
  text = processor.apply_chat_template(
+ messages,
+ tokenize=False,
  add_generation_prompt=True
  )
  image_inputs, video_inputs = process_vision_info(messages)
 

  </details>

+ ### Hugging Face inference with CPU
+ Please refer to [CPU inference](https://github.com/rednote-hilab/dots.ocr/issues/1#issuecomment-3148962536).
+
+
  ## 3. Document Parse
  **Based on the vLLM server**, you can parse an image or a PDF file using the following commands:
  ```bash

  # Parse a single image
  python3 dots_ocr/parser.py demo/demo_image1.jpg
  # Parse a single PDF
+ python3 dots_ocr/parser.py demo/demo_pdf1.pdf --num_thread 64 # try a bigger --num_thread for PDFs with a large number of pages

  # Layout detection only
  python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en

  python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_grounding_ocr --bbox 163 241 1536 705

  ```
+ **Based on Transformers**, you can parse an image or a PDF file using the same commands as above; just add `--use_hf true`.
+
+ > Note: Transformers inference is slower than vLLM. If you want to use the demos under `demo/*` with Transformers, pass `use_hf=True` when constructing `DotsOCRParser(.., use_hf=True)`.
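As an illustrative aside, a small batch-processing sketch built only on the CLI invocation shown above. The folder path and prompt mode are placeholders, and where the parser writes its results is left to `dots_ocr/parser.py`'s defaults, which are not specified in this section.

```python
# Hedged sketch: batch-parse a folder of page images with the CLI shown above.
# Only the documented invocation is assumed; output handling follows
# dots_ocr/parser.py's own defaults.
import pathlib
import subprocess

image_dir = pathlib.Path("demo")          # placeholder input folder
prompt_mode = "prompt_layout_all_en"      # see dots_ocr/utils/prompts.py

for image_path in sorted(image_dir.glob("*.jpg")):
    subprocess.run(
        ["python3", "dots_ocr/parser.py", str(image_path), "--prompt", prompt_mode],
        check=True,
    )
```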

  <details>
  <summary><b>Output Results</b></summary>
 


  ## Acknowledgments
+ We would like to thank [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [aimv2](https://github.com/apple/ml-aim), [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
+ [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and [PyMuPDF](https://github.com/pymupdf/PyMuPDF) for providing code and models.

+ We also thank [DocLayNet](https://github.com/DS4SD/DocLayNet), [M6Doc](https://github.com/HCIILAB/M6Doc), [CDLA](https://github.com/buptlihang/CDLA), and [D4LA](https://github.com/AlibabaResearch/AdvancedLiterateMachinery) for providing valuable datasets.

  ## Limitation & Future Work

+ - **Complex Document Elements:**
+   - **Table & Formula**: dots.ocr is not yet perfect for high-complexity table and formula extraction.
+   - **Picture**: Pictures in documents are currently not parsed.
+
+ - **Parsing Failures:** The model may fail to parse under certain conditions:
+   - When the character-to-pixel ratio is excessively high. Try enlarging the image or increasing the PDF parsing DPI (a setting of 200 is recommended; see the sketch after this list). However, please note that the model performs optimally on images with a resolution under 11289600 pixels.
+   - Continuous special characters, such as ellipses (`...`) and underscores (`_`), may cause the prediction output to repeat endlessly. In such scenarios, consider using alternative prompts like `prompt_layout_only_en`, `prompt_ocr`, or `prompt_grounding_ocr` ([details here](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)).

+ - **Performance Bottleneck:** Despite its 1.7B-parameter LLM foundation, **dots.ocr** is not yet optimized for high-throughput processing of large PDF volumes.
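As a sketch of the DPI advice in the parsing-failures item above: render a PDF page at DPI 200 with PyMuPDF (already credited in the Acknowledgments) and downscale if the result exceeds the 11289600-pixel ceiling mentioned there. The function name and defaults are placeholders, not part of the dots.ocr API.

```python
# Hedged sketch: render a PDF page at DPI 200 and cap the pixel count,
# following the advice in the limitation list above.
import fitz  # PyMuPDF
from PIL import Image

MAX_PIXELS = 11289600  # resolution ceiling mentioned above

def render_page(pdf_path: str, page_index: int = 0, dpi: int = 200) -> Image.Image:
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_index)
    pix = page.get_pixmap(dpi=dpi)
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    if img.width * img.height > MAX_PIXELS:
        scale = (MAX_PIXELS / (img.width * img.height)) ** 0.5
        img = img.resize((int(img.width * scale), int(img.height * scale)), Image.LANCZOS)
    doc.close()
    return img

# Example: render_page("demo/demo_pdf1.pdf").save("page0.png")
```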

  We are committed to achieving more accurate table and formula parsing, as well as enhancing the model's OCR capabilities for broader generalization, all while aiming for **a more powerful, more efficient model**. Furthermore, we are actively considering the development of **a more general-purpose perception model** based on Vision-Language Models (VLMs), which would integrate general detection, image captioning, and OCR tasks into a unified framework. **Parsing the content of the pictures in the documents** is also a key priority for our future work.
+ We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of document intelligence and are interested in contributing to these future endeavors, we would love to hear from you. Please reach out to us via email at: [[email protected]].
+
+ ## Citation
+ If you find our work helpful or inspiring, please feel free to cite it.
+ ```bibtex
+ @inproceedings{dots.ocr,
+   title={dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model},
+   author={[Anonymous Authors]},
+   booktitle={CVPR},
+   year={2025},
+   url={https://huggingface.co/papers/2512.02498}
+ }
+ ```