Improve model card: update library_name, add paper, code, project links and citation
#40 · opened by nielsr (HF Staff)

README.md CHANGED
@@ -1,6 +1,10 @@
 ---
+language:
+- en
+- zh
+- multilingual
+library_name: transformers
 license: mit
-library_name: dots_ocr
 pipeline_tag: image-text-to-text
 tags:
 - image-to-text
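With `library_name` now set to `transformers` (and the `custom_code` tag kept further down), the checkpoint is meant to be loaded through Transformers' remote-code path. A minimal loading sketch under that assumption; the repo id matches the one used in the `vllm serve` command later in this diff, while the dtype choice is illustrative:

```python
# Minimal sketch: load the checkpoint via Transformers' remote-code path.
# The dtype is an illustrative choice; trust_remote_code=True is required as long
# as the model ships custom modeling code (see the note about the pending integration).
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rednote-hilab/dots.ocr"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # pick a dtype your hardware supports
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```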
@@ -11,10 +15,6 @@ tags:
 - formula
 - transformers
 - custom_code
-language:
-- en
-- zh
-- multilingual
 ---

 <div align="center">
@@ -24,23 +24,25 @@ language:
 <p>

 <h1 align="center">
-dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
+<a href="https://huggingface.co/papers/2512.02498">dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model</a>
 </h1>

-
-
+<p align="center">
+<a href="https://huggingface.co/papers/2512.02498"><img src="https://img.shields.io/badge/Paper-HF_Link-b31b1b.svg" alt="Hugging Face Paper"></a>
+<a href="https://github.com/rednote-hilab/dots.ocr"><img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github" alt="GitHub Code"></a>
+<a href="https://dotsocr.xiaohongshu.com"><img src="https://img.shields.io/badge/Project_Page-Live_Demo-green.svg" alt="Live Demo"></a>
+</p>


 <div align="center">
-<a href="https://
-<a href="https://
-<a href="https://
+<a href="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/wechat.jpg" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> |
+<a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a> |
+<a href="https://x.com/rednotehilab" target="_blank" rel="noopener noreferrer"><strong>🐦 X</strong></a>
 </div>

 </div>


-
 ## Introduction

 **dots.ocr** is a powerful, multilingual document parser that unifies layout detection and content recognition within a single vision-language model while maintaining good reading order. Despite its compact 1.7B-parameter LLM foundation, it achieves state-of-the-art (SOTA) performance.
@@ -104,8 +106,8 @@ messages = [

 # Preparation for inference
 text = processor.apply_chat_template(
-messages,
-tokenize=False,
+messages,
+tokenize=False,
 add_generation_prompt=True
 )
 image_inputs, video_inputs = process_vision_info(messages)
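The hunk above only touches the two `apply_chat_template` arguments. For orientation, this is roughly how the surrounding Transformers inference block completes, in the Qwen2.5-VL style the card follows and assuming the `qwen_vl_utils` helper the hunk's context line uses; the token budget and variable names are illustrative:

```python
# Sketch of the steps around the chat-template call shown in the hunk above.
# `model` and `processor` are assumed to be loaded as in the model card, and
# `messages` is the chat-format list referenced by the hunk header.
from qwen_vl_utils import process_vision_info

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)  # budget is illustrative
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(trimmed, skip_special_tokens=True)
print(output_text)
```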
@@ -133,11 +135,12 @@ print(output_text)
 ### Performance Comparison: dots.ocr vs. Competing Models
 <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/chart.png" border="0" />

-> **Notes:**
+> **Notes:**
 > - The EN, ZH metrics are the end2end evaluation results of [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and the Multilingual metric is the end2end evaluation result of dots.ocr-bench.


-## News
+## News
+* ```2025.10.31 ``` 🚀 We release [dots.ocr.base](https://huggingface.co/rednote-hilab/dots.ocr.base), a foundation VLM focused on OCR tasks and the base model of [dots.ocr](https://github.com/rednote-hilab/dots.ocr). Try it out!
 * ```2025.07.30 ``` 🚀 We release [dots.ocr](https://github.com/rednote-hilab/dots.ocr), a multilingual document parsing model based on a 1.7B LLM, with SOTA performance.


@@ -433,7 +436,6 @@
 <td>0.100</td>
 <td>0.185</td>
 </tr>
-<tr>

 <td rowspan="5"><strong>General<br>VLMs</strong></td>
 <td>GPT4o</td>
@@ -504,7 +506,7 @@
 <td>0.295</td>
 <td><strong>0.384</strong></td>
 <td>83.3</td>
-
+<td><strong>89.3</strong></td>
 <td>0.165</td>
 <td><strong>0.085</strong></td>
 <td>0.058</td>
@@ -728,7 +730,7 @@
 </tbody>
 </table>

-> **Notes:**
+> **Notes:**
 > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR), [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and our own internal evaluations.
 > - We delete the Page-header and Page-footer cells in the result markdown.
 > - We use the tikz_preprocess pipeline to upsample the images to 200 DPI.
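The last note mentions upsampling inputs to 200 DPI via a tikz_preprocess pipeline. For readers who want the same DPI outside that pipeline, a minimal sketch with PyMuPDF (already credited in the Acknowledgments); the helper below is illustrative and not the repo's own preprocessing code:

```python
# Illustrative sketch: rasterize PDF pages at 200 DPI with PyMuPDF before feeding
# them to the model. This mirrors the DPI recommendation in the note above; it is
# not the repo's tikz_preprocess implementation.
import fitz  # PyMuPDF

def render_pdf_at_200_dpi(pdf_path: str, out_prefix: str = "page") -> list[str]:
    paths = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            pix = page.get_pixmap(dpi=200)  # 200 DPI, as recommended above
            out_path = f"{out_prefix}_{i:03d}.png"
            pix.save(out_path)
            paths.append(out_path)
    return paths
```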
@@ -801,7 +803,7 @@ This is an in-house benchmark which contains 1493 PDF images in 100 languages.
 </tbody>
 </table>

-> **Notes:**
+> **Notes:**
 > - We use the same metric calculation pipeline of [OmniDocBench](https://github.com/opendatalab/OmniDocBench).
 > - We delete the Page-header and Page-footer cells in the result markdown.

@@ -873,7 +875,7 @@ This is an in-house benchmark which contains 1493 PDF images in 100 languages.
 </tbody>
 </table>

-> **Notes:**
+> **Notes:**
 > - prompt_layout_all_en is for **parse all**, prompt_layout_only_en is for **detection only**; please refer to [prompts](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)


@@ -1080,7 +1082,7 @@ This is an in-house benchmark which contains 1493 PDF images in 100 languages.


 > **Note:**
-> - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
+> - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
 [olmocr](https://github.com/allenai/olmocr), and our own internal evaluations.
 > - We delete the Page-header and Page-footer cells in the result markdown.

@@ -1113,28 +1115,23 @@ pip install -e .
 > 💡**Note:** Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path. This is a temporary workaround pending our integration with Transformers.
 ```shell
 python3 tools/download_model.py
+
+# with modelscope
+python3 tools/download_model.py --type modelscope
 ```


 ## 2. Deployment
 ### vLLM inference
-We highly recommend using
-The [Docker Image](https://hub.docker.com/r/rednotehilab/dots.ocr) is based on the official vllm image. You can also follow [Dockerfile](https://github.com/rednote-hilab/dots.ocr/blob/master/docker/Dockerfile) to build the deployment environment by yourself.
+We highly recommend using vLLM for deployment and inference. All of our evaluation results are based on vLLM 0.9.1 via out-of-tree model registration. **Since vLLM version 0.11.0, Dots OCR has been officially integrated into vLLM with verified performance**, so you can use the vLLM Docker image directly (e.g., `vllm/vllm-openai:v0.11.0`) to deploy the model server.

 ```shell
-#
-
-export hf_model_path=./weights/DotsOCR # Path to your downloaded model weights. Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path. This is a temporary workaround pending our integration with Transformers.
-export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH
-sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
-from DotsOCR import modeling_dots_ocr_vllm' `which vllm` # If you downloaded model weights by yourself, please replace `DotsOCR` by your model saved directory name, and remember to use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`)
-
-# launch vllm server
-CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --chat-template-content-format string --served-model-name model --trust-remote-code
-
-# If you get a ModuleNotFoundError: No module named 'DotsOCR', please check the note above on the saved model directory name.
+# Launch vLLM model server
+vllm serve rednote-hilab/dots.ocr --trust-remote-code --async-scheduling --gpu-memory-utilization 0.95

-#
+# vLLM API Demo
+# See dots_ocr/model/inference.py for details on parameter and prompt settings
+# that help achieve the best output quality.
 python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en
 ```

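Both the old launch recipe and the new `vllm serve rednote-hilab/dots.ocr` command expose an OpenAI-compatible endpoint, with `demo/demo_vllm.py` as the reference client. As a rough client-side sketch only; the port, served model name, and prompt text are assumptions rather than values taken from the repo:

```python
# Rough sketch of querying the vLLM server launched above via its OpenAI-compatible API.
# The base_url/port, model name, and prompt text are assumptions; demo/demo_vllm.py in the
# repo remains the authoritative client.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("demo/demo_image1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="rednote-hilab/dots.ocr",  # or "model" if you kept --served-model-name model
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Parse the layout and content of this document page."},
        ],
    }],
)
print(response.choices[0].message.content)
```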
@@ -1197,8 +1194,8 @@ messages = [

 # Preparation for inference
 text = processor.apply_chat_template(
-messages,
-tokenize=False,
+messages,
+tokenize=False,
 add_generation_prompt=True
 )
 image_inputs, video_inputs = process_vision_info(messages)
@@ -1226,6 +1223,10 @@ print(output_text)

 </details>

+### Hugging Face inference with CPU
+Please refer to [CPU inference](https://github.com/rednote-hilab/dots.ocr/issues/1#issuecomment-3148962536).
+
+
 ## 3. Document Parse
 **Based on vLLM server**, you can parse an image or a PDF file using the following commands:
 ```bash
@@ -1234,7 +1235,7 @@
 # Parse a single image
 python3 dots_ocr/parser.py demo/demo_image1.jpg
 # Parse a single PDF
-python3 dots_ocr/parser.py demo/demo_pdf1.pdf --
+python3 dots_ocr/parser.py demo/demo_pdf1.pdf --num_thread 64  # try a larger num_thread for PDFs with many pages

 # Layout detection only
 python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en
@@ -1246,6 +1247,9 @@ python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr
 python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_grounding_ocr --bbox 163 241 1536 705

 ```
+**Based on Transformers**, you can parse an image or a PDF file with the same commands as above; just add `--use_hf true`.
+
+> Notice: Transformers is slower than vLLM. If you want to use demo/* with Transformers, just add `use_hf=True` in `DotsOCRParser(.., use_hf=True)`.

 <details>
 <summary><b>Output Results</b></summary>
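The commands above drive `dots_ocr/parser.py` one file at a time; a small batch-processing sketch that reuses only the flags shown in this diff (the input glob is illustrative):

```python
# Illustrative sketch: batch-parse a folder of PDFs by calling the CLI shown above.
# Only flags that appear in this diff are used; append "--use_hf", "true" to run on the
# Transformers backend instead of the vLLM server, as the notice above describes.
import glob
import subprocess

for pdf in sorted(glob.glob("./docs/*.pdf")):
    subprocess.run(
        ["python3", "dots_ocr/parser.py", pdf, "--num_thread", "64"],
        check=True,
    )
```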
@@ -1294,22 +1298,34 @@ python demo/demo_gradio_annotion.py


 ## Acknowledgments
-We would like to thank [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [aimv2](https://github.com/apple/ml-aim), [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
-[OmniDocBench](https://github.com/opendatalab/OmniDocBench), [PyMuPDF](https://github.com/pymupdf/PyMuPDF), for providing code and models.
+We would like to thank [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [aimv2](https://github.com/apple/ml-aim), [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
+[OmniDocBench](https://github.com/opendatalab/OmniDocBench), and [PyMuPDF](https://github.com/pymupdf/PyMuPDF) for providing code and models.

-We also thank [DocLayNet](https://github.com/DS4SD/DocLayNet), [M6Doc](https://github.com/HCIILAB/M6Doc), [CDLA](https://github.com/buptlihang/CDLA), [D4LA](https://github.com/AlibabaResearch/AdvancedLiterateMachinery) for providing valuable datasets.
+We also thank [DocLayNet](https://github.com/DS4SD/DocLayNet), [M6Doc](https://github.com/HCIILAB/M6Doc), [CDLA](https://github.com/buptlihang/CDLA), and [D4LA](https://github.com/AlibabaResearch/AdvancedLiterateMachinery) for providing valuable datasets.

 ## Limitation & Future Work

-
-
-
+- **Complex Document Elements:**
+  - **Table & Formula**: dots.ocr is not yet perfect for high-complexity table and formula extraction.
+  - **Picture**: Pictures in documents are currently not parsed.
+
+- **Parsing Failures:** The model may fail to parse under certain conditions:
+  - When the character-to-pixel ratio is excessively high. Try enlarging the image or increasing the PDF parsing DPI (a setting of 200 is recommended). However, please note that the model performs optimally on images with a resolution under 11,289,600 pixels.
+  - Continuous special characters, such as ellipses (`...`) and underscores (`_`), may cause the prediction output to repeat endlessly. In such scenarios, consider using alternative prompts like `prompt_layout_only_en`, `prompt_ocr`, or `prompt_grounding_ocr` ([details here](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)).

-
-- When the character-to-pixel ratio is excessively high. Try enlarging the image or increasing the PDF parsing DPI (a setting of 200 is recommended). However, please note that the model performs optimally on images with a resolution under 11289600 pixels.
-- Continuous special characters, such as ellipses (`...`) and underscores (`_`), may cause the prediction output to repeat endlessly. In such scenarios, consider using alternative prompts like `prompt_layout_only_en`, `prompt_ocr`, or `prompt_grounding_ocr` ([details here](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)).
-
-- **Performance Bottleneck:** Despite its 1.7B parameter LLM foundation, **dots.ocr** is not yet optimized for high-throughput processing of large PDF volumes.
+- **Performance Bottleneck:** Despite its 1.7B-parameter LLM foundation, **dots.ocr** is not yet optimized for high-throughput processing of large PDF volumes.

 We are committed to achieving more accurate table and formula parsing, as well as enhancing the model's OCR capabilities for broader generalization, all while aiming for **a more powerful, more efficient model**. Furthermore, we are actively considering the development of **a more general-purpose perception model** based on Vision-Language Models (VLMs), which would integrate general detection, image captioning, and OCR tasks into a unified framework. **Parsing the content of the pictures in the documents** is also a key priority for our future work.
-We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of document intelligence and are interested in contributing to these future endeavors, we would love to hear from you. Please reach out to us via email at: [[email protected]].
+We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of document intelligence and are interested in contributing to these future endeavors, we would love to hear from you. Please reach out to us via email at: [[email protected]].
+
+## Citation
+If you find our work helpful or inspiring, please feel free to cite it.
+```bibtex
+@article{dots.ocr,
+  title={dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model},
+  author={[Anonymous Authors]},
+  booktitle={CVPR},
+  year={2025},
+  url={https://huggingface.co/papers/2512.02498}
+}
+```
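The resolution guidance in the limitations above (enlarge pages when the character-to-pixel ratio is high, stay under 11,289,600 pixels) can be applied as a preprocessing step. A sketch with Pillow; the pixel ceiling is the value quoted in the note, while the 2x upscale factor is an illustrative choice rather than a project recommendation:

```python
# Illustrative preprocessing sketch for the limitations listed above: upscale small,
# text-dense pages and cap very large ones. MAX_PIXELS is quoted from the note; the
# 2x upscale factor is an assumption.
from PIL import Image

MAX_PIXELS = 11_289_600

def prepare_image(in_path: str, out_path: str, upscale: float = 2.0) -> None:
    img = Image.open(in_path).convert("RGB")
    w, h = img.size
    if w * h * upscale ** 2 <= MAX_PIXELS:
        # Small page: enlarge to raise the pixel count per character.
        img = img.resize((int(w * upscale), int(h * upscale)), Image.LANCZOS)
    elif w * h > MAX_PIXELS:
        # Oversized page: shrink back under the recommended ceiling.
        scale = (MAX_PIXELS / (w * h)) ** 0.5
        img = img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
    img.save(out_path)
```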