DeepSeek AI

🌟 GitHub | 📥 Model Download | 📄 Paper Link | 📄 arXiv Paper Link

DeepSeek-OCR: Contexts Optical Compression

Explore the boundaries of visual-text compression.

Usage

Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements tested on Python 3.12.9 + CUDA 11.8:

torch==2.6.0
transformers==4.46.3
tokenizers==0.20.3
einops
addict 
easydict
pip install flash-attn==2.7.3 --no-build-isolation
from transformers import AutoModel, AutoTokenizer
import torch
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# prompt = "<image>\nFree OCR. "
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = 'your_image.jpg'
output_path = 'your/output/dir'

# infer(self, tokenizer, prompt='', image_file='', output_path=' ', base_size=1024, image_size=640, crop_mode=True, test_compress=False, save_results=False):

# Tiny: base_size = 512, image_size = 512, crop_mode = False
# Small: base_size = 640, image_size = 640, crop_mode = False
# Base: base_size = 1024, image_size = 1024, crop_mode = False
# Large: base_size = 1280, image_size = 1280, crop_mode = False

# Gundam: base_size = 1024, image_size = 640, crop_mode = True

res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path, base_size=1024, image_size=640, crop_mode=True, save_results=True, test_compress=True)
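
If you only need one of the fixed-resolution modes listed above, the same infer() call works with those settings swapped in. A minimal sketch, assuming Base mode (1024 x 1024, no cropping) and the plain OCR prompt shown earlier:

# Base mode: same infer() signature as above, only the size settings change.
base_prompt = "<image>\nFree OCR. "
res_base = model.infer(
    tokenizer,
    prompt=base_prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=1024,
    crop_mode=False,
    save_results=True,
)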

vLLM

Refer to the 🌟 GitHub repository for guidance on model inference acceleration, PDF processing, and more.

[2025/10/23] 🚀🚀🚀 DeepSeek-OCR is now officially supported in upstream vLLM.

uv venv
source .venv/bin/activate
# Until the v0.11.1 release, you need to install vLLM from the nightly build
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

# Create model instance
llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

# Prepare batched input with your image file
image_1 = Image.open("path/to/your/image_1.png").convert("RGB")
image_2 = Image.open("path/to/your/image_2.png").convert("RGB")
prompt = "<image>\nFree OCR."

model_input = [
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image_1}
    },
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image_2}
    }
]

sampling_param = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    # ngram logit processor args
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # whitelist: <td>, </td>
    ),
    skip_special_tokens=False,
)
# Generate output
model_outputs = llm.generate(model_input, sampling_param)

# Print output
for output in model_outputs:
    print(output.outputs[0].text)
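
If you want to keep the batched results, here is a minimal sketch that writes one text file per input image (the ocr_outputs directory and file names are illustrative, not part of the vLLM API):

# Write each OCR result to its own text file; paths are illustrative.
import os

os.makedirs("ocr_outputs", exist_ok=True)
for i, output in enumerate(model_outputs):
    out_file = os.path.join("ocr_outputs", f"result_{i}.txt")
    with open(out_file, "w", encoding="utf-8") as f:
        f.write(output.outputs[0].text)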

Visualizations

Acknowledgement

We would like to thank Vary, GOT-OCR2.0, MinerU, PaddleOCR, OneChart, and Slow Perception for their valuable models and ideas.

We also appreciate the benchmarks: Fox and OmniDocBench.


🔍 Summary of Changes

1. Formatting & Clean-Up

  • Removed extra spaces, blank lines, and inconsistent indentation.
  • Fixed small style issues (like missing spaces in comments).
  • Added missing newline at the end of the file.

2. Device and Dtype Handling

  • Added automatic device detection:

    model_device = next(self.parameters()).device
    
  • Added adaptive dtype logic:

    image_dtype = torch.bfloat16 if model_device.type == "cuda" else torch.float32
    
  • Replaced all hardcoded .cuda() and .to(torch.bfloat16) with:

    .to(model_device)
    .to(image_dtype)
    

The model now works automatically on both GPU and CPU, without device-mismatch errors; a combined sketch of the pattern is shown below.
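
Putting these pieces together, a minimal sketch of the device/dtype pattern described above (the helper name resolve_device_and_dtype is illustrative, not part of the model's API):

    # Sketch of the device/dtype pattern; the helper name is illustrative.
    import torch

    def resolve_device_and_dtype(model):
        model_device = next(model.parameters()).device
        image_dtype = torch.bfloat16 if model_device.type == "cuda" else torch.float32
        return model_device, image_dtype

    # Example: move an image tensor to wherever the model lives, in the matching dtype.
    # model_device, image_dtype = resolve_device_and_dtype(model)
    # image_tensor = image_tensor.to(model_device).to(image_dtype)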


3. Autocast and Inference Improvements

  • Wrapped generation in a conditional autocast block:

    use_autocast = model_device.type == "cuda"
    if use_autocast:
        with torch.autocast("cuda", dtype=torch.bfloat16):
            with torch.no_grad():
                ...
    else:
        with torch.no_grad():
            ...
    
  • Reduces memory usage and speeds up inference on GPU.

  • Added torch.no_grad() for safer evaluation (no gradient tracking).
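
The same conditional can also be written with a single code path using contextlib.nullcontext. This is a sketch of an equivalent formulation, not the exact change described above; model_device comes from the device-detection snippet earlier:

    # Equivalent conditional autocast with one shared code path (sketch).
    import contextlib
    import torch

    autocast_ctx = (
        torch.autocast("cuda", dtype=torch.bfloat16)
        if model_device.type == "cuda"
        else contextlib.nullcontext()
    )
    with autocast_ctx, torch.no_grad():
        ...  # run generation here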


4. Image Preprocessing

  • All image tensors now use:

    .to(image_dtype)
    

    instead of hardcoded torch.bfloat16.

  • Improves flexibility and prevents dtype errors when running on CPU.


5. Generation Parameter Updates

  • Adjusted text generation settings:

    do_sample = False
    num_beams = 1
    max_new_tokens = 4096  # was 8192
    min_new_tokens = 1
    repetition_penalty = 1.2
    pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id
    

🧠 Results: Faster, more controlled generation; avoids repetitive or runaway outputs.
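
For reference, these settings correspond to standard Hugging Face generate() keyword arguments. A hedged sketch of how they might be passed (input_ids is an illustrative placeholder for the prepared prompt tokens):

    # Sketch: passing the tuned settings to a Hugging Face generate() call.
    # input_ids is an illustrative placeholder, not a name from the model code.
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            do_sample=False,
            num_beams=1,
            max_new_tokens=4096,
            min_new_tokens=1,
            repetition_penalty=1.2,
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
        )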


6. Safer Decoding

  • Cleaned up decoding logic:

    input_length = input_ids.unsqueeze(0).to(model_device).shape[1]
    outputs = tokenizer.decode(output_ids[0, input_length:])
    

✅ Avoids CUDA-specific assumptions, consistent across devices.
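
A slightly fuller sketch of the same device-agnostic decoding, following the variable names above (skip_special_tokens is an optional extra here, not part of the described change):

    # Sketch: decode only the newly generated tokens, independent of device.
    input_ids_on_device = input_ids.unsqueeze(0).to(model_device)
    input_length = input_ids_on_device.shape[1]
    decoded_text = tokenizer.decode(output_ids[0, input_length:], skip_special_tokens=True)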


7. Miscellaneous

  • Added helpful comments for clarity.
  • Improved readability around image transformation and saving results.
  • Added extra blank lines for cleaner structure.

⚙️ Overall Impact

  • Device handling: hardcoded .cuda() → auto-detected and flexible
  • Dtype: always bfloat16 → adaptive (bfloat16 on GPU, float32 on CPU)
  • Inference: could crash on CPU → runs safely everywhere
  • Generation: unbounded, repetitive → tuned and stable
  • Readability: mixed formatting → clean and consistent

Citation

@article{wei2025deepseek,
  title={DeepSeek-OCR: Contexts Optical Compression},
  author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
  journal={arXiv preprint arXiv:2510.18234},
  year={2025}
}