Vision Tokens vs Text Tokens: Understanding the 10× Compression

Community Article · Published October 22, 2025

The Claim

DeepSeek-OCR demonstrates that 100 vision tokens can represent approximately 1000 text tokens with 97%+ accuracy. At first glance, this seems like a simple 10× compression ratio. But what most people miss is why this is possible—and it reveals fundamental differences in how these two types of tokens work.


One Vision Token Contains Way More Information

A Concrete Example

Let's look at what's actually in each type of token:

Text Token:

Token: "Annual"
Information: One word (or subword)
Representation: Token ID → Embedding (4096-dim vector)

Vision Token (from DeepSeek-OCR):

After encoding a 1024×1024 document image:
- Initial patches: 1024/16 × 1024/16 = 4096 patches (16×16 pixels each)
- After 16× compression: 4096/16 = 256 vision tokens
- Each vision token represents: 64×64 pixels

At typical document DPI (150-200), in a 64×64 pixel region:
- Characters: ~6-8 chars wide × 4-5 lines tall
- Words: approximately 5-8 words
- Plus: font style, size, layout, spacing information
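
The arithmetic above is easy to reproduce. Here is a minimal Python check, using only the numbers quoted in this article (not values read from the model's config):

# Sanity-check the patch and token counts for a 1024x1024 document image.
image_size = 1024
patch_size = 16          # 16x16-pixel patches
compression = 16         # 16x token compression, per the figures above

patches = (image_size // patch_size) ** 2        # 64 * 64 = 4096 patches
vision_tokens = patches // compression           # 4096 / 16 = 256 vision tokens
pixels_per_token = image_size ** 2 // vision_tokens

print(patches)                           # 4096
print(vision_tokens)                     # 256
print(int(pixels_per_token ** 0.5))      # 64 -> each token covers a 64x64-pixel region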

Visualizing a Vision Token

Here's what one vision token might contain in a document:

┌──────────────────────────────┐ ← 64 pixels wide
│                              │
│  Annual Revenue Growth       │ ← Line 1 (3 words)
│  Q4 2024: $2.1M              │ ← Line 2 (3 words)
│  Increase: 15.3%             │ ← Line 3 (2 words)
│                              │
└──────────────────────────────┘
      ↓ (Vision Encoder)
  One 4096-dim vector

vs.

Text tokens: ["Annual", "Revenue", "Growth", "Q4", "2024", ":", "$", "2", ".", "1", "M", ...]
             ~15-20 separate tokens for the same three lines of content
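
To put a rough number on that, the sketch below counts text tokens for the three lines in the box above. It uses tiktoken's cl100k_base encoding as a stand-in tokenizer (an assumption; DeepSeek's own tokenizer will give somewhat different counts, but the same order of magnitude):

import tiktoken

# The three lines of text from the 64x64-pixel region sketched above.
text = "Annual Revenue Growth\nQ4 2024: $2.1M\nIncrease: 15.3%"

enc = tiktoken.get_encoding("cl100k_base")   # stand-in tokenizer, not DeepSeek's
tokens = enc.encode(text)

print(f"text tokens:   {len(tokens)}")       # typically 15-20 for these three lines
print("vision tokens: 1")                    # the same region fits in a single vision token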

The Information Density Gap

Type          Coverage        Information Content
Text Token    1 word          ~1 word of text
Vision Token  64×64 pixels    ~5-8 words + layout + formatting

A vision token contains 5-10× more information than a text token, yet they both get mapped to the same embedding dimension (4096-dim).

This is why 100 vision tokens can effectively represent 1000 text tokens—the information density is fundamentally different.
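
The article's own figures already imply roughly that ratio. Taking ~5-8 words per vision token from the table above, and a common rule of thumb of about 1.3 text tokens per English word (an assumption, not a number from the paper):

# Back-of-the-envelope check of the ~10x claim using the figures above.
words_per_vision_token = (5, 8)     # from the table above
text_tokens_per_word = 1.3          # rough rule of thumb for English (assumption)

low = words_per_vision_token[0] * text_tokens_per_word
high = words_per_vision_token[1] * text_tokens_per_word
print(f"~{low:.1f}-{high:.1f} text tokens per vision token")   # ~6.5-10.4, i.e. close to 10x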


Why Do They End Up the Same Embedding Size?

Despite containing vastly different amounts of information, both token types end up as 4096-dimensional vectors. But they get there very differently.

The 4096-Dim Latent Space

The embedding dimension is chosen for representational capacity: enough dimensions to capture semantic relationships and to give the attention mechanism room to work. The result is a learned, dense, continuous space shared by every token the LLM processes.

Different Journeys to 4096-Dim

Text Tokens: Through the Vocabulary

Token ID: 42 ("Annual")
   ↓
[Implicitly: 129K-dimensional vocabulary space] 
   ↓
Embedding lookup → 4096-dim
   ↓
LLM processing (4096-dim)
   ↓
Output projection → 129K logits
   ↓
Softmax → next token ID
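
As a shape-level illustration, here is a minimal PyTorch sketch of that round trip, using the sizes quoted above (4096-dim hidden state, ~129K vocabulary). The module names and exact vocabulary size are illustrative, not DeepSeek's actual code:

import torch
import torch.nn as nn

vocab_size, hidden_dim = 129_000, 4096             # ~129K vocab, 4096-dim latent (illustrative)

embedding = nn.Embedding(vocab_size, hidden_dim)   # token ID -> 4096-dim vector
lm_head = nn.Linear(hidden_dim, vocab_size)        # 4096-dim -> ~129K logits

token_ids = torch.tensor([[42]])                   # e.g. the ID for "Annual"
hidden = embedding(token_ids)                      # shape (1, 1, 4096)
# ... transformer layers would process `hidden` here ...
logits = lm_head(hidden)                           # shape (1, 1, 129000)
next_id = logits.softmax(dim=-1).argmax(dim=-1)    # greedy choice of the next token ID
print(hidden.shape, logits.shape, next_id.shape)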

Vision Tokens: Direct Compression

Raw pixels: 64×64×3 = 12,288 values
   ↓
Vision encoder → 4096-dim
   ↓
LLM processing (4096-dim)
   ↓
[No output - vision tokens are input only]
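
And a matching shape-level sketch of the vision path. This is not DeepEncoder itself; the convolution and linear layers below are stand-ins chosen only to reproduce the token counts and dimensions discussed above:

import torch
import torch.nn as nn

image = torch.rand(1, 3, 1024, 1024)                        # raw pixels

patchify = nn.Conv2d(3, 768, kernel_size=16, stride=16)     # 64x64 = 4096 patch tokens
compress = nn.Conv2d(768, 768, kernel_size=4, stride=4)     # 16x fewer tokens -> 16x16 = 256
project = nn.Linear(768, 4096)                              # into the LLM's 4096-dim latent space

patches = patchify(image)                                   # (1, 768, 64, 64)
compressed = compress(patches)                               # (1, 768, 16, 16)
vision_tokens = project(compressed.flatten(2).transpose(1, 2))   # (1, 256, 4096)

print(vision_tokens.shape)                                   # torch.Size([1, 256, 4096])
# These 256 vectors are fed to the LLM as inputs; they are never mapped back to a vocabulary.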

Vision tokens are continuous from the start: they are compressed directly into the latent space and stay there. There is no vocabulary lookup on the way in, and no projection back to vocabulary logits on the way out.

References

  • DeepSeek-OCR Paper: "DeepSeek-OCR: Contexts Optical Compression"
  • Fox benchmark compression results (Table 2, page 10)
  • DeepEncoder architecture (Section 3.2, pages 5-7)
