Research Paper Explanation

DeepSeek-OCR: Contexts Optical Compression

Compressing Text via Vision for Efficient LLM Training

Haoran Wei, Yaofeng Sun, Yukun Li • DeepSeek-AI • 2025

OCR · Vision-Language · Compression · Document AI · MoE

Abstract & Key Contributions

TL;DR
DeepSeek-OCR compresses text into images and reads it back with 97% accuracy at 10x compression. Instead of storing thousands of text tokens, store a small image and decode it when needed. This enables 200,000+ pages per day of training data generation on a single GPU!

Think of it Like This...

Imagine you have a 10-page essay to send to a friend. You could:

Option A: Send Raw Text

Type out all 5,000 words, character by character. Takes lots of space and bandwidth.

Option B: Take a Photo

Photograph the pages. Your friend reads from the image. Much smaller file!

DeepSeek-OCR is like having a super-smart friend who can look at a tiny, blurry photo of your essay and perfectly recite back every word. The "photo" (vision tokens) is much smaller than the original text, but contains all the information needed to reconstruct it.

Wait, What's a "Token"?

A token is the basic unit AI models work with. Think of it like this:

  • Text tokens: Roughly word-sized chunks. "Hello world" = 2 tokens. A page of text ≈ 500-1000 tokens.
  • Vision tokens: Small patches of an image. A 512×512 image might become 64 vision tokens.

The key insight: 1000 text tokens can be "photographed" into just 100 vision tokens. That's 10x compression!

The Problem: Text is Expensive

Training large language models requires massive amounts of text data. A typical document might contain thousands of tokens, and processing millions of documents becomes a computational bottleneck. What if we could compress text into a more efficient representation?

The insight is that an image of rendered text can serve as a compact, lossy representation of that text. A 1000-token paragraph might fit in a 512×512 image that encodes to just 64 vision tokens. If we can train a model to "read" this image back into text with high fidelity, we've achieved significant compression!

The Solution: Optical Compression

1. Render Text to Image: take a document with N text tokens and render it as a 2D image (PDF page, screenshot, etc.).

2. DeepEncoder Compression: process the image through DeepEncoder to get n vision tokens, where n << N.

3. MoE Decoder: use DeepSeek-3B-MoE to decode the vision tokens back into text with high accuracy.
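As a rough illustration of step 1, here is a minimal sketch that rasterizes text onto a square "page" with Pillow. The downstream encoder/decoder calls are left as hypothetical placeholders (the names are illustrative, not the actual DeepSeek-OCR API).

```python
# Minimal sketch of step 1 (render text to an image), assuming Pillow is installed.
from PIL import Image, ImageDraw

def render_text_to_page(text: str, size: int = 1024) -> Image.Image:
    """Render a text string onto a white square 'page' image."""
    page = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(page)
    # Naive line wrapping: ~80 characters per line with the default bitmap font.
    lines = [text[i:i + 80] for i in range(0, len(text), 80)]
    for row, line in enumerate(lines):
        draw.text((16, 16 + 14 * row), line, fill="black")
    return page

page = render_text_to_page("DeepSeek-OCR compresses text optically. " * 50)
# Hypothetical downstream calls (placeholders only):
# vision_tokens = deep_encoder(page)          # step 2: n vision tokens, n << N
# text = moe_decoder.generate(vision_tokens)  # step 3: reconstruct N text tokens
```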

Key Contributions

  • DeepEncoder: Novel vision encoder combining SAM (perception) + CLIP (knowledge) with 16x token compression
  • 97% OCR Accuracy: When compression ratio < 10x, the model achieves near-perfect text reconstruction
  • Multi-Resolution: Flexible modes from 64 tokens (tiny) to dynamic tiling (Gundam mode)
  • Production Scale: Generates 200k+ pages/day on single A100, 33M pages/day on 20 nodes
  • State-of-the-Art: Outperforms GOT-OCR2.0 and MinerU2.0 with fewer tokens

DeepEncoder Architecture

DeepEncoder is the core innovation - a two-stage vision encoder that extracts both low-level visual features and high-level semantic knowledge, then compresses them dramatically.

The Two-Eyes Analogy

Imagine you're reading a book. You actually use two different "modes" of vision:

Mode 1: Seeing Shapes

Your eyes detect the physical shapes of letters - the curves of "S", the lines of "T", the dots of "i". This is pure pattern recognition.

Mode 2: Understanding Words

Your brain recognizes "cat" as an animal, not just three shapes. This is semantic understanding - knowing what things mean.

DeepEncoder uses two separate neural networks - one for each mode! SAM handles "seeing shapes" and CLIP handles "understanding meaning".

Two-Component Design

Visual Perception (SAM-base)

Architecture: SAM-base with patch size 16

Parameters: ~80M

Attention: Window attention (local)

Purpose: Extract fine-grained visual features like edges, shapes, and character strokes

Visual Knowledge (CLIP-large)

Architecture: CLIP-large (modified)

Parameters: ~300M

Attention: Dense global attention

Purpose: Extract semantic understanding - what words mean, document structure

Why Two Stages?
OCR needs both perception (seeing individual characters) and knowledge (understanding what they mean in context). SAM excels at fine spatial details with window attention, while CLIP brings pre-trained language-vision alignment.

What is SAM? (Segment Anything Model)

SAM was originally created by Meta to identify objects in images. Think of it as a model that's really good at answering "where are the edges?" and "what shapes do I see?"

  • Window Attention: SAM looks at small "windows" of the image at a time, like reading one word at a time instead of the whole page. Great for fine details!
  • Patch Size 16: Divides the image into 16×16 pixel squares. Each square becomes one input token.
  • ~80M parameters: Relatively small, fast to run.

What is CLIP? (Contrastive Language-Image Pre-training)

CLIP was created by OpenAI to understand the meaning of images. It was trained on billions of image-caption pairs, so it knows that a photo of a cat relates to the word "cat".

  • Global Attention: CLIP looks at the entire image at once, understanding how all parts relate to each other.
  • Pre-trained Knowledge: Already knows millions of concepts from its original training - fonts, languages, common words, etc.
  • ~300M parameters: Larger, more powerful for understanding.

Why combine them? SAM sees the tiny details (individual letter strokes), while CLIP understands the big picture (this is English text in Times New Roman font). Together, they capture everything needed to accurately reconstruct the text.
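To make the two-stage layout concrete, here is a schematic PyTorch sketch. The SAM and CLIP stages are replaced by small stand-in modules (a patch-embedding conv and a single global-attention layer), and the channel widths are illustrative; this only shows how features flow SAM → 16x compressor → CLIP, not the real weights or API.

```python
# Schematic sketch of the two-stage encoder (stand-in modules, not real SAM/CLIP).
import torch
import torch.nn as nn

class DeepEncoderSketch(nn.Module):
    def __init__(self, sam_dim=256, clip_dim=1024):
        super().__init__()
        # Stage 1 stand-in: patch embedding at patch size 16
        # (the real SAM-base adds windowed attention on top of this).
        self.sam_like = nn.Conv2d(3, sam_dim, kernel_size=16, stride=16)
        # 16x token compressor: two stride-2 convs between SAM and CLIP.
        self.compress = nn.Sequential(
            nn.Conv2d(sam_dim, clip_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(clip_dim, clip_dim, kernel_size=3, stride=2, padding=1),
        )
        # Stage 2 stand-in: one global self-attention block in place of CLIP-large.
        self.clip_like = nn.TransformerEncoderLayer(
            d_model=clip_dim, nhead=8, batch_first=True
        )

    def forward(self, image):                      # image: [B, 3, 1024, 1024]
        feats = self.sam_like(image)               # [B, 256, 64, 64] -> 4096 patches
        feats = self.compress(feats)               # [B, 1024, 16, 16] -> 256 tokens
        tokens = feats.flatten(2).transpose(1, 2)  # [B, 256, 1024]
        return self.clip_like(tokens)              # global attention over 256 tokens

tokens = DeepEncoderSketch()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 256, 1024])
```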

The 16x Compression Module

Between SAM and CLIP, a convolutional module performs aggressive 16x downsampling:

Conv2d(256 → 1024, kernel=3, stride=2, padding=1) × 2 layers

Each stride-2 layer halves both spatial dimensions: H×W → H/2×W/2 → H/4×W/4

Combined: a 16x reduction in token count (each layer cuts the token count 4x; two layers give 4x × 4x = 16x)

Worked Example: How 16x Compression Works

Let's trace through a 1024×1024 image:

Input: 1024 × 1024 pixels
After SAM: 64 × 64 = 4,096 tokens (patch size 16)
Conv layer 1: 32 × 32 = 1,024 tokens (each spatial dimension halved)
Conv layer 2: 16 × 16 = 256 tokens (each spatial dimension halved again)

Total compression: 4,096 → 256 = 16x fewer tokens! The image information is now packed into just 256 dense embeddings.
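A quick plain-Python check of the arithmetic above (the helper function and its defaults are just for illustration):

```python
# Token-count trace for a 1024x1024 input with patch size 16 and two stride-2 convs.
def token_trace(image_size=1024, patch=16, conv_layers=2):
    side = image_size // patch                 # 1024 / 16 = 64
    counts = [side * side]                     # 4,096 SAM patch tokens
    for _ in range(conv_layers):
        side //= 2                             # each stride-2 conv halves H and W,
        counts.append(side * side)             # so tokens drop 4x per layer
    return counts

print(token_trace())                 # [4096, 1024, 256]
print(4096 // token_trace()[-1])     # 16 -> overall 16x token compression
```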

The Decoder: DeepSeek-3B-MoE

The decoder uses a Mixture-of-Experts architecture for efficient scaling:

What is Mixture-of-Experts (MoE)?

Imagine a hospital with 64 specialist doctors, but each patient only sees 6 of them.

Without MoE (Dense Model)

Every patient sees ALL 64 doctors. Extremely expensive! Like running a 3B parameter model where all 3B parameters are used for every input.

With MoE (Sparse Model)

A "router" picks the best 6 doctors for each patient. Same total expertise available, but only ~10% of the work! 3B parameters, but only 570M activated.

Result: You get the "intelligence" of a 3B model at the cost of a 570M model. This is why DeepSeek-OCR can process documents so quickly!

Total Parameters: 3B (all the doctors)
Activated Parameters: 570M (doctors used per input)
Routed Experts: 6 of 64 (specialists picked)
Shared Experts: 2 (always consulted)
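A toy sketch of top-k expert routing, assuming PyTorch; the tiny linear "experts" and dimensions are stand-ins, not the actual DeepSeek-3B-MoE layers. It only demonstrates the idea that each token is processed by 2 shared experts plus its top 6 of 64 routed experts.

```python
# Toy sparse MoE layer: 2 shared experts always run, only the top-6 routed experts run.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        self.router = nn.Linear(dim, n_routed)
        self.routed = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_routed))
        self.shared = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_shared))
        self.top_k = top_k

    def forward(self, x):                                   # x: [tokens, dim]
        scores = self.router(x).softmax(dim=-1)             # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)      # pick 6 of 64 experts per token
        rows = []
        for t in range(x.size(0)):
            h = sum(expert(x[t]) for expert in self.shared)  # shared experts: always consulted
            for w, e in zip(weights[t], idx[t]):             # routed experts: sparse, top-k only
                h = h + w * self.routed[int(e)](x[t])
            rows.append(h)
        return torch.stack(rows)

out = TinyMoE()(torch.randn(10, 64))
print(out.shape)  # torch.Size([10, 64]); only 6 of 64 routed experts ran per token
```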

Decoder Mapping Function

The decoder learns a non-linear mapping from compressed vision tokens to text:

f_{\text{dec}}: \mathbb{R}^{n \times d_{\text{latent}}} \to \mathbb{R}^{N \times d_{\text{text}}}, \quad \hat{X} = f_{\text{dec}}(Z)

Symbol Breakdown:

  • $f_{\text{dec}}$: The decoder function, a neural network (DeepSeek-3B-MoE) that transforms vision tokens into text representations.
  • $n$: Number of vision tokens after compression (e.g., 64, 100, 256 depending on resolution mode).
  • $N$: Number of text tokens in the original document. The compression ratio is N/n.
  • $d_{\text{latent}}$: Dimension of each vision token embedding (the hidden size of the vision encoder output).
  • $d_{\text{text}}$: Dimension of each text token embedding (vocabulary size for output logits).
  • $Z$: The compressed vision representation, the output of DeepEncoder with shape (n × d_latent).
  • $\hat{X}$: The reconstructed text, predicted token logits with shape (N × d_text).
  • $n \leq N$: The key constraint: vision tokens are fewer than text tokens, enabling compression.

Key insight: This is an expansion mapping. The decoder must learn to "decompress" information, generating N text tokens from only n vision tokens. This works because rendered text has significant redundancy (fonts, spacing, formatting are predictable).
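To make the shape expansion concrete, here is a minimal PyTorch sketch. The real decoder is a decoder-only MoE language model that consumes the vision tokens as a prefix; the single cross-attention layer below (and the placeholder vocabulary size and dimensions) is only a stand-in to show n vision tokens turning into logits for N text tokens.

```python
# Shape-level sketch of the mapping f_dec: (n x d_latent) -> (N x vocab).
import torch
import torch.nn as nn

n, N = 64, 640                 # vision tokens in, text tokens out (10x compression)
d_latent, vocab = 1024, 32000  # placeholder dimensions

proj = nn.Linear(d_latent, d_latent)
decoder_layer = nn.TransformerDecoderLayer(d_model=d_latent, nhead=8, batch_first=True)
lm_head = nn.Linear(d_latent, vocab)

Z = torch.randn(1, n, d_latent)          # compressed vision representation Z
text_emb = torch.randn(1, N, d_latent)   # embedded target text (teacher forcing)

hidden = decoder_layer(tgt=text_emb, memory=proj(Z))  # attend to the n vision tokens
logits = lm_head(hidden)                              # X_hat: predicted text logits
print(logits.shape)  # torch.Size([1, 640, 32000]); N text tokens from n vision tokens
```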

Compression & Resolution Modes

DeepSeek-OCR provides multiple resolution modes to balance compression ratio against accuracy.

Understanding Compression Ratio

Compression ratio = Text tokens ÷ Vision tokens. Higher = more aggressive compression.

Worked Example: A Wikipedia Article

Let's say a Wikipedia article has 2,000 text tokens (about 1,500 words).

Using Tiny mode (64 vision tokens): 2000 ÷ 64 ≈ 31x compression
Using Small mode (100 vision tokens): 2000 ÷ 100 = 20x compression
Using Base mode (256 vision tokens): 2000 ÷ 256 ≈ 7.8x compression

At 7.8x compression (Base mode), you'd get ~97% accuracy. At 20x (Small), ~60% accuracy.

Rule of thumb: Keep compression under 10x for high accuracy, under 12x for acceptable accuracy.
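A small helper that reproduces the arithmetic above; the accuracy bands are the rule-of-thumb numbers quoted in this section, not measured values, and the function name is illustrative.

```python
# Compression ratio = text tokens / vision tokens, with the rule-of-thumb bands above.
def compression_report(text_tokens: int, vision_tokens: int) -> str:
    ratio = text_tokens / vision_tokens
    if ratio < 10:
        band = "~97% accuracy (near-lossless)"
    elif ratio <= 12:
        band = "~90% accuracy (acceptable)"
    else:
        band = "~60% accuracy or worse (lossy)"
    return f"{ratio:.1f}x compression -> {band}"

for mode, tokens in [("Tiny", 64), ("Small", 100), ("Base", 256)]:
    print(mode, compression_report(2000, tokens))
# Tiny  31.2x compression -> ~60% accuracy or worse (lossy)
# Small 20.0x compression -> ~60% accuracy or worse (lossy)
# Base  7.8x compression  -> ~97% accuracy (near-lossless)
```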

Valid Token Count Formula

When images have different aspect ratios, padding to a square introduces invalid tokens. The valid token count is calculated as:

N_{\text{valid}} = \left\lceil N_{\text{actual}} \times \left[1 - \frac{\max(w,h) - \min(w,h)}{\max(w,h)}\right] \right\rceil

Symbol Breakdown:

  • $N_{\text{valid}}$: The number of vision tokens that actually contain useful information (not padding).
  • $N_{\text{actual}}$: Total vision tokens produced by the encoder for the padded square image.
  • $w, h$: Original width and height of the input image before padding.
  • $\lceil \cdot \rceil$: Ceiling function, which rounds up to the nearest integer.
  • $\frac{\max(w,h)-\min(w,h)}{\max(w,h)}$: The fraction of the image that is padding. A square image (w = h) has 0% padding.

Example: A 1024×512 image padded to 1024×1024 has 50% padding. If the encoder produces 256 tokens, only ~128 are valid.
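The formula translates directly into a few lines of plain Python (the function name is illustrative):

```python
# Direct implementation of the valid-token formula above.
import math

def valid_tokens(n_actual: int, w: int, h: int) -> int:
    padding_fraction = (max(w, h) - min(w, h)) / max(w, h)
    return math.ceil(n_actual * (1 - padding_fraction))

print(valid_tokens(256, w=1024, h=512))   # 128: half the tokens cover padding
print(valid_tokens(256, w=1024, h=1024))  # 256: square image, no padding
```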

Resolution Modes

Mode | Resolution | Vision Tokens | Use Case
Tiny | 512×512 | 64 | Max compression
Small | 640×640 | 100 | Balanced
Base | 1024×1024 | 256 | Standard quality
Large | 1280×1280 | 400 | High quality
Gundam | 640×640 tiles + 1024×1024 global | n×100 + 256 | Maximum quality

Gundam Mode
For high-resolution documents, Gundam mode tiles the image into overlapping 640×640 patches (100 tokens each) plus one global 1024×1024 view (256 tokens). This enables processing arbitrarily large documents while maintaining fine detail.
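The mode table can be captured as a simple lookup, with Gundam's token count as a function of the tile count n (the dictionary layout and the example tile count are illustrative, not a real config file):

```python
# Resolution modes as a lookup table, values copied from the table above.
MODES = {
    "tiny":  {"resolution": (512, 512),   "vision_tokens": 64},
    "small": {"resolution": (640, 640),   "vision_tokens": 100},
    "base":  {"resolution": (1024, 1024), "vision_tokens": 256},
    "large": {"resolution": (1280, 1280), "vision_tokens": 400},
}

def gundam_tokens(n_tiles: int) -> int:
    """n overlapping 640x640 tiles (100 tokens each) + one 1024x1024 global view (256)."""
    return n_tiles * 100 + 256

print(MODES["base"]["vision_tokens"])  # 256
print(gundam_tokens(4))                # 656, still under the <800 budget cited later
```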

Compression Ratio vs Accuracy

The key finding is a predictable relationship between compression ratio and OCR accuracy:

  • < 10x compression: ~97% accuracy
  • ~12x compression: ~90% accuracy
  • ~20x compression: ~60% accuracy

Compression ratio = (Text tokens) / (Vision tokens). Higher ratio = more aggressive compression.

Training Pipeline & Data

DeepSeek-OCR uses a carefully curated training pipeline with diverse document data.

Training Data Composition

OCR 1.0 Data (PDFs): 70%

30M pages across ~100 languages, 2M fine-annotated Chinese/English

General Vision Data: 20%

Caption, detection, and grounding tasks for general vision understanding

Text-Only Pretraining: 10%

In-house data at 8,192 token length for language modeling

Two-Stage Training

Stage 1: DeepEncoder Pretraining

  • Data: OCR 1.0/2.0 + 100M LAION samples
  • Epochs: 2
  • Batch Size: 1,280
  • Learning Rate: 5e-5
  • Optimizer: AdamW + cosine annealing
  • Sequence Length: 4,096 tokens

Stage 2: Full Model Training

  • Hardware: 20 nodes × 8 A100-40G
  • Data Parallelism: 40
  • Global Batch Size: 640
  • Learning Rate: 3e-5 (step scheduler)
  • Throughput: 90B tokens/day (text)
  • Multimodal: 70B tokens/day

Training Efficiency
The use of MoE architecture means only 570M parameters are activated per forward pass, despite the model having 3B total parameters. This significantly reduces training compute while maintaining model capacity.
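For orientation, here is a hedged PyTorch sketch of what the Stage 1 optimizer setup described above might look like (AdamW with cosine annealing, learning rate 5e-5, batch size 1,280); the model, loss, and step count are dummy placeholders, not the actual training code.

```python
# Sketch of the Stage 1 optimizer/scheduler setup with a placeholder model.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # placeholder standing in for DeepEncoder
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)

for step in range(3):  # training-loop skeleton only
    batch = torch.randn(1280, 1024)        # batch size 1,280 as in Stage 1
    loss = model(batch).pow(2).mean()      # dummy objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()                       # cosine annealing of the learning rate
```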

Document Types Covered

📄 PDFs · 📊 Charts · 📋 Tables · 🔢 Formulas · ⚗️ Chemistry · 📐 Geometry · 🖼️ Natural Images · ✍️ Handwriting

Experimental Results

DeepSeek-OCR achieves state-of-the-art results while using significantly fewer vision tokens than competing methods.

OmniDocBench Comparison

What is OmniDocBench?
A comprehensive benchmark for document understanding that measures OCR accuracy using edit distance - the number of character insertions, deletions, and substitutions needed to transform the predicted text into the ground truth.

Model | Tokens/Page | Edit Distance ↓
MinerU2.0 | 6,790+ | Higher
GOT-OCR2.0 | 256 | Medium
DeepSeek-OCR (Small) | 100 | Lower than GOT-OCR2.0
DeepSeek-OCR (Large) | 285 (valid) | State-of-the-art
DeepSeek-OCR (Gundam) | <800 | Better than MinerU2.0

Key Findings

2.5x More Efficient

Small mode (100 tokens) beats GOT-OCR2.0 (256 tokens) - achieving better accuracy with 2.5x fewer tokens!

8.5x More Efficient

Gundam mode (<800 tokens) beats MinerU2.0 (6,790+ tokens) - massive efficiency gain!

Fox Dataset Results

The Fox dataset is used to precisely measure the compression-accuracy tradeoff:

  • 96%+ accuracy at 10x compression
  • ~90% accuracy at 12x compression
  • ~60% accuracy at 20x compression

The Compression Cliff
There's a sharp accuracy degradation beyond 12x compression. This suggests a fundamental information-theoretic limit - text simply can't be compressed beyond a certain point while maintaining readability.

Why Does Accuracy Drop at High Compression?

Think about it this way: if you compress a 1000-word essay into just 50 vision tokens, each token must "remember" 20 words. There's only so much information a single embedding can hold.

  • At 10x: Each vision token represents ~10 text tokens. Manageable!
  • At 12x: Each vision token represents ~12 text tokens. Getting tight.
  • At 20x: Each vision token represents ~20 text tokens. Information loss starts.

The model learns to prioritize important words (nouns, verbs) over filler words ("the", "is"). But at very high compression, even important information gets lost.

Conclusion & Applications

Why This Actually Matters

You might be wondering: "Why not just use regular OCR?" Here's what makes DeepSeek-OCR special:

For AI Companies

  • Process entire libraries of PDFs in days, not months
  • Generate training data for LLMs at unprecedented scale
  • Reduce storage costs by storing images instead of text

For Everyone Else

  • Better ChatGPT/Claude answers from document understanding
  • Search inside PDFs and scanned documents
  • Preserve historical documents in searchable format

Bottom line: DeepSeek-OCR isn't just about reading text from images. It's about making all the world's documents accessible to AI systems at a fraction of the cost.

Complete Pipeline: From PDF to Text

Let's walk through exactly what happens when you process a document:

1. Render to Image: your PDF page (say, 800 text tokens) is rendered as a 1024×1024 pixel image.

2. SAM Extracts Shapes: SAM divides the image into 16×16 patches (64×64 = 4,096 patches) and identifies letter shapes.

3. 16x Compression: two conv layers compress the 4,096 patches down to 256 dense vision tokens.

4. CLIP Adds Understanding: CLIP processes the 256 tokens, adding semantic knowledge about words and document structure.

5. MoE Decoder Generates Text: the decoder (8 active experts per token: 6 routed + 2 shared) generates 800 text tokens from the 256 vision tokens. That's 3.1x compression with 97%+ accuracy!
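The same walkthrough as token bookkeeping in plain Python:

```python
# End-to-end token counts for the walkthrough above.
text_tokens   = 800                  # step 1: the PDF page's text
patches       = (1024 // 16) ** 2    # step 2: SAM patch grid -> 4,096
vision_tokens = patches // 16        # step 3: 16x conv compression -> 256
# step 4: CLIP keeps the token count at 256, adding global context
ratio = text_tokens / vision_tokens  # step 5: decoder expands 256 -> 800 text tokens
print(patches, vision_tokens, f"{ratio:.1f}x")  # 4096 256 3.1x
```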

Summary

DeepSeek-OCR demonstrates that text can be efficiently compressed through vision. By rendering text as images and using a specialized encoder-decoder architecture, we can achieve dramatic compression ratios while maintaining high fidelity.

  • Vision and text are interchangeable representations for dense textual content
  • Compression ratios up to 10x are nearly lossless (97% accuracy)
  • The two-stage encoder (SAM + CLIP) captures both perception and knowledge
  • MoE decoder enables efficient scaling with sparse activation

Production Impact

Single GPU: 200,000+ pages per day on one A100-40G

Full Cluster: 33 million pages per day on 20 nodes (160 GPUs)

Applications

LLM Training Data

Generate high-quality OCR data at scale for pretraining language models on document understanding.

Document Digitization

Convert legacy documents, scanned PDFs, and historical archives into searchable text.

Context Compression

Reduce context window usage by storing rendered pages as vision tokens instead of raw text.

Multimodal RAG

Enable retrieval-augmented generation directly on document images without text extraction.

Limitations & Future Work

  • Compression cliff: Beyond ~12x compression, accuracy degrades rapidly
  • Language coverage: Best results on Chinese/English, other languages need more data
  • Complex layouts: Multi-column and heavily formatted documents still challenging
  • Handwriting: Cursive and poor-quality scans remain difficult

The Big Picture

DeepSeek-OCR represents a paradigm shift: treating text and images as interchangeable representations. Instead of asking "how do we extract text from images?", we can ask "how do we compress text into images?" This enables entirely new workflows for document processing and LLM training at scale.