Compressing Text via Vision for Efficient LLM Training
Haoran Wei, Yaofeng Sun, Yukun Li • DeepSeek-AI • 2025
Imagine you have a 10-page essay to send to a friend. You could:
Option A: Send Raw Text
Type out all 5,000 words, character by character. Takes lots of space and bandwidth.
Option B: Take a Photo
Photograph the pages. Your friend reads from the image. Much smaller file!
DeepSeek-OCR is like having a super-smart friend who can look at a tiny, blurry photo of your essay and perfectly recite back every word. The "photo" (vision tokens) is much smaller than the original text, but contains all the information needed to reconstruct it.
A token is the basic unit AI models work with: for text, roughly a word or word fragment; for vision, a small patch of an image.
The key insight: 1000 text tokens can be "photographed" into just 100 vision tokens. That's 10x compression!
Training large language models requires massive amounts of text data. A typical document might contain thousands of tokens, and processing millions of documents becomes a computational bottleneck. What if we could compress text into a more efficient representation?
The insight is that rendering text as an image gives a much more compact (and slightly lossy) representation. A 1000-token paragraph might fit in a 512x512 image that encodes to just 64 vision tokens. If we can train a model to "read" this image back into text with high fidelity, we've achieved significant compression!
1. Take a document with N text tokens and render it as a 2D image (PDF page, screenshot, etc.).
2. Process the image through DeepEncoder to get n vision tokens, where n << N.
3. Use DeepSeek-3B-MoE to decode the vision tokens back into text with high accuracy.
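A shape-level sketch of these three steps in PyTorch, with random tensors standing in for the real renderer, encoder, and decoder (the hidden width d is an assumption), so only the token bookkeeping is meaningful:

```python
import torch

# Step 1: the document (N text tokens) rendered as an image (Small mode, 640x640).
N = 1000
page = torch.rand(1, 3, 640, 640)

# Step 2: DeepEncoder would emit n << N vision tokens; a random tensor stands in here.
n, d = 100, 1280                       # d (embedding width) is an assumed value
vision_tokens = torch.rand(1, n, d)

# Step 3: the DeepSeek-3B-MoE decoder would autoregressively reconstruct ~N text
# tokens conditioned on the n vision tokens.
print(f"{N} text tokens -> {n} vision tokens ({N / n:.1f}x compression)")
```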
DeepEncoder is the core innovation - a two-stage vision encoder that extracts both low-level visual features and high-level semantic knowledge, then compresses them dramatically.
Imagine you're reading a book. You actually use two different "modes" of vision:
Mode 1: Seeing Shapes
Your eyes detect the physical shapes of letters - the curves of "S", the lines of "T", the dots of "i". This is pure pattern recognition.
Mode 2: Understanding Words
Your brain recognizes "cat" as an animal, not just three shapes. This is semantic understanding - knowing what things mean.
DeepEncoder uses two separate neural networks - one for each mode! SAM handles "seeing shapes" and CLIP handles "understanding meaning".
| Component | Architecture | Parameters | Attention | Purpose |
|---|---|---|---|---|
| SAM | SAM-base, patch size 16 | ~80M | Window attention (local) | Extracts fine-grained visual features: edges, shapes, character strokes |
| CLIP | CLIP-large (modified) | ~300M | Dense global attention | Extracts semantic understanding: what words mean, document structure |
SAM was originally created by Meta to identify objects in images. Think of it as a model that's really good at answering "where are the edges?" and "what shapes do I see?"
CLIP was created by OpenAI to understand the meaning of images. It was trained on billions of image-caption pairs, so it knows that a photo of a cat relates to the word "cat".
Why combine them? SAM sees the tiny details (individual letter strokes), while CLIP understands the big picture (this is English text in Times New Roman font). Together, they capture everything needed to accurately reconstruct the text.
Between SAM and CLIP, a convolutional module performs aggressive 16x downsampling:
Conv2d(256 → 1024, kernel=3, stride=2, padding=1) × 2 layers
Each layer halves spatial dimensions: H×W → H/2×W/2 → H/4×W/4
Combined: a 16x reduction in token count (each stride-2 layer cuts the number of tokens by 4x, and 4 × 4 = 16)
Let's trace through a 1024×1024 image: with patch size 16, SAM produces a 64×64 grid of 4,096 patch tokens, and the two stride-2 conv layers shrink that grid to 16×16.
Total compression: 4,096 → 256 = 16x fewer tokens! The image information is now packed into just 256 dense embeddings.
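A minimal PyTorch sketch of this compressor, following the kernel/stride/padding above; the intermediate channel width and the activation between the two layers are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Two stride-2 convolutions: each halves H and W, so together they cut the
# token count by 16x. Intermediate width (512) and GELU are assumptions.
compressor = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),   # 64x64 -> 32x32
    nn.GELU(),
    nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
)

sam_features = torch.rand(1, 256, 64, 64)  # 1024x1024 image / patch 16 -> 64x64 = 4,096 patches
compressed = compressor(sam_features)
print(compressed.shape)                    # torch.Size([1, 1024, 16, 16]) -> 256 vision tokens
```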
The decoder uses a Mixture-of-Experts architecture for efficient scaling:
Imagine a hospital with 64 specialist doctors, but each patient only sees 6 of them.
Without MoE (Dense Model)
Every patient sees ALL 64 doctors. Extremely expensive! Like running a 3B parameter model where all 3B parameters are used for every input.
With MoE (Sparse Model)
A "router" picks the best 6 doctors for each patient. Same total expertise available, but only ~10% of the work! 3B parameters, but only 570M activated.
Result: You get the "intelligence" of a 3B model at the cost of a 570M model. This is why DeepSeek-OCR can process documents so quickly!
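A generic top-k routing sketch with these numbers (64 experts, 6 active per token); it illustrates the idea only and is not DeepSeek's router implementation:

```python
import torch
import torch.nn.functional as F

num_experts, top_k, hidden = 64, 6, 1280
router = torch.nn.Linear(hidden, num_experts)
experts = torch.nn.ModuleList([torch.nn.Linear(hidden, hidden) for _ in range(num_experts)])

tokens = torch.rand(4, hidden)                  # a small batch of token states
scores = F.softmax(router(tokens), dim=-1)      # the router scores every expert
weights, chosen = scores.topk(top_k, dim=-1)    # keep only the 6 best experts per token

out = torch.zeros_like(tokens)
for i in range(tokens.size(0)):                 # only 6 of the 64 experts run per token
    for w, e in zip(weights[i], chosen[i]):
        out[i] += w * experts[int(e)](tokens[i])
```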
The decoder learns a non-linear mapping from compressed vision tokens to text:
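Written schematically (the notation below is illustrative, not taken from the paper):

```latex
% Z: the n compressed vision tokens of width d produced by DeepEncoder
% \hat{X}: the N reconstructed text tokens, with n << N
\[
  \hat{X} = f_{\mathrm{dec}}(Z), \qquad
  Z \in \mathbb{R}^{n \times d}, \qquad
  \hat{X} = (\hat{x}_1, \dots, \hat{x}_N), \qquad n \ll N,
\]
% where f_dec is the DeepSeek-3B-MoE decoder, run autoregressively.
```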
Key insight: This is an expansion mapping. The decoder must learn to "decompress" information, generating N text tokens from only n vision tokens. This works because rendered text has significant redundancy (fonts, spacing, formatting are predictable).
DeepSeek-OCR provides multiple resolution modes to balance compression ratio against accuracy.
Compression ratio = Text tokens ÷ Vision tokens. Higher = more aggressive compression.
Let's say a Wikipedia article has 2,000 text tokens (about 1,500 words).
At 7.8x compression (Base mode), you'd get ~97% accuracy. At 20x (Small), ~60% accuracy.
Rule of thumb: Keep compression under 10x for high accuracy, under 12x for acceptable accuracy.
When images have different aspect ratios, padding to a square introduces invalid tokens. The valid token count is approximately: valid tokens ≈ total tokens × min(w, h) / max(w, h).
Example: A 1024×512 image padded to 1024×1024 has 50% padding. If the encoder produces 256 tokens, only ~128 are valid.
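A small helper that reproduces this calculation; the formula is reconstructed from the example above, so treat it as an approximation rather than the paper's exact expression:

```python
import math

def valid_vision_tokens(width: int, height: int, total_tokens: int) -> int:
    """Approximate valid tokens after padding a width x height page to a square:
    padded rows/columns carry no content, so only ~min(w, h)/max(w, h) of the
    tokens are valid."""
    return math.ceil(total_tokens * min(width, height) / max(width, height))

valid = valid_vision_tokens(1024, 512, 256)   # -> 128 valid tokens
print(valid, f"{2000 / valid:.1f}x")          # effective compression for a 2,000-token page
```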
| Mode | Resolution | Vision Tokens | Use Case |
|---|---|---|---|
| Tiny | 512×512 | 64 | Max compression |
| Small | 640×640 | 100 | Balanced |
| Base | 1024×1024 | 256 | Standard quality |
| Large | 1280×1280 | 400 | High quality |
| Gundam | n×640×640 tiles + 1024×1024 global view | n×100 + 256 | Maximum quality |
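The token budgets from the table, expressed as a quick lookup (mode names lowercased here; the Gundam tile count n depends on how the page is tiled):

```python
# Vision-token budgets per resolution mode, taken from the table above.
MODE_TOKENS = {
    "tiny": 64,     # 512 x 512
    "small": 100,   # 640 x 640
    "base": 256,    # 1024 x 1024
    "large": 400,   # 1280 x 1280
}

def gundam_tokens(num_tiles: int) -> int:
    """Gundam mode: num_tiles local 640x640 tiles plus one 1024x1024 global view."""
    return num_tiles * 100 + 256

print(MODE_TOKENS["base"], gundam_tokens(4))   # 256 656
```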
The key finding is a predictable relationship between compression ratio and OCR accuracy: accuracy stays around 97% below roughly 10x compression and drops to about 60% near 20x.
DeepSeek-OCR uses a carefully curated training pipeline with diverse document data.
- OCR document data: 30M pages across ~100 languages, including 2M finely annotated Chinese/English pages
- General vision data: caption, detection, and grounding tasks for general vision understanding
- Text-only data: in-house data at 8,192-token sequence length for language modeling
Document types covered: PDFs, charts, tables, formulas, chemistry, geometry, natural images, and handwriting.
DeepSeek-OCR achieves state-of-the-art results while using significantly fewer vision tokens than competing methods.
| Model | Tokens/Page | Edit Distance ↓ |
|---|---|---|
| MinerU2.0 | 6,790+ | Higher |
| GOT-OCR2.0 | 256 | Medium |
| DeepSeek-OCR (Small) | 100 | Lower than GOT-OCR2.0 |
| DeepSeek-OCR (Large) | 285 (valid) | State-of-the-art |
| DeepSeek-OCR (Gundam) | <800 | Better than MinerU2.0 |
Small mode (100 tokens) beats GOT-OCR2.0 (256 tokens) - achieving better accuracy with 2.5x fewer tokens!
Gundam mode (<800 tokens) beats MinerU2.0 (6,790+ tokens) - massive efficiency gain!
The Fox dataset is used to precisely measure the compression-accuracy tradeoff:
Think about it this way: if you compress a 1000-word essay into just 50 vision tokens, each token must "remember" 20 words. There's only so much information a single embedding can hold.
The model learns to prioritize important words (nouns, verbs) over filler words ("the", "is"). But at very high compression, even important information gets lost.
You might be wondering: "Why not just use regular OCR?" The difference is the goal: conventional OCR systems extract text, while DeepSeek-OCR is built as a token compressor, so documents can be handed to LLMs at a fraction of the token cost.
Bottom line: DeepSeek-OCR isn't just about reading text from images. It's about making all the world's documents accessible to AI systems at a fraction of the cost.
Let's walk through exactly what happens when you process a document:
1. Render to Image: your PDF page (say, 800 text tokens) is rendered as a 1024×1024-pixel image.
2. SAM Extracts Shapes: SAM divides the image into 16×16-pixel patches (64×64 = 4,096 patches) and identifies letter shapes.
3. 16x Compression: two conv layers compress the 4,096 patches down to 256 dense vision tokens.
4. CLIP Adds Understanding: CLIP processes the 256 tokens, adding semantic knowledge about words and document structure.
5. MoE Decoder Generates Text: the decoder (with 6 experts activated per token) generates 800 text tokens from 256 vision tokens. That's 3.1x compression with 97%+ accuracy!
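The same walk-through as straight arithmetic:

```python
# Token accounting for one 1024x1024 page carrying ~800 text tokens.
image_side, patch = 1024, 16
sam_patches = (image_side // patch) ** 2    # 64 * 64 = 4,096 patches
vision_tokens = sam_patches // 16           # 16x conv compression -> 256 tokens
text_tokens = 800                           # text tokens on the original page

print(f"{sam_patches} patches -> {vision_tokens} vision tokens -> "
      f"{text_tokens} text tokens ({text_tokens / vision_tokens:.1f}x compression)")
```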
DeepSeek-OCR demonstrates that text can be efficiently compressed through vision. By rendering text as images and using a specialized encoder-decoder architecture, we can achieve dramatic compression ratios while maintaining high fidelity.
Throughput: 200k+ pages per day on a single A100-40G, and roughly 33M pages per day on 20 nodes (160 GPUs).
- Training data generation: produce high-quality OCR data at scale for pretraining language models on document understanding.
- Document digitization: convert legacy documents, scanned PDFs, and historical archives into searchable text.
- Context compression: reduce context-window usage by storing rendered pages as vision tokens instead of raw text.
- Multimodal retrieval: enable retrieval-augmented generation directly on document images without text extraction.
DeepSeek-OCR represents a paradigm shift: treating text and images as interchangeable representations. Instead of asking "how do we extract text from images?", we can ask "how do we compress text into images?" This enables entirely new workflows for document processing and LLM training at scale.