A New Axis of Sparsity for Large Language Models
Xin Cheng, Wangding Zeng, Damai Dai, et al. • DeepSeek-AI / Peking University • 2025
Current Large Language Models face an interesting inefficiency. When you ask them about "Alexander the Great" or "Princess Diana", they must use multiple layers of attention and feed-forward networks just to recognize these entities. This is like using a calculator to remember what 2+2 equals instead of just looking it up!
Language modeling involves two fundamentally different tasks:
Deep, dynamic computation for understanding context, logic, and complex relationships. This is what neural networks excel at.
Static, local patterns like named entities ("New York City"), idioms ("by the way"), and factual associations. These are predictable and could be looked up.
The paper proposes that we need two complementary types of sparsity:
Conditional computation (Mixture-of-Experts): sparsely activates different "expert" networks based on the input. Great for dynamic reasoning.
Conditional memory (Engram): sparsely retrieves static embeddings via O(1) lookup. Perfect for knowledge storage.
Engram is elegantly simple in concept: use the last few tokens (N-grams) as a key to look up embeddings from a massive table, then intelligently combine them with the model's hidden states.
Extract suffix N-grams from the token sequence, hash them to get indices, and retrieve embeddings from the table. This is O(1) - constant time!
Use the current hidden state to "gate" the retrieved embeddings. If they match the context, use them; if not, suppress them.
Standard tokenizers often assign different IDs to semantically equivalent tokens (e.g., "Apple" vs " apple" vs "APPLE"). Engram compresses these into canonical forms before hashing.
This surjective (many-to-one) mapping collapses roughly 23% of the vocabulary, increasing semantic density by ensuring that "Apple", " apple", and "APPLE" all retrieve the same embeddings.
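As a concrete illustration, here is a minimal canonicalization sketch, assuming the mapping simply normalizes case and leading whitespace (the paper's exact normalization rules may differ):

```python
def canonicalize(token: str) -> str:
    # Many-to-one (surjective) mapping: "Apple", " apple", and "APPLE"
    # all collapse to the same canonical key and thus the same embeddings.
    # Assumed rules: strip surrounding whitespace and lowercase.
    return token.strip().lower()

assert canonicalize("Apple") == canonicalize(" apple") == canonicalize("APPLE") == "apple"
```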
To handle hash collisions, Engram uses K different hash functions per N-gram order, and each head maps to a different embedding table.
Why multiple heads? Using K different hash functions means K different embeddings per N-gram. This reduces the impact of hash collisions - if two different N-grams collide in one head, they likely won't collide in all K heads.
All retrieved embeddings are concatenated into a single memory vector.
Dimension calculation: if each embedding has dimension d and we use N-gram orders {2, 3} with K = 4 heads per order, the concatenated vector has dimension 2 × 4 × d = 8d.
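A minimal sketch of the multi-head, multi-order lookup is shown below; the table size, hash function, and dimensions are illustrative placeholders, not the paper's actual configuration:

```python
import torch

D, K, ORDERS, TABLE_SIZE = 64, 4, (2, 3), 10_000   # tiny sizes for the sketch

# One embedding table per (N-gram order, head) pair.
tables = {(n, k): torch.nn.Embedding(TABLE_SIZE, D) for n in ORDERS for k in range(K)}

def hash_ngram(ngram, head):
    # Salted toy hash standing in for the K independent hash functions per order.
    return hash((head,) + tuple(ngram)) % TABLE_SIZE

def retrieve(tokens, pos):
    """Concatenate embeddings of all suffix N-grams ending at position `pos`."""
    parts = []
    for n in ORDERS:
        ngram = tokens[pos - n + 1 : pos + 1]          # suffix N-gram
        for k in range(K):
            idx = torch.tensor(hash_ngram(ngram, k))
            parts.append(tables[(n, k)](idx))          # O(1) table lookup
    return torch.cat(parts, dim=-1)

mem = retrieve([101, 734, 2048, 56], pos=3)
assert mem.shape[-1] == len(ORDERS) * K * D            # 2 * 4 * 64 = 512
```

Each lookup is a plain array index, so the cost per position stays O(1) regardless of how large the tables grow.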
The gating mechanism computes a scalar g that controls how much of the retrieved memory to use.
Intuition: this gate is based on the cosine similarity between the hidden state (what the model expects) and the retrieved memory (what we found). If they align (high similarity), g is close to 1 and the memory is used. If they conflict (low similarity), g is close to 0 and the memory is suppressed.
If the retrieved memory contradicts the context, g ≈ 0, effectively ignoring it. A final depthwise convolution then adds local smoothing.
Why convolution? The convolution smooths the memory output across adjacent positions. If position 5 strongly activates memory but position 6 doesn't, the convolution helps blend them for more coherent output.
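Putting the two pieces together, here is a hedged sketch of the gate-then-smooth step, assuming the gate is a sigmoid of the cosine similarity and the convolution is causal over the sequence axis (the paper's exact parameterization may differ):

```python
import torch
import torch.nn.functional as F

def gated_memory(hidden, memory, conv_weight):
    """Sketch of gating plus depthwise smoothing (assumed form, not the paper's code).

    hidden:      (batch, seq, d)  current hidden states
    memory:      (batch, seq, d)  retrieved N-gram embeddings, projected to d
    conv_weight: (d, 1, k)        depthwise conv kernel over the sequence axis
    """
    # Scalar gate per position: high when the memory agrees with the context.
    gate = torch.sigmoid(F.cosine_similarity(hidden, memory, dim=-1))   # (b, s)
    gated = gate.unsqueeze(-1) * memory                                 # (b, s, d)

    # Causal depthwise convolution smooths the memory signal across
    # adjacent positions without looking at future tokens.
    x = gated.transpose(1, 2)                                           # (b, d, s)
    k = conv_weight.shape[-1]
    x = F.pad(x, (k - 1, 0))                                            # left-pad only
    smoothed = F.conv1d(x, conv_weight, groups=x.shape[1])
    return smoothed.transpose(1, 2)                                     # (b, s, d)
```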
There is also a trade-off in where to place Engram within the network.
The key question: given a fixed parameter budget, how should we split between MoE experts and Engram memory?
Define the allocation ratio ρ as the fraction of the inactive (non-activated) parameter budget assigned to MoE experts; the remaining fraction 1 − ρ goes to Engram's embedding tables.
Key insight: this is a zero-sum allocation. Parameters given to MoE are taken from Engram and vice versa, so the question is finding the optimal balance.
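To make the zero-sum constraint concrete, here is a toy split; the budget and ratio are made-up numbers, not values from the paper:

```python
# Toy illustration of the zero-sum parameter split.
inactive_budget = 20e9            # total non-activated parameter budget (made up)
rho = 0.75                        # fraction allocated to MoE experts (made up)
moe_params = rho * inactive_budget
engram_params = (1 - rho) * inactive_budget
assert moe_params + engram_params == inactive_budget   # zero-sum: 15B + 5B = 20B
```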
MoE-dominated: No dedicated memory for static patterns. Model wastes depth reconstructing them through computation.
Optimal: MoE handles dynamic reasoning, Engram handles static patterns. Each does what it's best at.
Engram-dominated: Loses conditional computation capacity. Memory can't replace reasoning!
What if we relax memory constraints and scale Engram aggressively? The results show a log-linear relationship: doubling memory slots predictably reduces loss.
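"Log-linear" here means loss is linear in the logarithm of the slot count, so every doubling of slots lowers loss by a constant amount. A minimal illustration of that functional form, with placeholder coefficients rather than the paper's fitted values:

```python
import math

def predicted_loss(num_slots, a=2.50, b=0.02):
    # Placeholder coefficients a, b; only the log-linear form is the point.
    return a - b * math.log2(num_slots)

# Each doubling of memory slots reduces predicted loss by the same amount b.
assert abs((predicted_loss(2**20) - predicted_loss(2**21)) - 0.02) < 1e-9
```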
The paper validates Engram at scale with four models: Dense-4B, MoE-27B, Engram-27B, and Engram-40B. All use identical training data and activated parameters.
| Benchmark | MoE-27B | Engram-27B | Gain |
|---|---|---|---|
| MMLU | 57.4 | 60.4 | +3.0 |
| CMMLU | 57.9 | 61.9 | +4.0 |
| BBH (Reasoning) | 50.9 | 55.9 | +5.0 |
| ARC-Challenge | 70.1 | 73.8 | +3.7 |
| HumanEval (Code) | 37.8 | 40.8 | +3.0 |
| MATH | 28.3 | 30.7 | +2.4 |
| GSM8K | 58.4 | 60.6 | +2.2 |
By delegating local patterns to lookups, Engram frees attention to focus on global context, which the paper evaluates on 32k-context benchmarks.
Why does a memory module improve reasoning? The paper provides compelling mechanistic evidence.
When a standard LLM processes "Diana, Princess of Wales", each layer progressively builds understanding:
Six layers of computation just to recognize an entity! Engram can retrieve this in O(1).
Using LogitLens (projecting hidden states through the final LM head), we can measure how "ready" each layer's representation is for prediction via KL divergence from the final output.
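A minimal LogitLens sketch (not the paper's instrumentation code) might look like the following, where `hidden_states` and `lm_head` are hypothetical handles into a decoder-only model:

```python
import torch
import torch.nn.functional as F

def logitlens_kl(hidden_states, lm_head, final_logits):
    """Project each layer's hidden states through the final LM head and measure
    KL divergence to the model's final output distribution.

    hidden_states: list of (seq, d) tensors, one per layer (assumed already
                   passed through the final norm); lm_head: (vocab, d).
    """
    final_logp = F.log_softmax(final_logits, dim=-1)            # (seq, vocab)
    kls = []
    for h in hidden_states:
        layer_logp = F.log_softmax(h @ lm_head.T, dim=-1)       # (seq, vocab)
        # KL(final || layer): how far this layer still is from the final prediction.
        kl = F.kl_div(layer_logp, final_logp, log_target=True, reduction="batchmean")
        kls.append(kl.item())
    return kls  # lower = representation is closer to "ready" for prediction
```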
Centered Kernel Alignment (CKA) measures representational similarity between layers. The analysis reveals:
Engram's Layer 5 representations align with MoE's Layer 12! The model is effectively deeper because early layers don't waste capacity on memorization.
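For reference, linear CKA between two sets of activations collected on the same inputs can be computed as below; this is the standard formulation, not code from the paper:

```python
import torch

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two layers' representations.
    X: (n_tokens, d1), Y: (n_tokens, d2) activations on the same inputs.
    """
    X = X - X.mean(dim=0, keepdim=True)   # center features
    Y = Y - Y.mean(dim=0, keepdim=True)
    numerator = (Y.T @ X).norm(p="fro") ** 2
    denominator = (X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro")
    return (numerator / denominator).item()  # 1.0 = identical up to rotation/scale
```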
Ablating Engram at inference (zeroing out its output) reveals a sharp functional dichotomy; a minimal hook-based sketch of the ablation follows the two findings below:
Factual knowledge: with Engram ablated, TriviaQA retains only 29% of its original score and PopQA 44%. Engram is the primary knowledge repository.
Reading comprehension: C3 retains 93% and RACE-Middle 89%. Context-grounded tasks rely on attention over the given passage, not on Engram.
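A hedged sketch of such an ablation in PyTorch, where `model.engram` is a hypothetical attribute name for the memory module:

```python
import torch

def ablate_module(module):
    # Forward hook that replaces the module's output with zeros at inference,
    # assuming the module returns a single tensor.
    def zero_output(mod, inputs, output):
        return torch.zeros_like(output)
    return module.register_forward_hook(zero_output)

# handle = ablate_module(model.engram)  # evaluate with conditional memory disabled
# handle.remove()                       # restore normal behavior
```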
The gating mechanism activates selectively on multi-token patterns:
"Only Alexander the Great could tame the horse Bucephalus."
"By the way, I am a fan of the Milky Way."
"This study analyzes the media impact of Diana, Princess of Wales."
In the paper's visualization, red highlighting marks positions with high gating activation; it lands on the named entities and idioms in these examples, showing that Engram targets exactly these multi-token patterns.
Engram introduces conditional memory as a new primitive for LLM architecture, complementing the established conditional computation (MoE) paradigm. Several directions remain open:
Current work uses 2- and 3-grams; larger scales may benefit from higher-order N-grams (4 and above).
Current embeddings are static. Online learning could enable adaptation.
Could image/audio patterns benefit from similar lookup mechanisms?
With NVMe SSDs added to the memory hierarchy, the embedding tables could be pushed to much larger sizes.
Engram represents a shift in how we think about LLM architecture. Instead of making neural networks do everything through computation, we can complement them with specialized primitives for different tasks. Conditional memory for knowledge retrieval is just the beginning - what other cognitive functions could benefit from dedicated architectural support?