Research Paper Explanation

Engram: Conditional Memory via Scalable Lookup

A New Axis of Sparsity for Large Language Models

Xin Cheng, Wangding Zeng, Damai Dai, et al. • DeepSeek-AI / Peking University • 2025

MoE • N-gram • Sparsity • Memory Augmentation • Scaling Laws

Abstract & Key Contributions

TL;DR
Engram introduces conditional memory as a new way to scale LLMs alongside Mixture-of-Experts (MoE). Instead of using computation to "remember" things, Engram uses fast O(1) lookups into a massive embedding table. This makes models better at both knowledge retrieval AND reasoning!

The Problem: LLMs Waste Compute on Memorization

Current Large Language Models face an interesting inefficiency. When you ask them about "Alexander the Great" or "Princess Diana", they must use multiple layers of attention and feed-forward networks just to recognize these entities. This is like using a calculator to remember what 2+2 equals instead of just looking it up!

Language modeling involves two fundamentally different tasks:

Compositional Reasoning

Deep, dynamic computation for understanding context, logic, and complex relationships. This is what neural networks excel at.

Knowledge Retrieval

Static, local patterns like named entities ("New York City"), idioms ("by the way"), and factual associations. These are predictable and could be looked up.

The Solution: Two Axes of Sparsity

The paper proposes that we need two complementary types of sparsity:

1. Conditional Computation (MoE): Sparsely activates different "expert" networks based on input. Great for dynamic reasoning.

2. Conditional Memory (Engram): Sparsely retrieves static embeddings via O(1) lookup. Perfect for knowledge storage.

Key Contributions

  • Engram Module: Modernized N-gram embeddings with multi-head hashing, context-aware gating, and tokenizer compression
  • U-Shaped Scaling Law: Discovered optimal allocation between MoE and Engram (around 75-80% to MoE)
  • 27B Scale Validation: Outperforms iso-parameter MoE baselines on knowledge AND reasoning tasks
  • Mechanistic Insights: Engram effectively "deepens" the network by freeing early layers from memorization
  • System Efficiency: Deterministic addressing enables prefetching with <3% overhead even for 100B parameter tables

The Engram Architecture

Engram is elegantly simple in concept: use the last few tokens (N-grams) as a key to look up embeddings from a massive table, then intelligently combine them with the model's hidden states.

Two-Phase Pipeline

Phase 1: Retrieval

Extract suffix N-grams from the token sequence, hash them to get indices, and retrieve embeddings from the table. This is O(1) - constant time!

Phase 2: Fusion

Use the current hidden state to "gate" the retrieved embeddings. If they match the context, use them; if not, suppress them.

Tokenizer Compression

Standard tokenizers often assign different IDs to semantically equivalent tokens (e.g., "Apple" vs " apple" vs "APPLE"). Engram compresses these into canonical forms:

x'_t = \mathcal{P}(x_t)

Symbol Breakdown:

x_t
The original token ID at position t in the sequence (e.g., token ID 1234 for "Apple")
x'_t
The compressed/canonical token ID after normalization (e.g., all variants of "apple" map to the same ID)
\mathcal{P}
The projection function that maps tokens to canonical forms using NFKC normalization, lowercasing, etc.
V \to V'
Maps from the original vocabulary V (128k tokens) to the compressed vocabulary V' (~23% smaller)

This surjective (many-to-one) function collapses ~23% of vocabulary, increasing semantic density by ensuring "Apple", " apple", and "APPLE" all retrieve the same embeddings.
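
Below is a minimal sketch of this canonicalization in Python. Only NFKC normalization, lowercasing, and whitespace stripping are applied, and the toy vocabulary is invented for illustration; the paper's exact normalization rules and compressed vocabulary are not reproduced here.

```python
import unicodedata

def build_projection(vocab: dict[str, int]) -> dict[int, int]:
    """Sketch of the projection P: map each original token ID to a canonical ID
    shared by all tokens that normalize to the same surface form."""
    canonical_ids: dict[str, int] = {}   # canonical string -> compressed ID
    projection: dict[int, int] = {}      # original ID -> compressed ID
    for token, token_id in vocab.items():
        # NFKC normalization + lowercasing + stripping surrounding whitespace
        canon = unicodedata.normalize("NFKC", token).lower().strip()
        canonical_ids.setdefault(canon, len(canonical_ids))
        projection[token_id] = canonical_ids[canon]
    return projection

# Toy vocabulary: "Apple", " apple", and "APPLE" collapse to one canonical ID.
vocab = {"Apple": 0, " apple": 1, "APPLE": 2, "banana": 3}
proj = build_projection(vocab)
assert proj[0] == proj[1] == proj[2] != proj[3]
```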

Multi-Head Hashing

To handle hash collisions, Engram uses K different hash functions per N-gram order. Each head maps to a different embedding table:

z_{t,n,k} \triangleq \varphi_{n,k}(g_{t,n}), \quad \mathbf{e}_{t,n,k} = \mathbf{E}_{n,k}[z_{t,n,k}]

Symbol Breakdown:

z_{t,n,k}
The hash index for the N-gram at position t, using N-gram order n and hash head k. This is an integer pointing to a slot in the embedding table.
g_{t,n}
The N-gram (sequence of n tokens) ending at position t. For example, if n=2 and position 5 has "the cat", then g_{5,2} = ("the", "cat").
\varphi_{n,k}
A multiplicative-XOR hash function. Each (N-gram order, head index) pair has its own hash function with different seed parameters to ensure different collision patterns.
\mathbf{E}_{n,k}
The embedding table for N-gram order n and head k. This is a 2D matrix of shape (num_slots × embedding_dim). Each row stores a learned embedding vector.
\mathbf{e}_{t,n,k}
The retrieved embedding vector from table \mathbf{E}_{n,k} at index z_{t,n,k}. This is the "memory" associated with the N-gram.
\triangleq
"Defined as" - indicates a definition rather than an equality derived from other equations.

Why multiple heads? Using K different hash functions means K different embeddings per N-gram. This reduces the impact of hash collisions - if two different N-grams collide in one head, they likely won't collide in all K heads.

All retrieved embeddings are concatenated:

\mathbf{e}_t \triangleq \|_{n=2}^{N} \|_{k=1}^{K} \mathbf{e}_{t,n,k}

Symbol Breakdown:

\mathbf{e}_t
The final concatenated embedding at position t. This is a single large vector that combines all retrieved memories for this position.
\|
Concatenation operator - joins vectors end-to-end. If two 128-dim vectors are concatenated, the result is a 256-dim vector.
\|_{n=2}^{N}
Concatenate across all N-gram orders from 2 to N. For example, if N=3, concatenate bi-gram and tri-gram embeddings.
\|_{k=1}^{K}
Concatenate across all K hash heads. If K=4, each N-gram order contributes 4 embeddings.

Dimension calculation: If each embedding has dimension d, and we use N-gram orders {2, 3} with K=4 heads, the final \mathbf{e}_t has dimension: 2 × 4 × d = 8d
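
A sketch of this retrieval path in PyTorch. The table sizes, hash multipliers, and dimensions below are illustrative placeholders rather than the paper's configuration; only the structure follows the description above (one table per (order, head) pair, a multiplicative-XOR style hash, and concatenation of all retrieved embeddings).

```python
import random
import torch

N_ORDERS = (2, 3)    # N-gram orders: bi-grams and tri-grams
K_HEADS = 4          # hash heads per order
NUM_SLOTS = 2**16    # slots per table (small for the sketch; real tables are vastly larger)
DIM = 128            # embedding dimension per head (illustrative)

# One embedding table E_{n,k} per (order, head) pair.
tables = {(n, k): torch.nn.Embedding(NUM_SLOTS, DIM)
          for n in N_ORDERS for k in range(K_HEADS)}

# Per-(order, head) odd multipliers, giving each head a different collision pattern.
rng = random.Random(0)
mults = {key: rng.randrange(1, 2**31) * 2 + 1 for key in tables}

def hash_ngram(ngram: tuple[int, ...], n: int, k: int) -> int:
    """phi_{n,k}: multiplicative-XOR style mixing of the N-gram's token IDs."""
    h = 0
    for tok in ngram:
        h = ((h * mults[(n, k)]) ^ tok) & 0xFFFFFFFF
    return h % NUM_SLOTS

def retrieve(tokens: list[int], t: int) -> torch.Tensor:
    """Build e_t by concatenating the K retrieved embeddings of every suffix N-gram."""
    pieces = []
    for n in N_ORDERS:
        ngram = tuple(tokens[t - n + 1 : t + 1])        # suffix N-gram g_{t,n}
        for k in range(K_HEADS):
            idx = hash_ngram(ngram, n, k)               # z_{t,n,k}
            pieces.append(tables[(n, k)].weight[idx])   # e_{t,n,k} = E_{n,k}[z_{t,n,k}]
    return torch.cat(pieces)                            # 2 orders x 4 heads x DIM = 8*DIM

e_t = retrieve([17, 4021, 993, 15, 88], t=4)
print(e_t.shape)   # torch.Size([1024])
```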

Context-Aware Gating

Why Gating?
Retrieved embeddings are static and context-independent. They might be wrong due to hash collisions or polysemy (same word, different meanings). The gating mechanism uses the model's understanding of context to decide whether to trust the retrieved memory.

The gating mechanism computes a scalar \alpha_t \in (0, 1) that controls how much of the retrieved memory to use:

\alpha_t = \sigma\left(\frac{\text{RMSNorm}(\mathbf{h}_t)^\top \text{RMSNorm}(\mathbf{k}_t)}{\sqrt{d}}\right)

Symbol Breakdown:

\alpha_t
The gate value at position t, ranging from 0 to 1. Controls how much retrieved memory is used: 0 = ignore memory, 1 = fully trust memory.
\sigma
The sigmoid function \sigma(x) = \frac{1}{1+e^{-x}}. Squashes any real number into the (0, 1) range.
\mathbf{h}_t
The hidden state vector at position t from the transformer. This represents the model's "understanding" of context up to this point.
\mathbf{k}_t
The "key" projection of the retrieved memory: \mathbf{k}_t = \mathbf{W}_K \mathbf{e}_t, where \mathbf{W}_K is a learned weight matrix.
\text{RMSNorm}
Root Mean Square Layer Normalization - normalizes vectors to have unit RMS, stabilizing training without centering.
\top
Transpose operator. The dot product \mathbf{a}^\top \mathbf{b} measures similarity between vectors a and b.
\sqrt{d}
Square root of the embedding dimension. Dividing by this prevents the dot product from growing too large in high dimensions (standard scaled dot-product attention trick).

Intuition: This computes cosine similarity between the hidden state (what the model expects) and the retrieved memory (what we found). If they align (high similarity), \alpha_t \to 1 and we use the memory. If they conflict (low similarity), \alpha_t \to 0 and we ignore it.
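
A sketch of the gate itself, following the formula above. The hidden size, memory dimension, and the learned projection \mathbf{W}_K are placeholders, and RMSNorm is written without its learned scale for brevity.

```python
import math
import torch

D_MODEL = 1024      # transformer hidden size (illustrative)
D_MEM = 8 * 128     # dimension of the concatenated memory e_t (from the sketch above)
W_K = torch.nn.Linear(D_MEM, D_MODEL, bias=False)   # key projection W_K

def rmsnorm(x: torch.Tensor) -> torch.Tensor:
    # RMSNorm without the learned scale, for brevity
    return x / x.pow(2).mean(dim=-1, keepdim=True).add(1e-6).sqrt()

def gate(h_t: torch.Tensor, e_t: torch.Tensor) -> torch.Tensor:
    """alpha_t = sigmoid( RMSNorm(h_t)^T RMSNorm(W_K e_t) / sqrt(d) )"""
    k_t = W_K(e_t)                                                    # k_t = W_K e_t
    score = (rmsnorm(h_t) * rmsnorm(k_t)).sum(dim=-1) / math.sqrt(D_MODEL)
    return torch.sigmoid(score)                                       # scalar in (0, 1)

alpha_t = gate(torch.randn(D_MODEL), torch.randn(D_MEM))
```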

If the retrieved memory contradicts the context, \alpha_t \to 0, effectively ignoring it. A final depthwise convolution adds local smoothing:

\mathbf{Y} = \text{SiLU}(\text{Conv1D}(\text{RMSNorm}(\tilde{\mathbf{V}}))) + \tilde{\mathbf{V}}

Symbol Breakdown:

\mathbf{Y}
The final output of the Engram module - the "memory contribution" that gets added to the model's hidden state.
\tilde{\mathbf{V}}
The gated value: \tilde{\mathbf{V}}_t = \alpha_t \cdot \mathbf{v}_t, where \mathbf{v}_t = \mathbf{W}_V \mathbf{e}_t is the value projection of retrieved memory, scaled by the gate.
\text{Conv1D}
1D depthwise convolution with kernel size 3. Applies a sliding window across the sequence, allowing information to flow between neighboring positions for local smoothing.
\text{SiLU}
Sigmoid Linear Unit activation: \text{SiLU}(x) = x \cdot \sigma(x). Also called "Swish". Allows smooth non-linear transformations.
+ \tilde{\mathbf{V}}
Residual connection - adds the original gated value back. This ensures the module can learn to "do nothing" if needed (identity mapping).

Why convolution? The convolution smooths the memory output across adjacent positions. If position 5 strongly activates memory but position 6 doesn't, the convolution helps blend them for more coherent output.
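
A sketch of the fusion step, matching the formula above. The value projection \mathbf{W}_V, the dimensions, and the symmetric padding of the depthwise convolution are assumptions for illustration (a decoder would typically use causal padding).

```python
import torch
import torch.nn.functional as F

D_MODEL, D_MEM = 1024, 8 * 128
W_V = torch.nn.Linear(D_MEM, D_MODEL, bias=False)                    # value projection W_V
conv = torch.nn.Conv1d(D_MODEL, D_MODEL, kernel_size=3, padding=1,   # depthwise Conv1D
                       groups=D_MODEL)

def rmsnorm(x: torch.Tensor) -> torch.Tensor:
    return x / x.pow(2).mean(dim=-1, keepdim=True).add(1e-6).sqrt()

def fuse(alpha: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
    """Y = SiLU(Conv1D(RMSNorm(V_tilde))) + V_tilde, with V_tilde_t = alpha_t * W_V e_t."""
    v_tilde = alpha.unsqueeze(-1) * W_V(e)          # gated values, shape (seq, d)
    x = rmsnorm(v_tilde).T.unsqueeze(0)             # (1, d, seq) layout for Conv1d
    y = F.silu(conv(x)).squeeze(0).T                # smooth across neighboring positions
    return y + v_tilde                              # residual connection

seq_len = 16
Y = fuse(torch.rand(seq_len), torch.randn(seq_len, D_MEM))   # (seq, d) memory contribution
```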

Where to Place Engram?

There's a trade-off in placing Engram:

  • Too early: Hidden states haven't aggregated enough context for good gating
  • Too late: The model has already wasted compute on pattern reconstruction

Optimal Placement
Experiments show Layer 2 is optimal for a single Engram module. Even better: split into two modules at Layers 2 and 15, combining early intervention with rich late-stage gating.

Scaling Laws & Sparsity Allocation

The key question: given a fixed parameter budget, how should we split between MoE experts and Engram memory?

The Sparsity Allocation Problem

Define the allocation ratio \rho \in [0, 1] as the fraction of inactive parameters assigned to MoE:

P_{\text{MoE}}^{(\text{sparse})} = \rho \cdot P_{\text{sparse}}, \quad P_{\text{Engram}} = (1 - \rho) \cdot P_{\text{sparse}}

Symbol Breakdown:

P_{\text{sparse}}
Total "inactive" (sparse) parameters available for allocation. These are parameters that exist in the model but are not all activated for every input - they can be split between MoE experts and Engram memory.
\rho
The allocation ratio, between 0 and 1. This is the key design choice: what fraction of sparse parameters should go to MoE vs Engram?
P_{\text{MoE}}^{(\text{sparse})}
Parameters allocated to MoE expert networks. These are neural network weights that provide conditional computation (different experts activated for different inputs).
P_{\text{Engram}}
Parameters allocated to Engram embedding tables. These are static embeddings that provide conditional memory (different embeddings retrieved for different N-grams).

Key insight: \rho \cdot P_{\text{sparse}} + (1-\rho) \cdot P_{\text{sparse}} = P_{\text{sparse}}. This is a zero-sum allocation - parameters given to MoE are taken from Engram and vice versa. The question is finding the optimal balance.

  • \rho = 1: Pure MoE (all inactive params are experts) - maximum computational flexibility, no memory
  • \rho = 0: Pure Engram (all inactive params are memory) - maximum memory capacity, no expert diversity
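
A tiny worked example of this zero-sum split; the 23B sparse-parameter budget below is an illustrative number, not a figure from the paper.

```python
def split_sparse_params(p_sparse: float, rho: float) -> tuple[float, float]:
    """Split the inactive parameter budget between MoE experts and Engram memory."""
    return rho * p_sparse, (1.0 - rho) * p_sparse

# Illustrative 23B sparse budget at the reported ~78% sweet spot.
p_moe, p_engram = split_sparse_params(23e9, rho=0.78)
print(f"MoE experts: {p_moe / 1e9:.1f}B, Engram memory: {p_engram / 1e9:.1f}B")
# -> MoE experts: 17.9B, Engram memory: 5.1B
```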

The U-Shaped Discovery

Key Finding
The relationship between validation loss and allocation ratio \rho follows a U-shape. Neither pure MoE nor pure Engram is optimal. The sweet spot is around \rho \approx 75\%-80\%.

Intuition Behind the U-Shape

  • 100% to MoE (MoE-dominated): No dedicated memory for static patterns. The model wastes depth reconstructing them through computation.
  • 75-80% to MoE (optimal): MoE handles dynamic reasoning, Engram handles static patterns. Each does what it's best at.
  • 0% to MoE (Engram-dominated): Loses conditional computation capacity. Memory can't replace reasoning!

Infinite Memory Scaling

What if we relax memory constraints and scale Engram aggressively? The results show a log-linear relationship: doubling memory slots predictably reduces loss.

Practical Implication
Engram provides a predictable scaling knob. Want better performance? Add more memory slots. Unlike adding experts (which increases compute), adding memory is essentially "free" at inference.

Experimental Results

The paper validates Engram at scale with four models: Dense-4B, MoE-27B, Engram-27B, and Engram-40B. All use identical training data and activated parameters.

Main Results: Engram-27B vs MoE-27B

Fair Comparison
Engram-27B is derived from MoE-27B by reducing experts from 72 to 55 and reallocating freed parameters to a 5.7B Engram memory. Same total parameters, same FLOPs!

| Benchmark | MoE-27B | Engram-27B | Gain |
| --- | --- | --- | --- |
| MMLU | 57.4 | 60.4 | +3.0 |
| CMMLU | 57.9 | 61.9 | +4.0 |
| BBH (Reasoning) | 50.9 | 55.9 | +5.0 |
| ARC-Challenge | 70.1 | 73.8 | +3.7 |
| HumanEval (Code) | 37.8 | 40.8 | +3.0 |
| MATH | 28.3 | 30.7 | +2.4 |
| GSM8K | 58.4 | 60.6 | +2.2 |

Surprising Finding
Memory was expected to help knowledge tasks (MMLU, TriviaQA). But the biggest gains are in reasoning (BBH +5.0) and code/math! Why?

Long-Context Performance

By delegating local patterns to lookups, Engram frees attention to focus on global context. Results on 32k context:

  • Multi-Query NIAH: 97.0 vs 84.2 (+12.8!)
  • Variable Tracking: 89.0 vs 77.0 (+12.0)
  • LongPPL: Consistently lower perplexity across book, paper, and code domains

Mechanistic Analysis

Why does a memory module improve reasoning? The paper provides compelling mechanistic evidence.

Entity Resolution in Standard LLMs

When a standard LLM processes "Diana, Princess of Wales", each layer progressively builds understanding:

Layer 1-2: "Country in the United Kingdom" (Wales)
Layer 3: "Country in Europe" (Wales)
Layer 4: "Title held by female sovereigns" (generic Princess)
Layer 5: "Wife of Prince of Wales" (more specific)
Layer 6: "Diana, Princess of Wales (1961-1997)" (fully resolved!)

Six layers of computation just to recognize an entity! Engram can retrieve this in O(1).

LogitLens Analysis: Faster Prediction Convergence

Using LogitLens (projecting hidden states through the final LM head), we can measure how "ready" each layer's representation is for prediction via KL divergence from the final output.

Key Finding
Engram models show systematically smaller KL divergence, especially in early layers. The representation becomes "prediction-ready" much faster because Engram handles static pattern recognition instantly.
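
A sketch of the LogitLens measurement described above, written against plain tensors so it stays model-agnostic: hidden_states would come from a forward pass that returns per-layer hidden states, and final_norm / lm_head stand for the model's own last normalization and output projection (attribute names vary by architecture).

```python
import torch.nn.functional as F

def logitlens_kl(hidden_states, final_norm, lm_head):
    """Per-layer KL(final prediction || layer-l prediction), averaged over positions.

    hidden_states: list of (seq, d) tensors, one per layer, last entry = final layer.
    Smaller values mean the layer's representation is already "prediction-ready".
    """
    final_logp = F.log_softmax(lm_head(final_norm(hidden_states[-1])), dim=-1)
    kls = []
    for h in hidden_states[:-1]:
        layer_logp = F.log_softmax(lm_head(final_norm(h)), dim=-1)
        # kl_div(input=layer log-probs, target=final log-probs) = KL(final || layer)
        kls.append(F.kl_div(layer_logp, final_logp, log_target=True,
                            reduction="batchmean").item())
    return kls
```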

CKA Analysis: Effective Depth Increase

Centered Kernel Alignment (CKA) measures representational similarity between layers. The analysis reveals:

\text{Engram Layer 5} \approx \text{MoE Layer 12}

Engram's Layer 5 representations align with MoE's Layer 12! The model is effectively deeper because early layers don't waste capacity on memorization.
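
A minimal implementation of linear CKA, one standard way to compute this kind of layer-to-layer similarity; the random tensors in the usage line are placeholders, not the paper's activations.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear CKA between two activation matrices of shape (num_tokens, dim).
    Values near 1 indicate highly similar representations."""
    X = X - X.mean(dim=0, keepdim=True)     # center each feature over tokens
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2            # ||Y^T X||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

# e.g. compare one model's layer-5 activations against another model's layer-12 activations
similarity = linear_cka(torch.randn(512, 1024), torch.randn(512, 1024))
```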

Sensitivity Analysis: What Relies on Engram?

Ablating Engram at inference (zeroing out its output) reveals a sharp functional dichotomy; a minimal sketch of this ablation follows the two panels below:

Catastrophic Collapse

Factual Knowledge: TriviaQA retains only 29%, PopQA 44%. Engram is the primary knowledge repository.

Highly Resilient

Reading Comprehension: C3 retains 93%, RACE-Middle 89%. Context-grounded tasks rely on attention, not Engram.
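
A sketch of this inference-time ablation using a PyTorch forward hook. Here model.engram is a hypothetical attribute name standing in for wherever the Engram module lives in a given implementation.

```python
import torch

def ablate(module: torch.nn.Module):
    """Zero out a module's output at inference time (the ablation described above)."""
    def zero_output(mod, inputs, output):
        return torch.zeros_like(output)   # returned value replaces the module's output
    return module.register_forward_hook(zero_output)

# Usage (hypothetical attribute name):
#   handle = ablate(model.engram)
#   ... run the evaluation suite ...
#   handle.remove()   # restore normal behavior
```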

Gating Visualization

The gating mechanism activates selectively on multi-token patterns:

"Only Alexander the Great could tame the horse Bucephalus."

"By the way, I am a fan of the Milky Way."

"This study analyzes the media impact of Diana, Princess of Wales."

In the paper's visualization, red highlighting marks high gating activation; it concentrates on multi-token spans like "Alexander the Great", "by the way", "Milky Way", and "Diana, Princess of Wales", showing that Engram correctly identifies named entities and idioms.

Conclusion & Future Directions

Summary

Engram introduces conditional memory as a new primitive for LLM architecture, complementing the established conditional computation (MoE) paradigm. Key takeaways:

  • Memory lookup and neural computation are complementary, not competing
  • The optimal allocation follows a U-shaped law (~75-80% to MoE)
  • Memory benefits reasoning by freeing depth for complex operations
  • Deterministic addressing enables infrastructure-aware scaling

System Efficiency Highlight

100B Parameters, <3% Overhead
Unlike MoE's dynamic routing, Engram's deterministic hash-based retrieval enables prefetching. A 100B-parameter table offloaded to CPU memory adds only ~2.8% latency overhead. This means Engram can scale to massive sizes without GPU memory constraints!
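
A sketch of why deterministic addressing makes prefetching easy: the hash indices for the next position depend only on tokens that are already known, so the needed rows can be staged into pinned memory and copied to the GPU asynchronously while the current step computes. The pinned-buffer and side-stream pattern below is a generic overlap recipe with illustrative sizes, not the paper's system design.

```python
import torch

DIM = 128
cpu_table = torch.randn(2**20, DIM)           # table in CPU RAM (tiny vs. the 100B case above)
staging = torch.empty(64, DIM).pin_memory()   # small pinned buffer for async H2D copies
copy_stream = torch.cuda.Stream()

def prefetch(indices: torch.Tensor) -> torch.Tensor:
    """Gather rows for the *next* position and start copying them to the GPU.

    `indices` are hash slots computed one step ahead from already-known tokens
    (deterministic addressing), so this copy overlaps with the current decode step.
    """
    rows = torch.index_select(cpu_table, 0, indices, out=staging[: indices.numel()])
    with torch.cuda.stream(copy_stream):
        return rows.to("cuda", non_blocking=True)

def consume(prefetched: torch.Tensor) -> torch.Tensor:
    torch.cuda.current_stream().wait_stream(copy_stream)   # ensure the copy has landed
    return prefetched
```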

Future Directions

Higher-Order N-grams

Current work uses 2-3 grams. Larger scales may benefit from 4+ grams.

Dynamic Memory Updates

Current embeddings are static. Online learning could enable adaptation.

Multi-Modal Extension

Could image/audio patterns benefit from similar lookup mechanisms?

Trillion-Scale Memory

With NVMe SSDs in the hierarchy, could push to much larger tables.

The Big Picture

Engram represents a shift in how we think about LLM architecture. Instead of making neural networks do everything through computation, we can complement them with specialized primitives for different tasks. Conditional memory for knowledge retrieval is just the beginning - what other cognitive functions could benefit from dedicated architectural support?