Off-policy Methods with Approximation
Explore the treacherous territory where off-policy learning meets function approximation. Understand why things can go wrong — and how to fix them!
Learning from others' experience — carefully!
Off-policy learning lets us learn about one policy (the target) while following another (the behavior). Super powerful! But when we add function approximation, things can go dangerously wrong. Values can diverge to infinity! This chapter explains the "deadly triad" and introduces stable algorithms like Gradient-TD that guarantee convergence.
The Deadly Triad
Why TD can diverge
Gradient-TD
Stable off-policy learning
Emphatic-TD
Proper emphasis weighting
The Big Picture: The Challenge of Off-Policy + Approximation
Tabular Off-Policy
Q-learning with tables converges (given sufficient exploration and suitable step sizes). Each state-action entry is updated independently.
Chapter 6: Safe ✓
On-Policy + Approx
Semi-gradient Sarsa converges reliably. Same data distribution!
Chapter 10: Safe ✓
Off-Policy + Approx
DANGER! Standard TD can diverge to ±∞!
Chapter 11: Special care needed
Why care about off-policy? Because it's how we do experience replay, learn from demonstrations, or explore with one policy while improving another. Modern deep RL depends on it!
Semi-gradient Methods
Extending semi-gradient TD to off-policy learning.
The importance sampling trick!
When our behavior policy b chooses different actions than our target policy π, we need to reweight updates. If π would do an action more often than b, we boost the update; if less often, we shrink it. This correction is called importance sampling.
The per-step importance sampling ratio corrects for the difference between target and behavior policies:

ρt = π(At|St) / b(At|St)

What Each Term Means:
- π(At|St): how likely the target policy is to take the action we actually took
- b(At|St): how likely the behavior policy was to take it (must be > 0)
Understanding ρ Through Examples
π and b agree perfectly on this action. No correction needed.
π(A|S) = b(A|S)
π would do this 2× more often than b. Boost the update!
π(A|S) = 0.4, b(A|S) = 0.2
π would never take this action. Ignore the update completely!
π(A|S) = 0 (deterministic π)
The off-policy version of semi-gradient TD multiplies the update by the importance sampling ratio:

wt+1 = wt + α ρt δt ∇v̂(St, wt)

where δt = Rt+1 + γv̂(St+1, wt) − v̂(St, wt) is the TD error.
Why only ρt?
Because TD bootstraps! We only need to correct for the one action we took. The next state's value v̂(St+1) is already our estimate regardless of what action led there. This is simpler than Monte Carlo's product of all ρ's!
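As a concrete sketch (not the book's pseudocode — the function and variable names here are made up), one off-policy semi-gradient TD(0) step with linear features looks like this; with ρ = 1 it reduces to the on-policy update of Chapter 9:

```python
import numpy as np

def off_policy_td0_step(w, x, x_next, reward, gamma, pi_prob, b_prob, alpha):
    """One off-policy semi-gradient TD(0) update with linear features.

    w        : weight vector, shape (d,)
    x, x_next: feature vectors of S_t and S_{t+1}, shape (d,)
    pi_prob  : pi(A_t | S_t), target-policy probability of the action taken
    b_prob   : b(A_t | S_t),  behavior-policy probability (must be > 0)
    """
    rho = pi_prob / b_prob                        # importance sampling ratio
    delta = reward + gamma * w @ x_next - w @ x   # TD error; gradient of v-hat is just x
    return w + alpha * rho * delta * x            # whole update scaled by rho
```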
Action-Value Methods: Expected Sarsa & Q-learning
For action-value functions, we have more options for the target:
Semi-gradient Expected Sarsa
where δt = R + γ Σa π(a|S') q̂(S', a) − q̂(S, A)
No ρ needed — expectation covers all actions!
Semi-gradient Q-learning
where δt = R + γ maxa q̂(S', a) − q̂(S, A)
Also no ρ — target is greedy max!
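For the action-value case, a minimal sketch (assuming a linear q̂(s, a) = wᵀx(s, a) and helper names of my own choosing) of the two TD errors looks like this; either δ is then applied as w += α·δ·x(S, A):

```python
import numpy as np

def q_hat(w, x_sa):
    """Linear action-value estimate: q(s, a) = w . x(s, a)."""
    return w @ x_sa

def expected_sarsa_delta(w, x_sa, reward, gamma, next_features, pi_next):
    """Semi-gradient Expected Sarsa TD error.

    next_features : list of x(S', a) for every action a
    pi_next       : list of pi(a | S') in the same order
    """
    expected_next = sum(p * q_hat(w, xa) for p, xa in zip(pi_next, next_features))
    return reward + gamma * expected_next - q_hat(w, x_sa)

def q_learning_delta(w, x_sa, reward, gamma, next_features):
    """Semi-gradient Q-learning TD error: greedy (max) target, no rho needed."""
    max_next = max(q_hat(w, xa) for xa in next_features)
    return reward + gamma * max_next - q_hat(w, x_sa)
```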
Unlike tabular methods, semi-gradient off-policy TD with function approximation can diverge — weights can grow to infinity! This includes Q-learning with linear function approximation. The next sections explain why and what to do about it.
Examples of Off-policy Divergence
When semi-gradient methods go wrong.
When values explode!
Here's the scary part: even simple problems with linear function approximation can cause semi-gradient TD to diverge to ±∞. Baird's counterexample is the most famous demonstration. It's not a bug — it's a fundamental incompatibility between bootstrapping, off-policy updates, and function approximation.
A 7-state MDP with linear function approximation where semi-gradient TD(0) diverges:
The Setup
What Happens
True values: vπ(s) = 0 for all states (no rewards ever).
But with semi-gradient TD(0), the weights diverge to infinity!
The problem: function approximation links the states, so updating one affects others in ways that compound catastrophically under off-policy distribution.
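The divergence is easy to reproduce in a few lines. Below is a rough simulation sketch of Baird's counterexample (off-policy semi-gradient TD(0), γ = 0.99, α = 0.01, the textbook's initial weights); the exact numbers are illustrative, but the weights grow steadily even though every true value is zero:

```python
import numpy as np

# Features: states 0-5 have v_hat = 2*w[i] + w[7]; state 6 has v_hat = w[6] + 2*w[7].
X = np.zeros((7, 8))
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

gamma, alpha = 0.99, 0.01
w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])      # initial weights from the book
rng = np.random.default_rng(0)
state = rng.integers(7)

for t in range(1000):
    if rng.random() < 6 / 7:                         # behavior takes the "dashed" action
        next_state, rho = rng.integers(6), 0.0       # pi never takes it -> rho = 0
    else:                                            # behavior takes the "solid" action
        next_state, rho = 6, 7.0                     # pi(solid)=1, b(solid)=1/7 -> rho = 7
    delta = 0.0 + gamma * X[next_state] @ w - X[state] @ w   # all rewards are zero
    w += alpha * rho * delta * X[state]              # off-policy semi-gradient TD(0)
    state = next_state

print(np.abs(w).max())                               # keeps growing; true values are all 0
```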
Why Does This Happen?
Off-policy distribution mismatch
We update states 1-6 often (behavior policy), but π goes straight to state 7. The distribution we update from ≠ the distribution we care about.
Function approximation couples states
Linear features share weights, so updating v̂(s1) changes v̂(s7) too!
Bootstrapping creates feedback loops
We update towards v̂(S'), which is also wrong. Errors compound!
Even On-Policy Can Diverge!
Tsitsiklis and Van Roy (1997) gave an example showing that even on-policy semi-gradient TD can diverge with nonlinear function approximation. The problem isn't just off-policy.
Key insight: Divergence can occur whenever the update distribution doesn't match what the algorithm "expects." Off-policy makes this worse, but it's fundamentally about the interaction of bootstrapping + function approximation.
For the Curious Mind: Why Tabular Q-Learning Is Safe
Tabular Q-learning converges even off-policy because each state-action pair is updated independently. There's no coupling through shared weights. The "function approximation" in tabular methods is a lookup table where each entry is independent!
Fun fact: You can view tabular as linear function approximation with one-hot features. Each state gets its own dedicated weight, so there's no interference.
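A tiny sketch of that view: with one-hot features, a TD update touches exactly one weight, so no other state's estimate can be disturbed.

```python
import numpy as np

n_states, alpha, gamma = 5, 0.1, 0.9
w = np.zeros(n_states)                # one weight per state = a lookup table

def one_hot(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

s, s_next, reward = 2, 3, 1.0
delta = reward + gamma * w @ one_hot(s_next) - w @ one_hot(s)
w += alpha * delta * one_hot(s)       # only w[2] changes; every other entry is untouched
```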
The Deadly Triad
Three ingredients that together spell danger.
The three horsemen of instability!
The deadly triad names the three elements that, when combined, can cause instability and divergence. Remove any one, and things work fine. But all three together? Watch out! Understanding this helps us design stable algorithms.
Instability and divergence can occur when all three are present:
Function Approximation
Generalizing across states (neural nets, linear features, etc.)
Bootstrapping
Updating towards estimates that include existing estimates (TD, not MC)
Off-policy Training
Learning about policy π from data generated by different policy b
Remove One Element → Safety!
Tabular methods! Q-learning and Sarsa with tables all converge. Each state is independent.
Monte Carlo methods! Wait for complete returns. No dependence on current estimates. Gradient descent on true returns converges.
On-policy methods! Semi-gradient Sarsa, TD(λ) with same policy converge (with linear approximation at least).
Why Do We Want All Three?
Function Approximation
Essential for large state spaces — can't have a table for every pixel combination!
Bootstrapping
Much faster learning than MC. No need to wait for episode ends. Lower variance.
Off-policy
Experience replay, learning from demonstrations, parallel data collection, exploration!
The challenge: Modern deep RL (DQN, etc.) uses all three! So we need special techniques to make it work — target networks, experience replay sampling, clipping, and the methods in this chapter.
Divergence happens when updates under the behavior distribution systematically inflate the very estimates they bootstrap from, so each update makes the next TD error larger — a feedback loop that pushes the weights further and further away.
The next sections develop the math to understand this precisely and find algorithms that do converge!
Linear Value-function Geometry
Understanding value functions as vectors in high-dimensional space.
Value functions live in a geometric space!
Think of every possible value function as a point in a high-dimensional space (one dimension per state). Linear function approximation restricts us to a subspace — a flat sheet within this space. This geometric view helps us understand what TD is really doing!
A value function over |S| states can be written as a vector with one component per state:

v = [v(s1), v(s2), …, v(s|S|)]

The true value function vπ is one specific point in this space. Function approximation constrains us to a subset of this space.
The Linear Subspace
With linear function approximation v̂(s) = wᵀx(s), we can only represent value functions that are linear combinations of our features:

v̂w = Xw

Where:
- X is the |S| × d feature matrix (each row is the feature vector of one state)
- w is the d-dimensional weight vector
- The set of all achievable v̂w forms a d-dimensional subspace
The projection Π finds the closest point in our representable subspace:

Πv = v̂w where w = argminw ||v − v̂w||²μ

and ||·||μ is the norm weighted by the state distribution μ.
Intuition:
If you can't represent vπ exactly, find the best approximation you can represent. Like projecting a 3D point onto a 2D plane — you lose some information but get as close as possible.
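For intuition, here is a small numerical sketch of that projection, using the standard closed form Π = X(XᵀDX)⁻¹XᵀD with D = diag(μ) (the feature matrix and distribution below are made up for illustration, and X is assumed to have full column rank):

```python
import numpy as np

X = np.array([[1.0, 0.0],       # feature matrix: one row per state (3 states, 2 features)
              [0.0, 1.0],
              [1.0, 1.0]])
mu = np.array([0.5, 0.3, 0.2])  # state distribution defining the weighted norm
D = np.diag(mu)

# Projection onto the representable subspace {X w}, closest in the mu-weighted norm.
P = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D

v = np.array([2.0, -1.0, 5.0])  # an arbitrary value function (one entry per state)
print(P @ v)                    # its best representable approximation
```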
The Bellman Operator
The Bellman operator Bπ takes any value function and produces a "better" one:

(Bπv)(s) = Σa π(a|s) Σs',r p(s', r|s, a) [r + γ v(s')]
Fixed Point
vπ = Bπvπ
The true value function is unchanged by Bπ
Problem
Bπv̂w usually lands outside our subspace!
Can't represent Bπv̂ with our features
TD methods try to find the fixed point of the projected Bellman operator:

v̂w = ΠBπv̂w

Apply the Bellman operator, then project back into the representable subspace. The TD fixed point is where this cycle stabilizes.
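Written out for the linear case (using the same A and b notation as earlier chapters, and assuming A is invertible), the fixed point is the solution of a linear system:

```latex
% TD fixed point: the projected Bellman equation
%   \hat{v}_{w_{TD}} = \Pi B_\pi \hat{v}_{w_{TD}}
% which, for linear features, is equivalent to the linear system A w = b:
A = \mathbb{E}\big[\, x_t \, (x_t - \gamma x_{t+1})^\top \big], \qquad
b = \mathbb{E}\big[\, R_{t+1} \, x_t \big], \qquad
w_{TD} = A^{-1} b
```

The expectations are taken under the same state distribution μ that defines the projection; the off-policy trouble is precisely that sampling under b changes that distribution.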
Geometric Picture
Imagine a 2D subspace (blue plane) in 3D value-function space:
- vπ: true value (may be off the plane)
- Πvπ: projection onto plane (best we can do)
- TD fixed point: ΠBπ iteration converges here
On-policy: TD finds a point on the plane that's close to vπ (usually).
Off-policy: the updates (and hence the projection) are weighted by the behavior policy's state distribution, under which ΠBπ may not be a contraction! Iterates can spiral outward → divergence.
Gradient Descent in the Bellman Error
A natural idea that doesn't quite work.
What if we just minimize the Bellman error?
The TD error measures how far we are from satisfying the Bellman equation. Natural idea: do gradient descent on this error! This section shows why the obvious approach doesn't work, leading us to better solutions.
The Bellman error at state s measures the violation of the Bellman equation:

δ̄w(s) = Eπ[Rt+1 + γv̂(St+1, w) | St = s] − v̂(s, w)

This is the expected TD error from state s. If v̂ = vπ, it would be zero for all states.
The Mean Squared Bellman Error averages the squared Bellman error over the state distribution:

MSBE(w) = Σs μ(s) δ̄w(s)²

Minimizing MSBE seems natural — make the value function satisfy the Bellman equation everywhere!
The Problem: Double Sampling
To compute the gradient of MSBE, we need the product of two quantities:

∇MSBE(w) = 2 Σs μ(s) δ̄w(s) ∇δ̄w(s)

But δ̄w(s) involves an expectation over next states, and so does ∇δ̄w(s)!
We need two independent samples of the next state from the same (S, A) to estimate their product without bias. Usually we only have one!
Naive Residual Gradient: Biased!
If we naively use the same sample for both parts:
This is the residual gradient algorithm. It converges... but to the wrong answer! The bias from using one sample instead of two causes systematic error.
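To make the double-sampling issue concrete, here is a sketch (linear features, helper names of my own) of the two update rules. The naive residual-gradient version reuses one sampled next state in both factors; the unbiased version needs a second, independent next state from the same (S, A), which only a model or a deterministic environment can supply:

```python
import numpy as np

def residual_gradient_step(w, x, x_next, reward, gamma, alpha):
    """Naive residual-gradient update: the SAME x_next appears in the TD error
    and in its gradient, which biases the MSBE-gradient estimate whenever
    transitions are stochastic."""
    delta = reward + gamma * w @ x_next - w @ x
    grad_delta = gamma * x_next - x
    return w - alpha * delta * grad_delta          # gradient-descent step

def double_sampled_step(w, x, x_next_a, x_next_b, reward, gamma, alpha):
    """Unbiased MSBE-gradient estimate: two INDEPENDENT next-state samples
    from the same (S, A) -- one for the error, one for its gradient."""
    delta = reward + gamma * w @ x_next_a - w @ x
    grad_delta = gamma * x_next_b - x
    return w - alpha * delta * grad_delta
```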
When Double Sampling Is Possible
Deterministic Environment
If S' is deterministic given (S, A), every sample of the next state is identical, so one sample is as good as two — no bias.
Model Available
If we have a model, we can sample two independent next states from (S, A).
With double sampling, minimizing MSBE is valid. But there's another issue...
The Bellman Error is Not Learnable
A deeper problem with MSBE as an objective.
MSBE is fundamentally flawed!
Even if we could minimize MSBE perfectly, it turns out to be the wrong objective. Different MDPs can produce identical data but have different optimal MSBE solutions! The true Bellman error is "not learnable" from data alone.
Consider two different MDPs that produce exactly the same distribution of observable data:
MDP A: two distinct states that look identical to the agent (they share the same feature representation) but have different transition dynamics
MDP B: a single state standing in for both, with the averaged dynamics
The streams of features, actions, and rewards they generate are indistinguishable, yet their Bellman errors differ, and different value functions minimize MSBE in each.
Same data, but different Bellman errors! We can't tell them apart from experience alone.
Observational Equivalence
The Bellman error depends on the transition dynamics of the MDP, not just the data distribution. But from finite samples, we can't distinguish:
What we observe: Sequences of (S, A, R, S') tuples
What BE needs: The full transition distribution p(s'|s,a)
This means MSBE is not a valid learning objective for model-free RL from experience!
Instead of MSBE, we should minimize the Mean Squared Projected Bellman Error (MSPBE):

MSPBE(w) = ||Π δ̄w||²μ

The projection Π makes this learnable! We don't need the full dynamics — just the projected error, which depends only on observable quantities.
Why MSPBE Works
- Learnable: Can be estimated from samples without double sampling
- Same fixed point: MSPBE = 0 at exactly the TD fixed point
- Unique minimum: Convex in the linear case, guaranteed convergence
For the Curious Mind: Why the Projection Helps
The projection operator Π is defined by our features and state distribution. It turns the unknowable "true Bellman error" into something that depends only on expectations we can estimate from data.
Mathematically: ΠBv depends on E[x(S)x(S)ᵀ] and E[x(S)(R + γv(S'))], both of which are expectations over observable quantities!
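In symbols, one standard way of writing this (with δt the TD error; under off-policy sampling from b, the δt xt terms additionally pick up the ratio ρt):

```latex
\overline{\mathrm{PBE}}(w)
  \;=\; \lVert \Pi \, \bar{\delta}_w \rVert_{\mu}^{2}
  \;=\; \mathbb{E}\!\left[\delta_t\, x_t\right]^{\top}
        \mathbb{E}\!\left[x_t x_t^{\top}\right]^{-1}
        \mathbb{E}\!\left[\delta_t\, x_t\right]
```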
Gradient-TD Methods
Stable algorithms that minimize MSPBE.
Finally — algorithms that work!
Gradient-TD methods do true stochastic gradient descent on MSPBE. They're guaranteed to converge even with off-policy data and function approximation! The key insight: use a second set of weights to estimate intermediate quantities.
Gradient-TD methods minimize the Mean Squared Projected Bellman Error (MSPBE) defined in the previous section. Its gradient looks scary, but it can be estimated from samples using a clever trick.
GTD2 Algorithm
A gradient-TD method that learns an auxiliary weight vector v alongside the main weights w.
On every step it updates both the main weights w and the auxiliary weights v — the two update rules are written out below.
What does v do?
v estimates the "direction of the Bellman error" in feature space. It's like computing an expected TD error, but projected onto our features. This lets us compute the MSPBE gradient without matrix inversions!
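For reference, the linear GTD2 updates in their standard form (α and β are the two step sizes, xt the feature vector of St):

```latex
\begin{aligned}
\delta_t &= R_{t+1} + \gamma\, w_t^{\top} x_{t+1} - w_t^{\top} x_t \\
w_{t+1}  &= w_t + \alpha\, \rho_t \,(x_t - \gamma x_{t+1})\,(x_t^{\top} v_t) \\
v_{t+1}  &= v_t + \beta\, \rho_t \,(\delta_t - v_t^{\top} x_t)\, x_t
\end{aligned}
```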
TDC (TD with Gradient Correction)
A closely related algorithm, derived from the same MSPBE objective, whose update looks more like standard TD:
Weight update:

wt+1 = wt + α ρt (δt xt − γ xt+1 (xtᵀ vt))

(the auxiliary weights v are updated exactly as in GTD2)
Intuition:
The first term (δtxt) is regular semi-gradient TD! The second term is a correction that fixes the bias from off-policy bootstrapping. Hence "TD with Gradient Correction".
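A minimal sketch of one TDC step with linear features (the function and variable names are mine, not the book's pseudocode):

```python
import numpy as np

def tdc_step(w, v, x, x_next, reward, gamma, rho, alpha, beta):
    """One TDC (TD with gradient correction) step for linear features.

    w     : main weights (the value estimate),        shape (d,)
    v     : auxiliary weights (expected-error model), shape (d,)
    rho   : importance sampling ratio pi(A|S) / b(A|S)
    alpha : step size for w; beta : step size for v (typically beta > alpha)
    """
    delta = reward + gamma * w @ x_next - w @ x             # ordinary TD error
    w_new = w + alpha * rho * (delta * x                    # the usual semi-gradient term ...
                               - gamma * x_next * (x @ v))  # ... plus the gradient correction
    v_new = v + beta * rho * (delta - x @ v) * x            # v tracks the expected TD error
    return w_new, v_new
```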
Semi-gradient TD vs Gradient-TD
| Property | Semi-gradient TD | Gradient-TD |
|---|---|---|
| Off-policy convergence | Can diverge | Guaranteed |
| Computation | O(d) | O(d) with extra v vector |
| Hyperparameters | One (α) | Two (α, β) |
| Objective minimized | None (semi-gradient, not a true gradient) | MSPBE |
GTD2 and TDC have natural extensions to action values q(s, a). GQ(λ) is the eligibility-trace version for control. These are the foundations for stable off-policy deep RL!
Emphatic-TD Methods
Reweighting updates to fix the distribution mismatch.
What if we could fix the distribution?
Off-policy divergence happens because we update states according to the behavior distribution, but need estimates for the target distribution. Emphatic-TD reweights updates to simulate the correct distribution, making even semi-gradient TD convergent!
Give more weight to states that would be visited more often under the target policy:

wt+1 = wt + α Mt ρt δt ∇v̂(St, wt)

The new term Mt is the emphasis, which accumulates the importance of updating at the current state.
The emphasis Mt follows a recursive update:

Mt = γ ρt−1 Mt−1 + It

What Each Term Means:
- It: the interest — how much we care about accurately valuing state St (often just 1 everywhere)
- ρt−1: the previous step's importance sampling ratio
- γ: the discount, which decays emphasis inherited from earlier states
Intuition: Accumulated Importance
Mt tracks how "responsible" the target policy is for reaching state St:
High Mt
Target policy π would visit this state often. Give updates more weight!
Low Mt
Target policy rarely visits here (we got here via behavior policy). Downweight to avoid distorting estimates!
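A sketch of one-step Emphatic TD with linear features, following the recursion above (the interest is supplied by the user, often just 1 for every state; names are mine):

```python
import numpy as np

def emphatic_td0_step(w, M_prev, rho_prev, x, x_next, reward,
                      gamma, rho, interest, alpha):
    """One step of one-step Emphatic TD with linear features.

    M_prev   : emphasis carried over from the previous step (start it at 0)
    rho_prev : previous step's importance sampling ratio (start it at 1)
    interest : I_t, how much we care about valuing the current state
    """
    M = gamma * rho_prev * M_prev + interest        # emphasis recursion
    delta = reward + gamma * w @ x_next - w @ x     # TD error
    w_new = w + alpha * M * rho * delta * x         # emphasis-weighted update
    return w_new, M
```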
Emphatic TD(λ) with linear function approximation converges to the TD fixed point under off-policy training, assuming:
- Behavior policy b covers target π (b(a|s) > 0 wherever π(a|s) > 0)
- Step size α decreases appropriately
- Interest It is bounded
The Variance Trade-off
Emphatic weighting can have high variance:
- Mt accumulates products of importance ratios, which can grow large
- Large Mt means large updates → high variance in learning
- May need smaller step sizes or variance-reduction techniques
Reducing Variance
Making off-policy learning practical.
Making importance sampling practical!
Off-policy methods multiply by importance ratios ρ, which can be huge. This causes high variance that slows learning. This section covers techniques to reduce variance while keeping the algorithms correct.
Products of importance ratios can explode:

ρt:T = π(At|St)/b(At|St) × π(At+1|St+1)/b(At+1|St+1) × … × π(AT|ST)/b(AT|ST)

If each ratio ≈ 2, then over 10 steps ρt:T ≈ 2¹⁰ = 1024! Updates become extremely noisy.
Variance Reduction Techniques
Truncated Importance Sampling
Clip ρ to a maximum value (like 1.0). Introduces bias but dramatically reduces variance.
Weighted Importance Sampling
Normalize by the sum of importance ratios. Biased but bounded variance. No explosive growth!
Control Variates
Subtract a baseline that has the same expectation but lower variance. Classic variance reduction from statistics!
Adaptive Methods
Adaptively cut or clip the off-policy corrections based on the data. Tree Backup, Retrace(λ), and V-trace from deep RL.
Modern Solutions: Retrace(λ) and V-trace
These methods adaptively truncate importance ratios to balance bias-variance:
Retrace(λ)
Uses truncated ratios cs = λ min(1, ρs). Safe off-policy learning with eligibility traces.
V-trace (IMPALA)
Clips both ρ and c separately. Powers distributed training in IMPALA with highly off-policy data.
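As a rough illustration of the idea (not the exact published algorithms), clipping the per-step ratios keeps the corrections bounded; the thresholds ρ̄, c̄ and the decay λ below follow the spirit of V-trace and Retrace(λ):

```python
import numpy as np

def clipped_ratios(pi_probs, b_probs, rho_bar=1.0, c_bar=1.0, lam=1.0):
    """Truncated per-step importance ratios in the spirit of Retrace / V-trace.

    pi_probs, b_probs : probabilities of the actions actually taken, under pi and b
    rho_bar           : clip level for the ratio weighting each TD error
    c_bar, lam        : clip level / decay for the trace that propagates errors back
    """
    rho = np.asarray(pi_probs) / np.asarray(b_probs)       # raw ratios (can be huge)
    return np.minimum(rho_bar, rho), lam * np.minimum(c_bar, rho)

# Ten steps with pi = 0.4 and b = 0.2: the raw product is 2^10 = 1024,
# but the clipped trace product stays at 1.
rho_c, c = clipped_ratios([0.4] * 10, [0.2] * 10)
print(np.prod([0.4 / 0.2] * 10), np.prod(c))
```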
In practice, some bias is often acceptable in exchange for a massive variance reduction. Distributed deep RL systems such as IMPALA rely on truncated importance sampling (V-trace). The theory says "biased," but in practice it works very well!
Summary
Key takeaways from off-policy learning with approximation.
Chapter 11 Key Takeaways
The Deadly Triad causes instability
Function approximation + bootstrapping + off-policy = potential divergence. Remove any one element and you're safe.
MSBE is not learnable
The true Bellman error depends on transition dynamics, not just data. Use MSPBE instead!
Gradient-TD methods are stable
GTD2 and TDC do true gradient descent on MSPBE, guaranteeing convergence even off-policy.
Emphatic-TD reweights updates
By emphasizing states according to target policy visitation, even semi-gradient TD becomes convergent.
Variance is the practical bottleneck
Importance sampling products can explode. Truncation, weighted IS, and adaptive methods (V-trace, Retrace) make off-policy practical.
Formula Cheat Sheet
Importance Sampling Ratio
ρt = π(At|St) / b(At|St)
Semi-gradient Off-policy TD
wt+1 = wt + α ρt δt ∇v̂(St, wt)
TDC Update
wt+1 = wt + α ρt (δt xt − γ xt+1 (xtᵀ vt)), vt+1 = vt + β ρt (δt − xtᵀvt) xt
Emphatic Weighting
Mt = γ ρt−1 Mt−1 + It, with update wt+1 = wt + α Mt ρt δt ∇v̂(St, wt)
Quick Glossary
Importance Sampling
Reweighting samples from one distribution to estimate expectations under another
Deadly Triad
Function approximation + bootstrapping + off-policy training
MSPBE
Mean Squared Projected Bellman Error — the right objective for TD
Gradient-TD
True gradient descent methods (GTD2, TDC) that converge off-policy
Emphatic-TD
Methods that reweight updates by accumulated importance
Baird's Counterexample
Famous MDP showing semi-gradient TD divergence
Common Mistakes to Avoid
Assuming Q-learning always works: With function approximation, even Q-learning can diverge! Need special care.
Ignoring importance sampling variance: Products of ρ can explode. Always consider truncation or weighted IS in practice.
Confusing MSBE and MSPBE: MSBE is not learnable from data! MSPBE is the correct objective for model-free learning.
Forgetting the auxiliary vector: Gradient-TD methods need both w and v. Don't skip the v updates!