Chapter 11

Off-policy Methods with Approximation

Explore the treacherous territory where off-policy learning meets function approximation. Understand why things can go wrong — and how to fix them!

Learning from others' experience — carefully!

Off-policy learning lets us learn about one policy (the target) while following another (the behavior). Super powerful! But when we add function approximation, things can go dangerously wrong. Values can diverge to infinity! This chapter explains the "deadly triad" and introduces stable algorithms like Gradient-TD that guarantee convergence.

  • The Deadly Triad: why TD can diverge
  • Gradient-TD: stable off-policy learning
  • Emphatic-TD: proper emphasis weighting

The Big Picture: The Challenge of Off-Policy + Approximation

  • Tabular off-policy (Chapter 6: Safe ✓). Q-learning with tables converges under the usual step-size and exploration conditions; each state-action entry is updated independently.
  • On-policy + approximation (Chapter 10: Safe ✓). Semi-gradient Sarsa converges reliably: the updates come from the same data distribution the policy itself generates.
  • Off-policy + approximation (Chapter 11: Special care needed). DANGER! Standard TD can diverge to ±∞.

Why care about off-policy? Because it's how we do experience replay, learn from demonstrations, or explore with one policy while improving another. Modern deep RL depends on it!

Section 11.1

Semi-gradient Methods

Extending semi-gradient TD to off-policy learning.

The importance sampling trick!

When our behavior policy b chooses different actions than our target policy π, we need to reweight updates. If π would do an action more often than b, we boost the update; if less often, we shrink it. This correction is called importance sampling.

Importance Sampling Ratio

The per-step importance sampling ratio corrects for the difference between target and behavior policies:

\rho_t \doteq \frac{\pi(A_t | S_t)}{b(A_t | S_t)}

What Each Term Means:

π(A|S) = probability the target policy π would take action A in state S
b(A|S) = probability the behavior policy b takes action A in state S
ρt = how much more (or less) likely π is to do what b did

Understanding ρ Through Examples

  • ρ = 1: π and b agree perfectly on this action (π(A|S) = b(A|S)). No correction needed.
  • ρ = 2: π would do this 2× more often than b (π(A|S) = 0.4, b(A|S) = 0.2). Boost the update!
  • ρ = 0: π would never take this action (π(A|S) = 0 for a deterministic π). Ignore the update completely!
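To sanity-check these numbers, here is a minimal sketch in plain Python; the policy probabilities are illustrative, not taken from any particular MDP:

```python
def importance_ratio(pi_prob, b_prob):
    """rho_t = pi(A_t|S_t) / b(A_t|S_t); requires coverage, i.e. b(A|S) > 0."""
    assert b_prob > 0, "behavior policy must give nonzero probability to the action taken"
    return pi_prob / b_prob

print(importance_ratio(0.3, 0.3))  # 1.0 -> pi and b agree, no correction
print(importance_ratio(0.4, 0.2))  # 2.0 -> pi prefers this action, boost the update
print(importance_ratio(0.0, 0.5))  # 0.0 -> pi would never do this, the update is zeroed out
```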

Semi-gradient Off-policy TD(0)

The off-policy version of semi-gradient TD multiplies by the importance sampling ratio:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)

where δt = Rt+1 + γv̂(St+1, w) − v̂(St, w) is the TD error.

Why only ρt?

Because TD bootstraps! We only need to correct for the one action we took. The next state's value v̂(St+1) is already our estimate regardless of what action led there. This is simpler than Monte Carlo's product of all ρ's!
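With linear features the whole update is a few lines. A minimal sketch, assuming w and the feature vectors are NumPy arrays and that ρ has already been computed for the sampled action:

```python
def offpolicy_td0_update(w, x, x_next, reward, rho, alpha=0.01, gamma=0.99, done=False):
    """One semi-gradient off-policy TD(0) step with linear v_hat(s) = w @ x(s).

    w, x, x_next are NumPy arrays; rho is pi(A|S) / b(A|S) for the action actually taken.
    """
    v_next = 0.0 if done else w @ x_next
    delta = reward + gamma * v_next - (w @ x)   # TD error
    return w + alpha * rho * delta * x          # gradient of linear v_hat w.r.t. w is just x
```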

Action-Value Methods: Expected Sarsa & Q-learning

For action-value functions, we have more options for the target:

Semi-gradient Expected Sarsa
\mathbf{w} \leftarrow \mathbf{w} + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w})

where δt = R + γ Σa π(a|S') q̂(S', a) − q̂(S, A)

No ρ needed — expectation covers all actions!

Semi-gradient Q-learning
\mathbf{w} \leftarrow \mathbf{w} + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w})

where δt = R + γ maxa q̂(S', a) − q̂(S, A)

Also no ρ — target is greedy max!
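Here is a sketch of the semi-gradient Q-learning update with linear action values; the feature function x(s, a) and the list of next-state feature vectors are assumed inputs, not anything defined in the text:

```python
def semi_gradient_q_learning_update(w, x_sa, next_action_features, reward,
                                    alpha=0.01, gamma=0.99, done=False):
    """One semi-gradient Q-learning step with linear q_hat(s, a) = w @ x(s, a).

    x_sa                 -- feature vector (NumPy array) for the (S, A) actually taken
    next_action_features -- list of feature vectors x(S', a') for every action a' in S'
    """
    max_next = 0.0 if done else max(w @ x for x in next_action_features)  # greedy target
    delta = reward + gamma * max_next - (w @ x_sa)
    return w + alpha * delta * x_sa   # no importance ratio: the target is the greedy max
```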

Convergence Not Guaranteed!

Unlike tabular methods, semi-gradient off-policy TD with function approximation can diverge — weights can grow to infinity! This includes Q-learning with linear function approximation. The next sections explain why and what to do about it.

Section 11.2

Examples of Off-policy Divergence

When semi-gradient methods go wrong.

When values explode!

Here's the scary part: even simple problems with linear function approximation can cause semi-gradient TD to diverge to ±∞. Baird's counterexample is the most famous demonstration. It's not a bug — it's a fundamental incompatibility between bootstrapping, off-policy updates, and function approximation.

Baird's Counterexample

A 7-state MDP with linear function approximation where semi-gradient TD(0) diverges:

The Setup
States: 7 states, with 6 "top" states and 1 "bottom" state (state 7)
Actions: dashed (to one of the six top states, chosen uniformly) or solid (to bottom state 7)
Behavior b: takes dashed with probability 6/7 and solid with probability 1/7
Target π: always takes the solid action (always goes to state 7)
Rewards: all zero! γ = 0.99
What Happens

True values: vπ(s) = 0 for all states (no rewards ever).

But with semi-gradient TD(0), the weights diverge to infinity!

The problem: function approximation links the states, so updating one affects others in ways that compound catastrophically under off-policy distribution.
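The divergence is easy to reproduce. Below is a minimal simulation of Baird's counterexample with semi-gradient off-policy TD(0). The features, the 6/7 vs. 1/7 behavior policy, γ = 0.99, α = 0.01, and the initial weights follow the standard presentation of this counterexample, but treat it as a sketch rather than a definitive reference implementation:

```python
import numpy as np

# Baird's counterexample: 7 states, linear features, all rewards zero, gamma = 0.99.
# v_hat(s_i) = 2*w_i + w_8 for the six top states; v_hat(s_7) = w_7 + 2*w_8 (8 weights total).
X = np.zeros((7, 8))
for i in range(6):
    X[i, i] = 2.0
    X[i, 7] = 1.0
X[6, 6] = 1.0
X[6, 7] = 2.0

gamma, alpha = 0.99, 0.01
w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])   # customary initial weights
rng = np.random.default_rng(0)

state = 6
for step in range(1000):
    # Behavior policy: dashed with prob 6/7 (to a random top state), solid with prob 1/7.
    if rng.random() < 6 / 7:
        next_state, rho = rng.integers(6), 0.0    # dashed: pi never takes it, rho = 0
    else:
        next_state, rho = 6, 7.0                   # solid: rho = pi/b = 1 / (1/7) = 7
    x, x_next = X[state], X[next_state]
    delta = 0.0 + gamma * (w @ x_next) - (w @ x)   # all rewards are zero
    w = w + alpha * rho * delta * x                # semi-gradient off-policy TD(0)
    state = next_state

print(np.linalg.norm(w))   # the weight magnitudes keep growing as training continues
```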

Why Does This Happen?

1. Off-policy distribution mismatch. We update states 1-6 often (behavior policy), but π goes straight to state 7. The distribution we update from ≠ the distribution we care about.

2. Function approximation couples states. Linear features share weights, so updating v̂(s1) changes v̂(s7) too!

3. Bootstrapping creates feedback loops. We update towards v̂(S'), which is also wrong. Errors compound!

Even On-Policy Can Diverge!

Tsitsiklis and Van Roy (1997) showed divergence even for on-policy semi-gradient TD with nonlinear function approximation (neural networks)! The problem isn't just off-policy.

Key insight: Divergence can occur whenever the update distribution doesn't match what the algorithm "expects." Off-policy makes this worse, but it's fundamentally about the interaction of bootstrapping + function approximation.

For the Curious Mind: Why Tabular Q-Learning Is Safe

Tabular Q-learning converges even off-policy because each state-action pair is updated independently. There's no coupling through shared weights. The "function approximation" in tabular methods is a lookup table where each entry is independent!

Fun fact: You can view tabular as linear function approximation with one-hot features. Each state gets its own dedicated weight, so there's no interference.

Section 11.3

The Deadly Triad

Three ingredients that together spell danger.

The three horsemen of instability!

The deadly triad names the three elements that, when combined, can cause instability and divergence. Remove any one, and things work fine. But all three together? Watch out! Understanding this helps us design stable algorithms.

The Deadly Triad

Instability and divergence can occur when all three are present:

Function Approximation

Generalizing across states (neural nets, linear features, etc.)

Bootstrapping

Updating towards estimates that include existing estimates (TD, not MC)

Off-policy Training

Learning about policy π from data generated by different policy b

Remove One Element → Safety!

No Function Approximation

Tabular methods! Q-learning, SARSA with tables all converge. Each state is independent.

No Bootstrapping

Monte Carlo methods! Wait for complete returns. No dependence on current estimates. Gradient descent on true returns converges.

No Off-policy

On-policy methods! Semi-gradient Sarsa, TD(λ) with same policy converge (with linear approximation at least).

Why Do We Want All Three?

Function Approximation

Essential for large state spaces — can't have a table for every pixel combination!

Bootstrapping

Much faster learning than MC. No need to wait for episode ends. Lower variance.

Off-policy

Experience replay, learning from demonstrations, parallel data collection, exploration!

The challenge: Modern deep RL (DQN, etc.) uses all three! So we need special techniques to make it work — target networks, experience replay sampling, clipping, and the methods in this chapter.

The Geometry of Divergence

Divergence happens when updates systematically push weights in directions that increase errors on states we don't visit (under behavior policy), creating a feedback loop. The "hill" we're climbing gets steeper and steeper.

The next sections develop the math to understand this precisely and find algorithms that do converge!

Section 11.4

Linear Value-function Geometry

Understanding value functions as vectors in high-dimensional space.

Value functions live in a geometric space!

Think of every possible value function as a point in a high-dimensional space (one dimension per state). Linear function approximation restricts us to a subspace — a flat sheet within this space. This geometric view helps us understand what TD is really doing!

Value Functions as Vectors

A value function over |S| states can be written as a vector:

v = [v(s_1), v(s_2), \ldots, v(s_{|S|})]^\top \in \mathbb{R}^{|S|}

The true value function vπ is one specific point. Function approximation constrains us to a subset of this space.

The Linear Subspace

With linear function approximation v̂(s) = wTx(s), we can only represent value functions that are linear combinations of our features.

\hat{v}_\mathbf{w} = \mathbf{X}\mathbf{w}

Where:

  • X is the |S| × d feature matrix (each row is features for one state)
  • w is the d-dimensional weight vector
  • The set of all achievable v̂w forms a d-dimensional subspace

Projection Operator

The projection Π finds the closest point in our representable subspace:

\Pi v = \arg\min_{v' \in \text{subspace}} \| v - v' \|^2_\mu

where ||·||μ is the norm weighted by state distribution μ.

Intuition:

If you can't represent vπ exactly, find the best approximation you can represent. Like projecting a 3D point onto a 2D plane — you lose some information but get as close as possible.

The Bellman Operator

The Bellman operator Bπ takes any value function and produces a "better" one:

(B_\pi v)(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) [r + \gamma v(s')]
Fixed Point

vπ = Bπvπ

The true value function is unchanged by Bπ

Problem

Bπv̂w usually lands outside our subspace!

Can't represent Bπv̂ with our features

The Projected Bellman Equation

TD methods try to find the fixed point of the projected Bellman operator:

\hat{v}_\mathbf{w} = \Pi B_\pi \hat{v}_\mathbf{w}

Apply Bellman, then project back to representable subspace. The TD fixed point is where this cycle stabilizes.
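To make the projected Bellman equation concrete, here is a sketch that computes the linear TD fixed point for a small, made-up Markov reward process (the transition matrix, rewards, and features are illustrative assumptions, and the on-policy distribution μ is used for the projection for simplicity). For linear features the fixed point solves A w = b with A = Xᵀ D (I - γP) X and b = Xᵀ D r:

```python
import numpy as np

# A tiny, made-up 3-state Markov reward process (illustrative numbers only).
P = np.array([[0.0, 1.0, 0.0],     # transition matrix under the target policy
              [0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0]])
r = np.array([0.0, 0.0, 1.0])      # expected reward on leaving each state
gamma = 0.9

X = np.array([[1.0, 0.0],          # 2 features for 3 states: a genuine approximation
              [1.0, 1.0],
              [0.0, 1.0]])

# Stationary distribution mu of P (the state weighting used in the projection).
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()
D = np.diag(mu)

# Linear TD fixed point: the projected Bellman equation v_w = Pi B_pi v_w
# reduces to the linear system A w = b for linear features.
A = X.T @ D @ (np.eye(3) - gamma * P) @ X
b = X.T @ D @ r
w_td = np.linalg.solve(A, b)

v_true = np.linalg.solve(np.eye(3) - gamma * P, r)   # exact v_pi for comparison
print("TD fixed point values:", X @ w_td)
print("True values:          ", v_true)
```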

Geometric Picture

Imagine a 2D subspace (blue plane) in 3D value-function space:

  • vπ: true value (may be off the plane)
  • Πvπ: projection onto plane (best we can do)
  • TD fixed point: ΠBπ iteration converges here

On-policy: TD finds a point on the plane that's close to vπ (usually).

Off-policy: The projection weights errors by the behavior policy's state distribution μ, and ΠBπ may no longer be a contraction. Iterates can spiral outward → divergence.

Section 11.5

Gradient Descent in the Bellman Error

A natural idea that doesn't quite work.

What if we just minimize the Bellman error?

The TD error measures how far we are from satisfying the Bellman equation. Natural idea: do gradient descent on this error! This section shows why the obvious approach doesn't work, leading us to better solutions.

Bellman Error

The Bellman error at state s measures violation of the Bellman equation:

\bar{\delta}_\mathbf{w}(s) = \mathbb{E}_\pi \big[ R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \mid S_t = s \big]

This is the expected TD error from state s. If v̂ = vπ, this would be zero for all states.

Mean Squared Bellman Error (MSBE)

Average the squared Bellman error over the state distribution:

\overline{\text{BE}}(\mathbf{w}) = \sum_s \mu(s) \bar{\delta}_\mathbf{w}(s)^2

Minimizing MSBE seems natural — make the value function satisfy Bellman everywhere!

The Problem: Double Sampling

To compute the gradient of MSBE, we need:

\nabla \overline{\text{BE}}(\mathbf{w}) = 2 \sum_s \mu(s) \bar{\delta}_\mathbf{w}(s) \nabla \bar{\delta}_\mathbf{w}(s)

But δ̄w(s) involves an expectation over next states, and so does ∇δ̄w(s)!

We need two independent samples of the next state from the same (s, a) to get an unbiased estimate. Usually we only have one!

Naive Residual Gradient: Biased!

If we naively use the same sample for both parts:

\mathbf{w} \leftarrow \mathbf{w} + \alpha \delta_t \big( \nabla \hat{v}(S_t) - \gamma \nabla \hat{v}(S_{t+1}) \big)

This is the naive residual-gradient algorithm. It converges reliably, but it minimizes the mean squared TD error rather than the Bellman error, so it generally settles on the wrong values. The bias comes from using one sample where two independent ones are needed.
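With linear features the naive residual-gradient step is tiny; a sketch, again assuming NumPy arrays for the weights and feature vectors:

```python
def naive_residual_gradient_update(w, x, x_next, reward, alpha=0.01, gamma=0.99):
    """Naive residual-gradient step: descends the squared TD error, reusing the same
    sampled next state in both delta and its gradient (this reuse is the source of the bias).
    With linear v_hat the gradients are simply the feature vectors."""
    delta = reward + gamma * (w @ x_next) - (w @ x)
    return w + alpha * delta * (x - gamma * x_next)
```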

When Double Sampling Is Possible

Deterministic Environment

If S' is deterministic given (S, A), the two "independent" samples would always be identical anyway, so a single sample is unbiased.

Model Available

If we have a model, we can sample two independent next states from (S, A).

With double sampling, minimizing MSBE is valid. But there's another issue...

Section 11.6

The Bellman Error is Not Learnable

A deeper problem with MSBE as an objective.

MSBE is fundamentally flawed!

Even if we could minimize MSBE perfectly, it turns out to be the wrong objective. Different MDPs can produce identical data but have different optimal MSBE solutions! The true Bellman error is "not learnable" from data alone.

The Learnability Problem

Consider two different MDPs that produce exactly the same data distribution:

MDP A: two distinct states happen to share the same feature representation; each has its own deterministic transition and reward.

MDP B: a single state with that same feature representation, whose transitions and rewards are stochastic and match MDP A's overall statistics.

Both generate exactly the same distribution of observable data (features, actions, rewards), yet their Bellman errors differ, and so do the weights that minimize MSBE. We can't tell them apart from experience alone!

Observational Equivalence

The Bellman error depends on the transition dynamics of the MDP, not just the data distribution. But from finite samples, we can't distinguish:

What we observe: Sequences of (S, A, R, S') tuples

What BE needs: The full transition distribution p(s'|s,a)

This means MSBE is not a valid learning objective for model-free RL from experience!

The Solution: Projected Bellman Error (PBE)

Instead of MSBE, we should minimize the Mean Squared Projected Bellman Error (MSPBE):

\overline{\text{PBE}}(\mathbf{w}) = \| \Pi B_\pi \hat{v}_\mathbf{w} - \hat{v}_\mathbf{w} \|^2_\mu

The projection Π makes this learnable! We don't need full dynamics — just the projected version, which depends only on observable quantities.

Why MSPBE Works

  • Learnable: Can be estimated from samples without double sampling
  • Same fixed point: MSPBE = 0 at exactly the TD fixed point
  • Unique minimum: Convex in the linear case, guaranteed convergence

For the Curious Mind: Why the Projection Helps

The projection operator Π is defined by our features and state distribution. It turns the unknowable "true Bellman error" into something that depends only on expectations we can estimate from data.

Mathematically: ΠBv depends on E[x(S)x(S)T] and E[x(S)(R + γv(S'))], both of which are expectations over observable quantities!
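To see why this is estimable, note that with linear features the MSPBE can be written as E[ρ δ x]ᵀ E[x xᵀ]⁻¹ E[ρ δ x] (with ρ = 1 on-policy), and both expectations can be approximated by sample averages. A sketch, assuming a list of logged transitions (x, reward, x_next, ρ) gathered under the behavior policy:

```python
import numpy as np

def mspbe_estimate(w, transitions, gamma=0.99):
    """Estimate the MSPBE from logged transitions (x, reward, x_next, rho).

    Uses the identity MSPBE(w) = E[rho*delta*x]^T E[x x^T]^{-1} E[rho*delta*x],
    where every expectation is over the observed (behavior) distribution.
    """
    d = len(w)
    C = np.zeros((d, d))   # running estimate of E[x x^T]
    g = np.zeros(d)        # running estimate of E[rho * delta * x]
    for x, reward, x_next, rho in transitions:
        delta = reward + gamma * (w @ x_next) - (w @ x)
        C += np.outer(x, x)
        g += rho * delta * x
    C /= len(transitions)
    g /= len(transitions)
    return g @ np.linalg.solve(C, g)   # g^T C^{-1} g
```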

Section 11.7

Gradient-TD Methods

Stable algorithms that minimize MSPBE.

Finally — algorithms that work!

Gradient-TD methods do true stochastic gradient descent on MSPBE. They're guaranteed to converge even with off-policy data and function approximation! The key insight: use a second set of weights to estimate intermediate quantities.

MSPBE Objective

Gradient-TD minimizes the Mean Squared Projected Bellman Error:

\overline{\text{PBE}}(\mathbf{w}) = \big( \mathbf{X}^\top \mathbf{D} (\mathbf{X}\mathbf{w} - B_\pi \mathbf{X}\mathbf{w}) \big)^\top (\mathbf{X}^\top \mathbf{D} \mathbf{X})^{-1} \big( \mathbf{X}^\top \mathbf{D} (\mathbf{X}\mathbf{w} - B_\pi \mathbf{X}\mathbf{w}) \big)

This looks scary! But the gradient can be estimated from samples using a clever trick.

GTD2 Algorithm

Gradient TD with gradient correction, using auxiliary weights v:

Update weights w:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \big( \mathbf{x}_t - \gamma \mathbf{x}_{t+1} \big) (\mathbf{x}_t^\top \mathbf{v}_t)

Update auxiliary weights v:

\mathbf{v}_{t+1} = \mathbf{v}_t + \beta \rho_t \big( \delta_t - \mathbf{x}_t^\top \mathbf{v}_t \big) \mathbf{x}_t

What does v do?

v estimates the "direction of the Bellman error" in feature space. It's like computing an expected TD error, but projected onto our features. This lets us compute the MSPBE gradient without matrix inversions!
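A sketch of one GTD2 step with linear features (w, v, x, x_next are NumPy arrays of the same length; α and β are the two step sizes):

```python
def gtd2_update(w, v, x, x_next, reward, rho, alpha=0.005, beta=0.05, gamma=0.99):
    """One GTD2 step: main weights w plus auxiliary weights v.
    rho is the importance sampling ratio for the action actually taken."""
    delta = reward + gamma * (w @ x_next) - (w @ x)
    w_new = w + alpha * rho * (x - gamma * x_next) * (x @ v)   # stochastic MSPBE-gradient step
    v_new = v + beta * rho * (delta - x @ v) * x               # auxiliary weights track delta
    return w_new, v_new
```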

TDC (TD with Gradient Correction)

An equivalent formulation that's closer to standard TD:

Weight update:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \big( \delta_t \mathbf{x}_t - \gamma (\mathbf{x}_t^\top \mathbf{v}_t) \mathbf{x}_{t+1} \big)
Same v update as GTD2

Intuition:

The first term (δtxt) is regular semi-gradient TD! The second term is a correction that fixes the bias from off-policy bootstrapping. Hence "TD with Gradient Correction".
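The TDC form differs only in the w update; the v update is identical to GTD2's. Continuing the sketch above (same NumPy-array assumptions):

```python
def tdc_update(w, v, x, x_next, reward, rho, alpha=0.005, beta=0.05, gamma=0.99):
    """One TDC step: the semi-gradient TD term plus a gradient-correction term."""
    delta = reward + gamma * (w @ x_next) - (w @ x)
    w_new = w + alpha * rho * (delta * x - gamma * (x @ v) * x_next)  # TD term + correction
    v_new = v + beta * rho * (delta - x @ v) * x                      # same as GTD2
    return w_new, v_new
```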

Semi-gradient TD vs Gradient-TD

Property | Semi-gradient TD | Gradient-TD
Off-policy convergence | Can diverge | Guaranteed
Computation | O(d) | O(d), with an extra v vector
Hyperparameters | One (α) | Two (α, β)
Objective minimized | None (not a true gradient) | MSPBE

Action-Value Versions

GTD2 and TDC have natural extensions to action values q(s, a). GQ(λ) is the eligibility-trace version for control. These are the foundations for stable off-policy deep RL!

Section 11.8

Emphatic-TD Methods

Reweighting updates to fix the distribution mismatch.

What if we could fix the distribution?

Off-policy divergence happens because we update states according to the behavior distribution, but need estimates for the target distribution. Emphatic-TD reweights updates to simulate the correct distribution, making even semi-gradient TD convergent!

The Emphasis Idea

Give more weight to states that would be visited more often under the target policy:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha M_t \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)

The new term Mt is the emphasis, which accumulates the importance of updating at the current state.

Emphatic Weightings

The emphasis Mt follows a recursive update:

M_t = \gamma \rho_{t-1} M_{t-1} + I_t

What Each Term Means:

Mt = emphasis (how much to weight this update)
It = interest (how much we care about state St, often 1)
ρt-1 = importance ratio from the previous step
γ = discount factor (decays emphasis over time)

Intuition: Accumulated Importance

Mt tracks how "responsible" the target policy is for reaching state St:

High Mt

Target policy π would visit this state often. Give updates more weight!

Low Mt

Target policy rarely visits here (we got here via behavior policy). Downweight to avoid distorting estimates!
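A sketch of one emphatic TD(0) step with linear features, carrying the emphasis M and the previous step's ρ between calls; the interest is fixed at 1 here for simplicity:

```python
def emphatic_td0_update(w, x, x_next, reward, rho, M_prev, rho_prev,
                        alpha=0.01, gamma=0.99, interest=1.0):
    """One emphatic TD(0) step (w, x, x_next are NumPy arrays).
    Returns the new weights plus (M, rho), which must be carried to the next call;
    start a trajectory with M_prev = 0.0."""
    M = gamma * rho_prev * M_prev + interest         # emphasis recursion M_t
    delta = reward + gamma * (w @ x_next) - (w @ x)
    w_new = w + alpha * M * rho * delta * x          # emphasis- and rho-weighted TD update
    return w_new, M, rho
```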

Convergence Guarantee

Emphatic TD(λ) with linear function approximation converges to the TD fixed point under off-policy training, assuming:

  • Behavior policy b covers target π (b(a|s) > 0 wherever π(a|s) > 0)
  • Step size α decreases appropriately
  • Interest It is bounded

The Variance Trade-off

Emphatic weighting can have high variance:

  • Mt accumulates discounted products of importance ratios, which can grow large
  • Large Mt means large updates → high variance in learning
  • May need smaller step sizes or variance reduction techniques

Section 11.9

Reducing Variance

Making off-policy learning practical.

Making importance sampling practical!

Off-policy methods multiply by importance ratios ρ, which can be huge. This causes high variance that slows learning. This section covers techniques to reduce variance while keeping the algorithms correct.

The Variance Problem

Products of importance ratios can explode:

\rho_{t:T} = \prod_{k=t}^{T} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}

If each ρ ≈ 2, then over 10 steps: ρt:T ≈ 2¹⁰ = 1024! Updates become extremely noisy.
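A quick sketch of how fast these products blow up, and, as a preview of the fixes below, how much simple clipping tames them; the ratios are random illustrative values, not from any particular task:

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = rng.uniform(0.5, 2.5, size=(10_000, 10))   # made-up per-step ratios, 10-step trajectories

full = ratios.prod(axis=1)                           # uncorrected products rho_{t:T}
clipped = np.minimum(ratios, 1.0).prod(axis=1)       # truncated IS: use min(rho, 1) each step

print(f"full    : max {full.max():8.1f}   std {full.std():7.2f}")
print(f"clipped : max {clipped.max():8.1f}   std {clipped.std():7.2f}")
```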

Variance Reduction Techniques

Truncated Importance Sampling

Clip ρ to a maximum value (like 1.0). Introduces bias but dramatically reduces variance.

\bar{\rho}_t = \min(\rho_t, c)
Weighted Importance Sampling

Normalize by the sum of importance ratios. Biased but bounded variance. No explosive growth!

Control Variates

Subtract a baseline that has the same expectation but lower variance. Classic variance reduction from statistics!

Adaptive Methods

Automatically balance bias and variance based on observed ratios. Tree Backup, Retrace(λ), and V-trace from deep RL.

Modern Solutions: Retrace(λ) and V-trace

These methods adaptively truncate importance ratios to balance bias-variance:

Retrace(λ)

Uses truncated ratios cs = λ min(1, ρs). Safe off-policy learning with eligibility traces.

V-trace (IMPALA)

Clips both ρ and c separately. Powers distributed training in IMPALA with highly off-policy data.

Practical Trade-offs

In practice, some bias is often acceptable in exchange for a large variance reduction. Deep RL systems handle this in different ways: DQN sidesteps importance sampling entirely by using one-step Q-learning targets, while IMPALA's V-trace and Retrace-style agents truncate the ratios. The theory says "biased", but in practice these systems work very well.

Section 11.10

Summary

Key takeaways from off-policy learning with approximation.

Chapter 11 Key Takeaways

1. The Deadly Triad causes instability. Function approximation + bootstrapping + off-policy = potential divergence. Remove any one element and you're safe.

2. MSBE is not learnable. The true Bellman error depends on transition dynamics, not just data. Use MSPBE instead!

3. Gradient-TD methods are stable. GTD2 and TDC do true gradient descent on MSPBE, guaranteeing convergence even off-policy.

4. Emphatic-TD reweights updates. By emphasizing states according to target-policy visitation, even semi-gradient TD becomes convergent.

5. Variance is the practical bottleneck. Importance sampling products can explode. Truncation, weighted IS, and adaptive methods (V-trace, Retrace) make off-policy practical.

Formula Cheat Sheet

Importance Sampling Ratio

\rho_t = \frac{\pi(A_t|S_t)}{b(A_t|S_t)}

Semi-gradient Off-policy TD

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w})

TDC Update

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \big( \delta_t \mathbf{x}_t - \gamma (\mathbf{x}_t^\top \mathbf{v}_t) \mathbf{x}_{t+1} \big)

Emphatic Weighting

M_t = \gamma \rho_{t-1} M_{t-1} + I_t

Quick Glossary

Importance Sampling

Reweighting samples from one distribution to estimate expectations under another

Deadly Triad

Function approximation + bootstrapping + off-policy training

MSPBE

Mean Squared Projected Bellman Error — the right objective for TD

Gradient-TD

True gradient descent methods (GTD2, TDC) that converge off-policy

Emphatic-TD

Methods that reweight updates by accumulated importance

Baird's Counterexample

Famous MDP showing semi-gradient TD divergence

Common Mistakes to Avoid

Assuming Q-learning always works: With function approximation, even Q-learning can diverge! Need special care.

Ignoring importance sampling variance: Products of ρ can explode. Always consider truncation or weighted IS in practice.

Confusing MSBE and MSPBE: MSBE is not learnable from data! MSPBE is the correct objective for model-free learning.

Forgetting the auxiliary vector: Gradient-TD methods need both w and v. Don't skip the v updates!