Chapter 11

Off-policy Methods with Approximation

Explore the treacherous territory where off-policy learning meets function approximation. Understand why things can go wrong — and how to fix them!

Learning from others' experience — carefully!

Off-policy learning lets us learn about one policy (the target) while following another (the behavior). Super powerful! But when we add function approximation, things can go dangerously wrong. Values can diverge to infinity! This chapter explains the "deadly triad" and introduces stable algorithms like Gradient-TD that guarantee convergence.

  • The Deadly Triad: why TD can diverge
  • Gradient-TD: stable off-policy learning
  • Emphatic-TD: proper emphasis weighting

The Big Picture: The Challenge of Off-Policy + Approximation

  • Tabular off-policy (Chapter 6: Safe ✓). Q-learning with tables converges under the usual step-size and exploration conditions; each state-action entry is updated independently.
  • On-policy + approximation (Chapter 10: Safe ✓). Semi-gradient Sarsa converges reliably: the updates come from the same data distribution the policy itself generates.
  • Off-policy + approximation (Chapter 11: Special care needed). DANGER! Standard TD can diverge to ±∞.

Why care about off-policy? Because it's how we do experience replay, learn from demonstrations, or explore with one policy while improving another. Modern deep RL depends on it!

Section 11.1

Semi-gradient Methods

Extending semi-gradient TD to off-policy learning.

The importance sampling trick!

When our behavior policy b chooses different actions than our target policy π, we need to reweight updates. If π would do an action more often than b, we boost the update; if less often, we shrink it. This correction is called importance sampling.

Importance Sampling Ratio

The per-step importance sampling ratio corrects for the difference between target and behavior policies:

\rho_t \doteq \frac{\pi(A_t | S_t)}{b(A_t | S_t)}

What Each Term Means:

π(A|S) = probability the target policy π would take action A in state S
b(A|S) = probability the behavior policy b takes action A in state S
ρt = how much more (or less) likely π is to do what b did

Understanding ρ Through Examples

  • ρ = 1: π and b agree perfectly on this action (π(A|S) = b(A|S)). No correction needed.
  • ρ = 2: π would do this 2× more often than b (π(A|S) = 0.4, b(A|S) = 0.2). Boost the update!
  • ρ = 0: π would never take this action (π(A|S) = 0 for a deterministic π). Ignore the update completely!
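To sanity-check these numbers, here is a minimal sketch in plain Python; the policy probabilities are illustrative, not taken from any particular MDP:

```python
def importance_ratio(pi_prob, b_prob):
    """rho_t = pi(A_t|S_t) / b(A_t|S_t); requires coverage, i.e. b(A|S) > 0."""
    assert b_prob > 0, "behavior policy must give nonzero probability to the action taken"
    return pi_prob / b_prob

print(importance_ratio(0.3, 0.3))  # 1.0 -> pi and b agree, no correction
print(importance_ratio(0.4, 0.2))  # 2.0 -> pi prefers this action, boost the update
print(importance_ratio(0.0, 0.5))  # 0.0 -> pi would never do this, the update is zeroed out
```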

Semi-gradient Off-policy TD(0)

The off-policy version of semi-gradient TD multiplies by the importance sampling ratio:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)

where δt = Rt+1 + γv̂(St+1, w) − v̂(St, w) is the TD error.

Why only ρt?

Because TD bootstraps! We only need to correct for the one action we took. The next state's value v̂(St+1) is already our estimate regardless of what action led there. This is simpler than Monte Carlo's product of all ρ's!
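With linear features the whole update is a few lines. A minimal sketch, assuming w and the feature vectors are NumPy arrays and that ρ has already been computed for the sampled action:

```python
def offpolicy_td0_update(w, x, x_next, reward, rho, alpha=0.01, gamma=0.99, done=False):
    """One semi-gradient off-policy TD(0) step with linear v_hat(s) = w @ x(s).

    w, x, x_next are NumPy arrays; rho is pi(A|S) / b(A|S) for the action actually taken.
    """
    v_next = 0.0 if done else w @ x_next
    delta = reward + gamma * v_next - (w @ x)   # TD error
    return w + alpha * rho * delta * x          # gradient of linear v_hat w.r.t. w is just x
```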

Action-Value Methods: Expected Sarsa & Q-learning

For action-value functions, we have more options for the target:

Semi-gradient Expected Sarsa
\mathbf{w} \leftarrow \mathbf{w} + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w})

where δt = R + γ Σa π(a|S') q̂(S', a) − q̂(S, A)

No ρ needed — expectation covers all actions!

Semi-gradient Q-learning
\mathbf{w} \leftarrow \mathbf{w} + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w})

where δt = R + γ maxa q̂(S', a) − q̂(S, A)

Also no ρ — target is greedy max!
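Here is a sketch of the semi-gradient Q-learning update with linear action values; the feature function x(s, a) and the list of next-state feature vectors are assumed inputs, not anything defined in the text:

```python
def semi_gradient_q_learning_update(w, x_sa, next_action_features, reward,
                                    alpha=0.01, gamma=0.99, done=False):
    """One semi-gradient Q-learning step with linear q_hat(s, a) = w @ x(s, a).

    x_sa                 -- feature vector (NumPy array) for the (S, A) actually taken
    next_action_features -- list of feature vectors x(S', a') for every action a' in S'
    """
    max_next = 0.0 if done else max(w @ x for x in next_action_features)  # greedy target
    delta = reward + gamma * max_next - (w @ x_sa)
    return w + alpha * delta * x_sa   # no importance ratio: the target is the greedy max
```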

Convergence Not Guaranteed!

Unlike tabular methods, semi-gradient off-policy TD with function approximation can diverge — weights can grow to infinity! This includes Q-learning with linear function approximation. The next sections explain why and what to do about it.

Section 11.2

Examples of Off-policy Divergence

When semi-gradient methods go wrong.

When values explode!

Here's the scary part: even simple problems with linear function approximation can cause semi-gradient TD to diverge to ±∞. Baird's counterexample is the most famous demonstration. It's not a bug — it's a fundamental incompatibility between bootstrapping, off-policy updates, and function approximation.

Baird's Counterexample

A 7-state MDP with linear function approximation where semi-gradient TD(0) diverges:

The Setup
States: 7 states, with 6 "top" states and 1 "bottom" state (state 7)
Actions: dashed (to one of the six top states, chosen uniformly) or solid (to bottom state 7)
Behavior b: takes dashed with probability 6/7 and solid with probability 1/7
Target π: always takes the solid action (always goes to state 7)
Rewards: all zero! γ = 0.99
What Happens

True values: vπ(s) = 0 for all states (no rewards ever).

But with semi-gradient TD(0), the weights diverge to infinity!

The problem: function approximation links the states, so updating one affects others in ways that compound catastrophically under off-policy distribution.
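The divergence is easy to reproduce. Below is a minimal simulation of Baird's counterexample with semi-gradient off-policy TD(0). The features, the 6/7 vs. 1/7 behavior policy, γ = 0.99, α = 0.01, and the initial weights follow the standard presentation of this counterexample, but treat it as a sketch rather than a definitive reference implementation:

```python
import numpy as np

# Baird's counterexample: 7 states, linear features, all rewards zero, gamma = 0.99.
# v_hat(s_i) = 2*w_i + w_8 for the six top states; v_hat(s_7) = w_7 + 2*w_8 (8 weights total).
X = np.zeros((7, 8))
for i in range(6):
    X[i, i] = 2.0
    X[i, 7] = 1.0
X[6, 6] = 1.0
X[6, 7] = 2.0

gamma, alpha = 0.99, 0.01
w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])   # customary initial weights
rng = np.random.default_rng(0)

state = 6
for step in range(1000):
    # Behavior policy: dashed with prob 6/7 (to a random top state), solid with prob 1/7.
    if rng.random() < 6 / 7:
        next_state, rho = rng.integers(6), 0.0    # dashed: pi never takes it, rho = 0
    else:
        next_state, rho = 6, 7.0                   # solid: rho = pi/b = 1 / (1/7) = 7
    x, x_next = X[state], X[next_state]
    delta = 0.0 + gamma * (w @ x_next) - (w @ x)   # all rewards are zero
    w = w + alpha * rho * delta * x                # semi-gradient off-policy TD(0)
    state = next_state

print(np.linalg.norm(w))   # the weight magnitudes keep growing as training continues
```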

Why Does This Happen?

1. Off-policy distribution mismatch. We update states 1-6 often (behavior policy), but π goes straight to state 7. The distribution we update from ≠ the distribution we care about.

2. Function approximation couples states. Linear features share weights, so updating v̂(s1) changes v̂(s7) too!

3. Bootstrapping creates feedback loops. We update towards v̂(S'), which is also wrong. Errors compound!

Even On-Policy Can Diverge!

Tsitsiklis and Van Roy (1997) showed divergence even for on-policy semi-gradient TD with nonlinear function approximation (neural networks)! The problem isn't just off-policy.

Key insight: Divergence can occur whenever the update distribution doesn't match what the algorithm "expects." Off-policy makes this worse, but it's fundamentally about the interaction of bootstrapping + function approximation.

For the Curious Mind: Why Tabular Q-Learning Is Safe

Tabular Q-learning converges even off-policy because each state-action pair is updated independently. There's no coupling through shared weights. The "function approximation" in tabular methods is a lookup table where each entry is independent!

Fun fact: You can view tabular as linear function approximation with one-hot features. Each state gets its own dedicated weight, so there's no interference.

Section 11.3

The Deadly Triad

Three ingredients that together spell danger.

The three horsemen of instability!

The deadly triad names the three elements that, when combined, can cause instability and divergence. Remove any one, and things work fine. But all three together? Watch out! Understanding this helps us design stable algorithms.

The Deadly Triad

Instability and divergence can occur when all three are present:

Function Approximation

Generalizing across states (neural nets, linear features, etc.)

Bootstrapping

Updating towards estimates that include existing estimates (TD, not MC)

Off-policy Training

Learning about policy π from data generated by different policy b

Remove One Element → Safety!

No Function Approximation

Tabular methods! Q-learning, SARSA with tables all converge. Each state is independent.

No Bootstrapping

Monte Carlo methods! Wait for complete returns. No dependence on current estimates. Gradient descent on true returns converges.

No Off-policy

On-policy methods! Semi-gradient Sarsa, TD(λ) with same policy converge (with linear approximation at least).

Why Do We Want All Three?

Function Approximation

Essential for large state spaces — can't have a table for every pixel combination!

Bootstrapping

Much faster learning than MC. No need to wait for episode ends. Lower variance.

Off-policy

Experience replay, learning from demonstrations, parallel data collection, exploration!

The challenge: Modern deep RL (DQN, etc.) uses all three! So we need special techniques to make it work — target networks, experience replay sampling, clipping, and the methods in this chapter.

The Geometry of Divergence

Divergence happens when updates systematically push weights in directions that increase errors on states we don't visit (under behavior policy), creating a feedback loop. The "hill" we're climbing gets steeper and steeper.

The next sections develop the math to understand this precisely and find algorithms that do converge!

Section 11.4

Linear Value-function Geometry

Understanding value functions as vectors in high-dimensional space.

Value functions live in a geometric space!

Think of every possible value function as a point in a high-dimensional space (one dimension per state). Linear function approximation restricts us to a subspace — a flat sheet within this space. This geometric view helps us understand what TD is really doing!

Value Functions as Vectors

A value function over |S| states can be written as a vector:

v = [v(s_1), v(s_2), \ldots, v(s_{|S|})]^\top \in \mathbb{R}^{|S|}

The true value function vπ is one specific point. Function approximation constrains us to a subset of this space.

The Linear Subspace

With linear function approximation v̂(s) = wTx(s), we can only represent value functions that are linear combinations of our features.

\hat{v}_\mathbf{w} = \mathbf{X}\mathbf{w}

Where:

  • X is the |S| × d feature matrix (each row is features for one state)
  • w is the d-dimensional weight vector
  • The set of all achievable v̂w forms a d-dimensional subspace

Projection Operator

The projection Π finds the closest point in our representable subspace:

\Pi v = \arg\min_{v' \in \text{subspace}} \| v - v' \|^2_\mu

where ||·||μ is the norm weighted by state distribution μ.

Intuition:

If you can't represent vπ exactly, find the best approximation you can represent. Like projecting a 3D point onto a 2D plane — you lose some information but get as close as possible.

The Bellman Operator

The Bellman operator Bπ takes any value function and produces a "better" one:

(B_\pi v)(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) [r + \gamma v(s')]
Fixed Point

vπ = Bπvπ

The true value function is unchanged by Bπ

Problem

Bπv̂w usually lands outside our subspace!

Can't represent Bπv̂ with our features

The Projected Bellman Equation

TD methods try to find the fixed point of the projected Bellman operator:

\hat{v}_\mathbf{w} = \Pi B_\pi \hat{v}_\mathbf{w}

Apply Bellman, then project back to representable subspace. The TD fixed point is where this cycle stabilizes.
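To make the projected Bellman equation concrete, here is a sketch that computes the linear TD fixed point for a small, made-up Markov reward process (the transition matrix, rewards, and features are illustrative assumptions, and the on-policy distribution μ is used for the projection for simplicity). For linear features the fixed point solves A w = b with A = Xᵀ D (I - γP) X and b = Xᵀ D r:

```python
import numpy as np

# A tiny, made-up 3-state Markov reward process (illustrative numbers only).
P = np.array([[0.0, 1.0, 0.0],     # transition matrix under the target policy
              [0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0]])
r = np.array([0.0, 0.0, 1.0])      # expected reward on leaving each state
gamma = 0.9

X = np.array([[1.0, 0.0],          # 2 features for 3 states: a genuine approximation
              [1.0, 1.0],
              [0.0, 1.0]])

# Stationary distribution mu of P (the state weighting used in the projection).
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()
D = np.diag(mu)

# Linear TD fixed point: the projected Bellman equation v_w = Pi B_pi v_w
# reduces to the linear system A w = b for linear features.
A = X.T @ D @ (np.eye(3) - gamma * P) @ X
b = X.T @ D @ r
w_td = np.linalg.solve(A, b)

v_true = np.linalg.solve(np.eye(3) - gamma * P, r)   # exact v_pi for comparison
print("TD fixed point values:", X @ w_td)
print("True values:          ", v_true)
```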

Geometric Picture

Imagine a 2D subspace (blue plane) in 3D value-function space:

  • vπ: true value (may be off the plane)
  • Πvπ: projection onto plane (best we can do)
  • TD fixed point: ΠBπ iteration converges here

On-policy: TD finds a point on the plane that's close to vπ (usually).

Off-policy: The projection weights errors by the behavior policy's state distribution μ, and ΠBπ may no longer be a contraction. Iterates can spiral outward → divergence.

Section 11.5

Gradient Descent in the Bellman Error

A natural idea that doesn't quite work.

What if we just minimize the Bellman error?

The TD error measures how far we are from satisfying the Bellman equation. Natural idea: do gradient descent on this error! This section shows why the obvious approach doesn't work, leading us to better solutions.

Bellman Error

The Bellman error at state s measures violation of the Bellman equation:

\bar{\delta}_\mathbf{w}(s) = \mathbb{E}_\pi \big[ R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \mid S_t = s \big]

This is the expected TD error from state s. If v̂ = vπ, this would be zero for all states.

Mean Squared Bellman Error (MSBE)

Average the squared Bellman error over the state distribution:

\overline{\text{BE}}(\mathbf{w}) = \sum_s \mu(s) \bar{\delta}_\mathbf{w}(s)^2

Minimizing MSBE seems natural — make the value function satisfy Bellman everywhere!

The Problem: Double Sampling

To compute the gradient of MSBE, we need:

\nabla \overline{\text{BE}}(\mathbf{w}) = 2 \sum_s \mu(s) \bar{\delta}_\mathbf{w}(s) \nabla \bar{\delta}_\mathbf{w}(s)

But δ̄w(s) involves an expectation over next states, and so does ∇δ̄w(s)!

We need two independent samples of the next state from the same (s, a) to get an unbiased estimate. Usually we only have one!

Naive Residual Gradient: Biased!

If we naively use the same sample for both parts:

\mathbf{w} \leftarrow \mathbf{w} + \alpha \delta_t \big( \nabla \hat{v}(S_t) - \gamma \nabla \hat{v}(S_{t+1}) \big)

This is the naive residual-gradient algorithm. It converges reliably, but it minimizes the mean squared TD error rather than the Bellman error, so it generally settles on the wrong values. The bias comes from using one sample where two independent ones are needed.
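With linear features the naive residual-gradient step is tiny; a sketch, again assuming NumPy arrays for the weights and feature vectors:

```python
def naive_residual_gradient_update(w, x, x_next, reward, alpha=0.01, gamma=0.99):
    """Naive residual-gradient step: descends the squared TD error, reusing the same
    sampled next state in both delta and its gradient (this reuse is the source of the bias).
    With linear v_hat the gradients are simply the feature vectors."""
    delta = reward + gamma * (w @ x_next) - (w @ x)
    return w + alpha * delta * (x - gamma * x_next)
```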

When Double Sampling Is Possible

Deterministic Environment

If S' is deterministic given (S, A), the two "independent" samples would always be identical anyway, so a single sample is unbiased.

Model Available

If we have a model, we can sample two independent next states from (S, A).

With double sampling, minimizing MSBE is valid. But there's another issue...

Section 11.6

The Bellman Error is Not Learnable

A deeper problem with MSBE as an objective.

MSBE is fundamentally flawed!

Even if we could minimize MSBE perfectly, it turns out to be the wrong objective. Different MDPs can produce identical data but have different optimal MSBE solutions! The true Bellman error is "not learnable" from data alone.

The Learnability Problem

Consider two different MDPs that produce exactly the same data distribution:

MDP A: two distinct states happen to share the same feature representation; each has its own deterministic transition and reward.

MDP B: a single state with that same feature representation, whose transitions and rewards are stochastic and match MDP A's overall statistics.

Both generate exactly the same distribution of observable data (features, actions, rewards), yet their Bellman errors differ, and so do the weights that minimize MSBE. We can't tell them apart from experience alone!

Observational Equivalence

The Bellman error depends on the transition dynamics of the MDP, not just the data distribution. But from finite samples, we can't distinguish:

What we observe: Sequences of (S, A, R, S') tuples

What BE needs: The full transition distribution p(s'|s,a)

This means MSBE is not a valid learning objective for model-free RL from experience!

The Solution: Projected Bellman Error (PBE)

Instead of MSBE, we should minimize the Mean Squared Projected Bellman Error (MSPBE):

\overline{\text{PBE}}(\mathbf{w}) = \| \Pi B_\pi \hat{v}_\mathbf{w} - \hat{v}_\mathbf{w} \|^2_\mu

The projection Π makes this learnable! We don't need full dynamics — just the projected version, which depends only on observable quantities.

Why MSPBE Works

  • Learnable: Can be estimated from samples without double sampling
  • Same fixed point: MSPBE = 0 at exactly the TD fixed point
  • Unique minimum: Convex in the linear case, guaranteed convergence

For the Curious Mind: Why the Projection Helps

The projection operator Π is defined by our features and state distribution. It turns the unknowable "true Bellman error" into something that depends only on expectations we can estimate from data.

Mathematically: ΠBv depends on E[x(S)x(S)T] and E[x(S)(R + γv(S'))], both of which are expectations over observable quantities!
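To see why this is estimable, note that with linear features the MSPBE can be written as E[ρ δ x]ᵀ E[x xᵀ]⁻¹ E[ρ δ x] (with ρ = 1 on-policy), and both expectations can be approximated by sample averages. A sketch, assuming a list of logged transitions (x, reward, x_next, ρ) gathered under the behavior policy:

```python
import numpy as np

def mspbe_estimate(w, transitions, gamma=0.99):
    """Estimate the MSPBE from logged transitions (x, reward, x_next, rho).

    Uses the identity MSPBE(w) = E[rho*delta*x]^T E[x x^T]^{-1} E[rho*delta*x],
    where every expectation is over the observed (behavior) distribution.
    """
    d = len(w)
    C = np.zeros((d, d))   # running estimate of E[x x^T]
    g = np.zeros(d)        # running estimate of E[rho * delta * x]
    for x, reward, x_next, rho in transitions:
        delta = reward + gamma * (w @ x_next) - (w @ x)
        C += np.outer(x, x)
        g += rho * delta * x
    C /= len(transitions)
    g /= len(transitions)
    return g @ np.linalg.solve(C, g)   # g^T C^{-1} g
```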

Section 11.7

Gradient-TD Methods

Stable algorithms that minimize MSPBE.

Finally — algorithms that work!

Gradient-TD methods do true stochastic gradient descent on MSPBE. They're guaranteed to converge even with off-policy data and function approximation! The key insight: use a second set of weights to estimate intermediate quantities.

MSPBE Objective

Gradient-TD minimizes the Mean Squared Projected Bellman Error:

\overline{\text{PBE}}(\mathbf{w}) = \big( \mathbf{X}^\top \mathbf{D} (\mathbf{X}\mathbf{w} - B_\pi \mathbf{X}\mathbf{w}) \big)^\top (\mathbf{X}^\top \mathbf{D} \mathbf{X})^{-1} \big( \mathbf{X}^\top \mathbf{D} (\mathbf{X}\mathbf{w} - B_\pi \mathbf{X}\mathbf{w}) \big)

This looks scary! But the gradient can be estimated from samples using a clever trick.

GTD2 Algorithm

Gradient TD with gradient correction, using auxiliary weights v:

Update weights w:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \big( \mathbf{x}_t - \gamma \mathbf{x}_{t+1} \big) (\mathbf{x}_t^\top \mathbf{v}_t)

Update auxiliary weights v:

\mathbf{v}_{t+1} = \mathbf{v}_t + \beta \rho_t \big( \delta_t - \mathbf{x}_t^\top \mathbf{v}_t \big) \mathbf{x}_t

What does v do?

v estimates the "direction of the Bellman error" in feature space. It's like computing an expected TD error, but projected onto our features. This lets us compute the MSPBE gradient without matrix inversions!
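A sketch of one GTD2 step with linear features (w, v, x, x_next are NumPy arrays of the same length; α and β are the two step sizes):

```python
def gtd2_update(w, v, x, x_next, reward, rho, alpha=0.005, beta=0.05, gamma=0.99):
    """One GTD2 step: main weights w plus auxiliary weights v.
    rho is the importance sampling ratio for the action actually taken."""
    delta = reward + gamma * (w @ x_next) - (w @ x)
    w_new = w + alpha * rho * (x - gamma * x_next) * (x @ v)   # stochastic MSPBE-gradient step
    v_new = v + beta * rho * (delta - x @ v) * x               # auxiliary weights track delta
    return w_new, v_new
```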

TDC (TD with Gradient Correction)

An equivalent formulation that's closer to standard TD:

Weight update:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \big( \delta_t \mathbf{x}_t - \gamma (\mathbf{x}_t^\top \mathbf{v}_t) \mathbf{x}_{t+1} \big)
Same v update as GTD2

Intuition:

The first term (δtxt) is regular semi-gradient TD! The second term is a correction that fixes the bias from off-policy bootstrapping. Hence "TD with Gradient Correction".
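The TDC form differs only in the w update; the v update is identical to GTD2's. Continuing the sketch above (same NumPy-array assumptions):

```python
def tdc_update(w, v, x, x_next, reward, rho, alpha=0.005, beta=0.05, gamma=0.99):
    """One TDC step: the semi-gradient TD term plus a gradient-correction term."""
    delta = reward + gamma * (w @ x_next) - (w @ x)
    w_new = w + alpha * rho * (delta * x - gamma * (x @ v) * x_next)  # TD term + correction
    v_new = v + beta * rho * (delta - x @ v) * x                      # same as GTD2
    return w_new, v_new
```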

Semi-gradient TD vs Gradient-TD

Property | Semi-gradient TD | Gradient-TD
Off-policy convergence | Can diverge | Guaranteed
Computation | O(d) | O(d), with an extra v vector
Hyperparameters | One (α) | Two (α, β)
Objective minimized | None (not a true gradient) | MSPBE

Action-Value Versions

GTD2 and TDC have natural extensions to action values q(s, a). GQ(λ) is the eligibility-trace version for control. These are the foundations for stable off-policy deep RL!

Section 11.8

Emphatic-TD Methods

Reweighting updates to fix the distribution mismatch.

What if we could fix the distribution?

Off-policy divergence happens because we update states according to the behavior distribution, but need estimates for the target distribution. Emphatic-TD reweights updates to simulate the correct distribution, making even semi-gradient TD convergent!

The Emphasis Idea

Give more weight to states that would be visited more often under the target policy:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha M_t \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)

The new term Mt is the emphasis, which accumulates the importance of updating at the current state.

Emphatic Weightings

The emphasis Mt follows a recursive update:

M_t = \gamma \rho_{t-1} M_{t-1} + I_t

What Each Term Means:

Mt = emphasis (how much to weight this update)
It = interest (how much we care about state St, often 1)
ρt-1 = importance ratio from the previous step
γ = discount factor (decays emphasis over time)

Intuition: Accumulated Importance

Mt tracks how "responsible" the target policy is for reaching state St:

High Mt

Target policy π would visit this state often. Give updates more weight!

Low Mt

Target policy rarely visits here (we got here via behavior policy). Downweight to avoid distorting estimates!
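A sketch of one emphatic TD(0) step with linear features, carrying the emphasis M and the previous step's ρ between calls; the interest is fixed at 1 here for simplicity:

```python
def emphatic_td0_update(w, x, x_next, reward, rho, M_prev, rho_prev,
                        alpha=0.01, gamma=0.99, interest=1.0):
    """One emphatic TD(0) step (w, x, x_next are NumPy arrays).
    Returns the new weights plus (M, rho), which must be carried to the next call;
    start a trajectory with M_prev = 0.0."""
    M = gamma * rho_prev * M_prev + interest         # emphasis recursion M_t
    delta = reward + gamma * (w @ x_next) - (w @ x)
    w_new = w + alpha * M * rho * delta * x          # emphasis- and rho-weighted TD update
    return w_new, M, rho
```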

Convergence Guarantee

Emphatic TD(λ) with linear function approximation converges to the TD fixed point under off-policy training, assuming:

  • Behavior policy b covers target π (b(a|s) > 0 wherever π(a|s) > 0)
  • Step size α decreases appropriately
  • Interest It is bounded

The Variance Trade-off

Emphatic weighting can have high variance:

  • Mt accumulates discounted products of importance ratios, which can grow large
  • Large Mt means large updates → high variance in learning
  • May need smaller step sizes or variance reduction techniques

Section 11.9

Reducing Variance

Making off-policy learning practical.

Making importance sampling practical!

Off-policy methods multiply by importance ratios ρ, which can be huge. This causes high variance that slows learning. This section covers techniques to reduce variance while keeping the algorithms correct.

The Variance Problem

Products of importance ratios can explode:

\rho_{t:T} = \prod_{k=t}^{T} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}

If each ρ ≈ 2, then over 10 steps: ρt:T ≈ 2¹⁰ = 1024! Updates become extremely noisy.
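A quick sketch of how fast these products blow up, and, as a preview of the fixes below, how much simple clipping tames them; the ratios are random illustrative values, not from any particular task:

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = rng.uniform(0.5, 2.5, size=(10_000, 10))   # made-up per-step ratios, 10-step trajectories

full = ratios.prod(axis=1)                           # uncorrected products rho_{t:T}
clipped = np.minimum(ratios, 1.0).prod(axis=1)       # truncated IS: use min(rho, 1) each step

print(f"full    : max {full.max():8.1f}   std {full.std():7.2f}")
print(f"clipped : max {clipped.max():8.1f}   std {clipped.std():7.2f}")
```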

Variance Reduction Techniques

Truncated Importance Sampling

Clip ρ to a maximum value (like 1.0). Introduces bias but dramatically reduces variance.

\bar{\rho}_t = \min(\rho_t, c)
Weighted Importance Sampling

Normalize by the sum of importance ratios. Biased but bounded variance. No explosive growth!

Control Variates

Subtract a baseline that has the same expectation but lower variance. Classic variance reduction from statistics!

Adaptive Methods

Automatically balance bias and variance based on observed ratios. Tree Backup, Retrace(λ), and V-trace from deep RL.

Modern Solutions: Retrace(λ) and V-trace

These methods adaptively truncate importance ratios to balance bias-variance:

Retrace(λ)

Uses truncated ratios cs = λ min(1, ρs). Safe off-policy learning with eligibility traces.

V-trace (IMPALA)

Clips both ρ and c separately. Powers distributed training in IMPALA with highly off-policy data.

Practical Trade-offs

In practice, some bias is often acceptable in exchange for a large variance reduction. Deep RL systems handle this in different ways: DQN sidesteps importance sampling entirely by using one-step Q-learning targets, while IMPALA's V-trace and Retrace-style agents truncate the ratios. The theory says "biased", but in practice these systems work very well.

Section 11.10

Summary

Key takeaways from off-policy learning with approximation.

Chapter 11 Key Takeaways

1. The Deadly Triad causes instability. Function approximation + bootstrapping + off-policy = potential divergence. Remove any one element and you're safe.

2. MSBE is not learnable. The true Bellman error depends on transition dynamics, not just data. Use MSPBE instead!

3. Gradient-TD methods are stable. GTD2 and TDC do true gradient descent on MSPBE, guaranteeing convergence even off-policy.

4. Emphatic-TD reweights updates. By emphasizing states according to target-policy visitation, even semi-gradient TD becomes convergent.

5. Variance is the practical bottleneck. Importance sampling products can explode. Truncation, weighted IS, and adaptive methods (V-trace, Retrace) make off-policy practical.

Formula Cheat Sheet

Importance Sampling Ratio

\rho_t = \frac{\pi(A_t|S_t)}{b(A_t|S_t)}

Semi-gradient Off-policy TD

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w})

TDC Update

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \big( \delta_t \mathbf{x}_t - \gamma (\mathbf{x}_t^\top \mathbf{v}_t) \mathbf{x}_{t+1} \big)

Emphatic Weighting

M_t = \gamma \rho_{t-1} M_{t-1} + I_t

Quick Glossary

Importance Sampling

Reweighting samples from one distribution to estimate expectations under another

Deadly Triad

Function approximation + bootstrapping + off-policy training

MSPBE

Mean Squared Projected Bellman Error — the right objective for TD

Gradient-TD

True gradient descent methods (GTD2, TDC) that converge off-policy

Emphatic-TD

Methods that reweight updates by accumulated importance

Baird's Counterexample

Famous MDP showing semi-gradient TD divergence

Common Mistakes to Avoid

Assuming Q-learning always works: With function approximation, even Q-learning can diverge! Need special care.

Ignoring importance sampling variance: Products of ρ can explode. Always consider truncation or weighted IS in practice.

Confusing MSBE and MSPBE: MSBE is not learnable from data! MSPBE is the correct objective for model-free learning.

Forgetting the auxiliary vector: Gradient-TD methods need both w and v. Don't skip the v updates!