Chapter 9

On-policy Prediction with Approximation

Scale RL to large (even continuous) state spaces using function approximation. Learn to generalize from limited experience.

TL;DR — The 30-Second Version

Problem: Tables don't work when you have millions of states.
Solution: Use a function v̂(s, w) with learnable weights w.
How: Gradient descent — take small steps to reduce error.
Options: Linear functions (simple, guaranteed to work) or neural nets (powerful but finicky).
Key Win: Learning about one state helps us predict values for similar states we've never seen!

What if there are too many states to remember them all?

So far, we've stored a separate value for each state — a table. But what if you have millions of states? Or continuous states like positions and velocities? The solution: function approximation. Instead of memorizing every value, we learn a function (like a neural network) that can estimate values for ANY state — even ones we've never seen before!

Think of it Like This: GPS vs. Memorizing Every Road

Tabular RL is like memorizing the exact driving time from every address to every other address in your city. For a small town with 100 locations, that's 10,000 entries — doable! But for a big city with 1 million locations? That's 1 trillion entries. Impossible!

Function Approximation is like having a GPS that learned general rules: "highways are faster," "rush hour is slow," "shorter distance usually means less time." The GPS has never been to YOUR exact location before, but it can estimate your drive time using these learned patterns!

Key insight: Instead of memorizing every answer, we learn the patterns that let us guess good answers for new situations.

Generalization

Learn once, apply everywhere

Neural Networks

Deep RL foundations

Feature Engineering

Tile coding, RBFs, & more

The Big Picture: From Tables to Functions

Tabular Methods (Part I)

Store V(s) or Q(s,a) in a table. One entry per state (or state-action pair).

Problem: Doesn't scale to large/continuous spaces!

Function Approximation (Part II)

Learn v̂(s, w) where w is a vector of parameters (weights).

Benefit: Generalize to unseen states!

This chapter's focus: Prediction — estimating vπ(s) using function approximation. Same policy throughout (on-policy).

See the Difference: Table vs. Function

Tabular Method

Every state has its own box

[Figure: a 5×5 grid of boxes, each holding one state's stored value]

25 states = 25 values to store

1 million states = 1 million values!

∞ states = impossible!

Function Approximation

Learn a formula with just a few numbers

State → f(state) → Value
w₁ = 2.3, w₂ = 1.7, w₃ = 0.9

Only 3 weights to learn!

3 weights work for ALL states

Even states we've never seen!

Works with continuous states too!

The Magic: Instead of storing millions of values, we learn a small set of weights that can compute the value for any state!

How Does the Learning Actually Work?

1. See State: Agent observes current state s
2. Make Guess: Compute v̂(s) using current weights
3. Get Feedback: See actual reward & next state
4. Find Error: Compare guess vs. reality
5. Adjust Weights: Nudge weights to reduce error

Repeat thousands of times → weights get better and better!

Gradient Descent: Rolling Downhill

Finding the best weights is like finding the lowest point in a valley

The Error Landscape

Error
↑
│  ╱╲      ╱╲     ← High error (bad weights)
│ ╱  ╲    ╱  ╲
│╱    ╲  ╱    ╲
│      ╲╱         ← Low error (good weights!)
└────────┴───────→ Weights

Each point represents different weight values.
Height = how wrong our predictions are.

How We Find the Bottom

1. Start Somewhere: Random weights = probably high on a hill
2. Feel the Slope: Gradient tells us which way is "down"
3. Take a Small Step: Move weights slightly downhill (α controls step size)
4. Repeat! Eventually reach the valley bottom = best weights

Key Terms You'll Need to Know

What is a "Weight" or "Parameter"?

A weight (written as w) is a number that our algorithm learns and adjusts. Think of it like a "dial" or "knob" that controls how the function behaves.

Example: If predicting house prices, one weight might control "how much does an extra bedroom add to price?" If w₁ = 50,000, then each bedroom adds $50,000 to our estimate.

What is a "Feature"?

A feature (written as x) is a number that describes something about a state. It's the "input" to our function that we use to make predictions.

Example: In a video game, features might be: distance to enemy (23 pixels), health remaining (75%), ammo count (12 bullets). These numbers describe the current state.

What is a "Gradient"?

The gradient (written as ∇) tells you which direction is "downhill" — the direction that reduces your error the fastest.

Analogy: Imagine you're blindfolded on a hilly landscape trying to find the lowest point. The gradient is like feeling which way the ground slopes under your feet — it points downhill!

What is a "Dot Product"?

A dot product (w⊤x) multiplies matching pairs and adds them up: w₁×x₁ + w₂×x₂ + ... = single number.

Example: w = [2, 3] and x = [4, 5]. Dot product = 2×4 + 3×5 = 8 + 15 = 23
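The same calculation in code, a throwaway snippet with the numbers above (NumPy used only for the dot product):

```python
import numpy as np

w = np.array([2.0, 3.0])   # weights
x = np.array([4.0, 5.0])   # features

value = np.dot(w, x)       # 2*4 + 3*5 = 23.0
print(value)               # 23.0
```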

Section 9.1

Value-function Approximation

Why tables aren't enough, and what to do about it.

The curse of large state spaces!

Imagine trying to play backgammon. There are about 10²⁰ possible positions. Even if you visited a million states per second, it would take millions of years to see them all once! We need a way to generalize — learn something useful from one state and apply it to similar states we've never seen.

Function Approximation

Instead of storing a separate value for each state, we learn an approximate value function:

\hat{v}(s, \mathbf{w}) \approx v_\pi(s)

Term Breakdown:

v̂(s, w)= our approximation of the value of state s
w= weight vector (the parameters we learn), typically |w| << |S|
vπ(s)= true value of state s under policy π

Why Function Approximation?

Scalability
  • Handle millions or infinite states
  • Work with continuous state spaces
  • Memory grows with parameters, not states
Generalization
  • Learn from one state, apply to similar states
  • Make reasonable predictions for unseen states
  • This is the key power of approximation!
Approximation as Supervised Learning

We can view each update as a training example for supervised learning:

Input: State s
Target: Some estimate of vπ(s) (like a return or TD target)

But there's a crucial difference from typical supervised learning: the target values are nonstationary (they change as we learn) and we don't have the luxury of multiple passes over a fixed training set.

For the Curious Mind

This is the Foundation of Deep RL!

When you hear about "Deep RL" or "DQN" — they're using neural networks as function approximators. The concepts in this chapter (especially gradient descent on value functions) are exactly what powers AlphaGo, OpenAI Five, and modern game-playing AI!

The Deadly Triad Warning

Combining function approximation + bootstrapping + off-policy learning can cause divergence — the weights explode to infinity! This chapter focuses on on-policy learning, which is safer. We'll address the challenges in later chapters.

Section 9.2

The Prediction Objective (V̄E)

What exactly are we trying to minimize?

You can't be perfect everywhere!

With function approximation, we usually have fewer parameters than states. This means we can't get the value exactly right for every state — we have to make tradeoffs. Some states will have more error than others. The question is: which states should we prioritize?

Mean Squared Value Error (V̄E)

Our objective is to minimize the weighted mean squared error between our approximation and the true value:

\overline{VE}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s) \big[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \big]^2

Term Breakdown:

V̄E(w)= "Value Error" — the objective we minimize
μ(s)= weight for state s (how much we care about errors there)
[...]²= squared error between true and estimated value

What is μ(s)?

The weighting μ(s) represents how much we care about accuracy at each state. A natural choice: the on-policy distribution — the fraction of time spent in each state while following policy π.

Intuition: We visit some states more often than others. It makes sense to be more accurate in states we actually encounter frequently!

Note: For continuing tasks, μ is the stationary distribution. For episodic tasks, it's the distribution of states encountered in episodes.

The Approximation Tradeoff

In general, we cannot find w that gives zero error everywhere. We must make tradeoffs:

  • Reducing error in one state might increase error in another
  • The objective V̄E tells us which tradeoffs to make
  • States with higher μ(s) get more attention

For the Curious Mind

Why squared error and not absolute error?

Squared error is differentiable everywhere (important for gradient methods) and penalizes large errors more heavily. But other objectives like Huber loss are sometimes used in practice, especially when dealing with outliers!

Section 9.3

Stochastic Gradient and Semi-gradient Methods

The core algorithm for learning with approximation.

Gradient descent: roll downhill!

Imagine the error V̄E as a landscape of hills and valleys, where each point represents different weight values. We want to find the lowest point (minimum error). Gradient descent works like rolling a ball downhill — we compute which direction is "down" (the gradient) and take a small step that way. Do this repeatedly and we reach a minimum!

Stochastic Gradient Descent (SGD)

Given a training example (state St with target value Ut), update weights:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \big[ U_t - \hat{v}(S_t, \mathbf{w}_t) \big] \nabla \hat{v}(S_t, \mathbf{w}_t)

Term Breakdown:

α= step-size (learning rate), controls how big each step is
Ut= target value (e.g., Monte Carlo return Gt)
[...]= error: difference between target and current estimate
∇v̂= gradient: vector of partial derivatives ∂v̂/∂wi

Understanding the Gradient

The gradient ∇v̂(s, w) is a vector that tells us how sensitive the value estimate is to each weight. If we increase weight wi by a tiny amount, how much does v̂(s, w) change?

\nabla \hat{v}(s, \mathbf{w}) = \begin{pmatrix} \frac{\partial \hat{v}(s, \mathbf{w})}{\partial w_1} \\ \frac{\partial \hat{v}(s, \mathbf{w})}{\partial w_2} \\ \vdots \\ \frac{\partial \hat{v}(s, \mathbf{w})}{\partial w_d} \end{pmatrix}

Intuition: The gradient points in the direction that most increases v̂. To decrease the error, we move in the direction of error × gradient.

Gradient Monte Carlo

When the target Ut = Gt (the actual return), we get gradient Monte Carlo:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \big[ G_t - \hat{v}(S_t, \mathbf{w}_t) \big] \nabla \hat{v}(S_t, \mathbf{w}_t)

This is a true gradient method — it converges to a local minimum of V̄E. The return Gt is an unbiased estimate of vπ(St).

Worked Example: One SGD Update Step-by-Step

Let's trace through exactly what happens in ONE update. Suppose we have a simple linear function with 2 weights.

Setup:
Current weights: w = [2.0, 1.0]
State features: x(s) = [3, 4]
Target (true return): G = 15
Step size: α = 0.1
Step 1: Compute current estimate: v̂(s) = w⊤x = 2.0×3 + 1.0×4 = 10
Step 2: Compute error: δ = G − v̂(s) = 15 − 10 = +5 (we underestimated!)
Step 3: Gradient (for linear): ∇v̂ = x(s) = [3, 4]
Step 4: Update: w ← w + α × δ × ∇v̂ = [2.0, 1.0] + 0.1 × 5 × [3, 4]
Step 5: Calculate: [2.0, 1.0] + [1.5, 2.0] = [3.5, 3.0]

Result: New weights are [3.5, 3.0]. Now v̂(s) = 3.5×3 + 3.0×4 = 10.5 + 12 = 22.5. We overshot past the target of 15 — with features this large, α = 0.1 is an aggressive step. That's okay: with a smaller step size we'd move more cautiously and eventually converge.
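The same arithmetic as a small Python function — a minimal sketch of the general SGD update for a linear estimate, with the weights, features, and target taken straight from the worked example (they are illustrative values, not from any real environment):

```python
import numpy as np

def sgd_update(w, x, target, alpha):
    """One SGD step for a linear value estimate v_hat = w . x.

    For linear v_hat the gradient with respect to w is just x, so the
    general rule w += alpha * (U - v_hat) * grad reduces to
    w += alpha * (U - v_hat) * x.
    """
    v_hat = np.dot(w, x)          # current estimate
    delta = target - v_hat        # error: target minus estimate
    return w + alpha * delta * x  # nudge weights toward the target

w = np.array([2.0, 1.0])          # current weights
x = np.array([3.0, 4.0])          # state features
G = 15.0                          # Monte Carlo return used as the target
alpha = 0.1

w_new = sgd_update(w, x, G, alpha)
print(w_new)                      # [3.5 3.0], matching the worked example
```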

Semi-gradient TD(0)

When we use a bootstrapped target (like TD), we get a semi-gradient method:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \big[ R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \big] \nabla \hat{v}(S_t, \mathbf{w}_t)

Why "semi-gradient"?

The target Rt+1 + γv̂(St+1, w) also depends on w, but we ignore this dependence! We don't take the gradient through the target. This makes it not a true gradient of any objective function, but it works well in practice.

Worked Example: Semi-gradient TD(0) Update

Let's see how TD learning works step-by-step. The agent is in state S, takes an action, gets reward R, and arrives in state S'.

Setup:
Current weights: w = [1.0, 2.0]
Current state features: x(S) = [1, 0]
Next state features: x(S') = [0, 1]
Reward received: R = 5
Discount factor: γ = 0.9, Step size: α = 0.1
Step 1: Current value: v̂(S) = w⊤x(S) = 1.0×1 + 2.0×0 = 1.0
Step 2: Next value: v̂(S') = w⊤x(S') = 1.0×0 + 2.0×1 = 2.0
Step 3: TD target = R + γ×v̂(S') = 5 + 0.9×2.0 = 6.8
Step 4: TD error: δ = target − v̂(S) = 6.8 − 1.0 = +5.8
Step 5: Update: w ← w + α×δ×x(S) = [1.0, 2.0] + 0.1×5.8×[1, 0] = [1.58, 2.0]

Notice: Only w₁ changed! That's because x(S) = [1, 0] — only feature 1 was "active" in state S. This is why linear methods with sparse features are efficient: we only update the weights for features that were present.
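And the TD(0) version of the same step, again with the example numbers plugged in (a sketch; in a real agent the features, reward, and next state would come from the environment rather than being hard-coded):

```python
import numpy as np

def semi_gradient_td0_update(w, x_s, x_s_next, reward, gamma, alpha):
    """One semi-gradient TD(0) step for a linear value estimate.

    The bootstrapped target R + gamma * v_hat(S') is treated as a
    constant: we do not differentiate through it (hence "semi").
    """
    v_s = np.dot(w, x_s)                 # v_hat(S)
    v_s_next = np.dot(w, x_s_next)       # v_hat(S')
    td_error = reward + gamma * v_s_next - v_s
    return w + alpha * td_error * x_s    # gradient of linear v_hat is x(S)

w = np.array([1.0, 2.0])
x_s = np.array([1.0, 0.0])
x_s_next = np.array([0.0, 1.0])

w_new = semi_gradient_td0_update(w, x_s, x_s_next, reward=5.0,
                                 gamma=0.9, alpha=0.1)
print(w_new)                             # [1.58 2.0], as in the example
```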

MC vs. TD with Approximation

Gradient MC
  • True gradient method
  • Converges to local minimum
  • Unbiased but high variance
  • Must wait for episode end
Semi-gradient TD
  • Not true gradient (semi)
  • Converges for linear (on-policy)
  • Biased but lower variance
  • Learns online, step-by-step

For the Curious Mind

The State Aggregation Special Case

State aggregation (grouping similar states together) is actually a special case of function approximation! Each group has one weight, and all states in that group share the same value. The gradient is 1 for the active group and 0 otherwise.

Section 9.4

Linear Methods

The simplest and most well-understood approximation.

The value is a weighted sum of features!

In linear function approximation, we represent each state as a vector of features, and the value is just the dot product of features with weights. Simple, interpretable, and guaranteed to converge (with TD on-policy)!

Linear Value Function

The approximate value is a linear combination of features:

\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i=1}^{d} w_i x_i(s)

Term Breakdown:

x(s)= feature vector for state s, a vector of d numbers
xi(s)= the i-th feature of state s (could be position, velocity, etc.)
wi= weight for feature i (learned parameter)
d= number of features (dimension of feature vector)

Visual: How Linear Value Function Works

Features x(s): x₁ = 3 (position), x₂ = 2 (velocity), x₃ = 1 (energy)
Weights w: w₁ = 2.0, w₂ = 1.5, w₃ = 0.5
Dot product: 3×2.0 + 2×1.5 + 1×0.5 = 6.0 + 3.0 + 0.5 = 9.5
Value: v̂(s) = 9.5

The value is simply: (feature₁ × weight₁) + (feature₂ × weight₂) + (feature₃ × weight₃)

Why Linear is Special

For linear approximation, the gradient is simply the feature vector:

\nabla \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)

This makes the update rule beautifully simple:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{x}(S_t)

Where δt = Rt+1 + γv̂(St+1, w) - v̂(St, w) is the TD error. We simply add the TD error times each feature to the corresponding weight!

Convergence of Linear TD(0)

For on-policy TD(0) with linear approximation, the weights converge to a unique solution:

\mathbf{w}_{TD} = \mathbf{A}^{-1}\mathbf{b}

where A and b are expectations over the on-policy distribution. This "TD fixed point" is guaranteed to exist and be unique (under standard conditions).

Error Bound for Linear TD

The TD fixed point isn't optimal, but it's guaranteed to be close:

\overline{VE}(\mathbf{w}_{TD}) \leq \frac{1}{1-\gamma} \min_\mathbf{w} \overline{VE}(\mathbf{w})

The TD solution is at most 1/(1-γ) times worse than the best possible. For γ = 0.9, that's at most 10× the minimum error — not bad for getting updates online!

For the Curious Mind

Linear is Surprisingly Powerful!

With good feature engineering (like tile coding), linear methods can approximate very complex value functions. They're also interpretable — you can look at the weights and understand what features matter most! Many practical RL systems still use linear approximation for its simplicity and convergence guarantees.

Section 9.5

Feature Construction for Linear Methods

How to represent states as feature vectors.

Features are the foundation!

Linear methods can only learn what we give them to work with. If we use bad features, no amount of learning will help. If we use good features, learning is easy! This section explores different ways to construct useful features.

Think of Features Like a Questionnaire

Imagine you want to predict someone's health. You could ask different types of questions:

Raw: "What's your age?" (just the number: 35)

Binned: "Are you 20-30? 30-40? 40-50?" (yes/no for each bin)

Polynomial: "Age?" + "Age²?" (captures non-linear effects)

Same state, different feature representations:

Raw age: [35]
Binned: [0, 1, 0, 0, 0]
Polynomial: [35, 1225]

Different features let the linear model learn different patterns!

Visual Guide: Feature Construction Methods

Polynomial: x, x², x³... (smooth curves)
Fourier: sin, cos waves (periodic patterns)
Tile Coding: grid regions (local + fast)
RBF: Gaussian bumps (smooth, local)

Most popular for RL: Tile coding — simple, fast, and works great!

Polynomials

Use polynomial combinations of state variables: x, x², xy, x²y, etc.

x_1(s) = s_1, \ x_2(s) = s_1^2, \ x_3(s) = s_1 s_2, \ldots

Good for smooth functions, but number of features grows exponentially with dimension.

Fourier Basis

Use sine and cosine functions at different frequencies.

x_i(s) = \cos(\pi \mathbf{c}_i \cdot s)

Great for periodic patterns. Can represent any function with enough terms.
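As a concrete sketch (assuming each state component has been normalized to [0, 1], which is how the Fourier basis is usually applied), the features for all integer coefficient vectors up to a given order could be built like this:

```python
import itertools
import numpy as np

def fourier_features(s, order):
    """Fourier cosine features for a state s with components in [0, 1].

    One feature per coefficient vector c in {0, ..., order}^d:
    x_i(s) = cos(pi * c_i . s).
    """
    d = len(s)
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=d)))
    return np.cos(np.pi * coeffs.dot(np.asarray(s)))

s = [0.3, 0.7]                    # example 2-D state already scaled to [0, 1]
x = fourier_features(s, order=2)  # (order + 1)^d = 9 features
print(x.shape)                    # (9,)
```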

Coarse Coding

Cover the state space with overlapping regions (receptive fields). Each feature indicates whether the state is in that region:

x_i(s) = \begin{cases} 1 & \text{if } s \text{ is in region } i \\ 0 & \text{otherwise} \end{cases}

Key insight: The size and shape of regions controls generalization. Large regions → generalize broadly. Small regions → generalize locally.

Tile Coding (The Practical Workhorse)

Tile coding uses multiple overlapping grids (tilings). Each tiling partitions the state space into tiles. The feature vector indicates which tile is active in each tiling.

Example: With 8 tilings, each state activates exactly 8 features (one per tiling). The tilings are offset from each other, so nearby states share most tiles.

  • Constant memory: Fixed number of tilings × tiles per tiling
  • Fast updates: Only active tiles need updating
  • Good generalization: Nearby states share tiles

Pro tip: With n tilings, set α = α₀/n so that the effective step size per update stays α₀ even though n features are active at once.

Worked Example: Tile Coding for Mountain Car

Imagine we're learning to drive a car up a mountain. The state has 2 dimensions: position (−1.2 to 0.5) and velocity (−0.07 to 0.07).

Setup: 2 tilings, each a 4×4 grid of tiles over (position, velocity).
Tiling 1 (no offset): the state falls in tile 5.
Tiling 2 (offset by half a tile): the state falls in tile 10.
State: position = -0.5, velocity = 0.02
Feature vector: x(s) has 32 entries (16 per tiling), all 0 except the two active tiles — tile 5 of tiling 1 and tile 10 of tiling 2 (entry 16 + 10 = 26 in the concatenated vector).
Value: v̂(s) = w₅ + w₂₆, the sum of the weights for the two active tiles.

Key Insight: Even though there are 32 weights (16 tiles × 2 tilings), only 2 are active at any time! Updates are fast: just add δ to 2 weights. And nearby states share tiles, so learning generalizes automatically.
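A minimal tile coder for this two-dimensional case might look like the following. This is a sketch, not Sutton's tile-coding software: it simply shifts each tiling by a fraction of a tile and returns the active tile indices, so the specific tile numbers depend on the grid layout and won't exactly match the labels in the example above.

```python
import numpy as np

def active_tiles(state, low, high, tiles_per_dim=4, num_tilings=2):
    """Return one active tile index per tiling for a 2-D state.

    Each tiling is a tiles_per_dim x tiles_per_dim grid over [low, high),
    shifted by a fraction of a tile so the tilings overlap differently.
    """
    low, high = np.asarray(low, float), np.asarray(high, float)
    scaled = (np.asarray(state, float) - low) / (high - low)  # in [0, 1)
    indices = []
    for t in range(num_tilings):
        offset = t / num_tilings / tiles_per_dim       # shift per tiling
        cells = np.floor((scaled + offset) * tiles_per_dim).astype(int)
        cells = np.clip(cells, 0, tiles_per_dim - 1)
        tile = cells[0] * tiles_per_dim + cells[1]     # flatten the 2-D cell
        indices.append(t * tiles_per_dim**2 + tile)    # offset by tiling
    return indices

# Mountain Car style ranges: position and velocity
idx = active_tiles([-0.5, 0.02], low=[-1.2, -0.07], high=[0.5, 0.07])
print(idx)   # two indices into a 32-weight vector, one per tiling
# v_hat(s) is then just w[idx].sum()
```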

Radial Basis Functions (RBFs)

Instead of binary (0/1) features, RBFs give continuous values based on distance from centers:

x_i(s) = \exp\left(-\frac{\|s - c_i\|^2}{2\sigma_i^2}\right)

Each feature is a "bump" centered at ci with width σi. States near the center activate the feature strongly; distant states activate it weakly.

For the Curious Mind

The Kernel Trick Connection

RBFs are related to kernel methods in machine learning. The feature space implicitly defined by an RBF kernel is infinite-dimensional! This connects to the powerful ideas of kernel-based learning (covered later in this chapter).

Tile Coding in Practice

Rich Sutton provides free tile coding software at his website. It's been used successfully in robotics, game playing, and many other domains. Simple to implement and surprisingly effective!

Section 9.6

Selecting Step-Size Parameters Manually

How to choose α for stable learning.

Finding the Goldilocks zone!

The step-size α is crucial. Too large → unstable learning, values oscillate wildly. Too small → painfully slow learning. We need to find the "just right" zone.

Visual: The Effect of Step Size (α)

α too small (α = 0.001): creeps toward the target — learns too slowly.
α just right (α = 0.1): converges smoothly. ✓
α too large (α = 1.0): oscillates around the target or diverges!

Rule of thumb: Start with α = 0.1 and adjust. If unstable, decrease. If too slow, increase.

A Rule of Thumb for Tile Coding

For tile coding with n tilings, a good starting point is:

\alpha = \frac{1}{n}

With n binary tiles active, one such update moves the estimate all the way to the target (each active weight absorbs 1/n of the error). That's aggressive, so in practice you may need to tune down from there.

General Guideline

For linear methods, the step-size should satisfy:

\alpha < \frac{1}{\mathbb{E}[\mathbf{x}^\top \mathbf{x}]}

where the expectation is over states sampled from the on-policy distribution. This is the expected squared norm of the feature vector.
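In practice E[x⊤x] can be estimated from a batch of sampled feature vectors, giving a concrete number to start from. A rough sketch, where the stand-in features mimic tile coding with 8 of 16 binary features active per state:

```python
import numpy as np

def max_stable_alpha(sampled_features):
    """Estimate the step-size bound alpha < 1 / E[x^T x] from samples."""
    sampled_features = np.asarray(sampled_features, float)
    expected_sq_norm = np.mean(np.sum(sampled_features ** 2, axis=1))
    return 1.0 / expected_sq_norm

# Stand-in data: binary feature vectors with 8 of 16 features active,
# as with 8 tilings; then E[x^T x] = 8 and the bound is alpha < 1/8.
sampled_features = np.zeros((1000, 16))
sampled_features[:, :8] = 1.0
print(max_stable_alpha(sampled_features))   # 0.125
```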

Dangers of Large Step-Sizes
  • Values can oscillate wildly around the true values
  • With bootstrapping (TD), weights can diverge to infinity!
  • Start conservative and increase if learning is too slow
Section 9.7

Nonlinear Function Approximation: Artificial Neural Networks

The power of deep learning for RL.

When linear isn't enough!

Linear methods require us to hand-craft good features. Neural networks can learn their own features! They can approximate almost any function, making them incredibly powerful for complex problems. This is the "deep" in Deep RL!

Think of it Like This: A Factory Assembly Line

Linear methods are like a single worker who looks at raw materials and directly estimates the final product's value. Simple, but they can only recognize obvious patterns.

Neural networks are like a factory with multiple assembly lines:

Raw State (pixels, positions) → Layer 1 (finds edges) → Layer 2 (finds shapes) → Output (value: 42.5)

Each layer transforms the input into something more useful. The network automatically learns what features matter — we don't have to design them by hand!

Neural Network Value Function

A neural network computes v̂(s, w) by passing the state through layers of neurons:

Input layer: Raw state features s
Hidden layers: Learn representations (features) automatically
Output layer: The value estimate v̂(s)

Each layer applies: y = σ(Wx + b) where σ is an activation function (ReLU, tanh, etc.)
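For concreteness, here is a tiny two-layer value network and its semi-gradient TD(0) update written out with NumPy. This is only a sketch with hand-rolled backprop and made-up state/reward numbers; real Deep RL code would use an autodiff library plus the stabilizers discussed below (experience replay, target networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny value network: 2-D state -> 8 hidden ReLU units -> scalar value.
W1, b1 = rng.normal(scale=0.1, size=(8, 2)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(1, 8)), np.zeros(1)

def v_hat(s):
    """Forward pass; also return the hidden activations for backprop."""
    h = np.maximum(0.0, W1 @ s + b1)       # ReLU hidden layer
    return (W2 @ h + b2)[0], h

def td0_update(s, r, s_next, gamma=0.99, alpha=1e-2):
    """One semi-gradient TD(0) step: backprop only through v_hat(s)."""
    global W1, b1, W2, b2
    v_s, h = v_hat(s)
    v_next, _ = v_hat(s_next)              # target term: no gradient taken
    delta = r + gamma * v_next - v_s       # TD error
    # Gradients of v_hat(s) with respect to each parameter.
    dW2, db2 = h[None, :], np.ones(1)
    dh = W2[0] * (h > 0)                   # ReLU derivative
    dW1, db1 = np.outer(dh, s), dh
    # Move each parameter in the direction alpha * delta * gradient.
    W2 += alpha * delta * dW2
    b2 += alpha * delta * db2
    W1 += alpha * delta * dW1
    b1 += alpha * delta * db1
    return delta

delta = td0_update(s=np.array([0.3, -0.1]), r=1.0,
                   s_next=np.array([0.25, -0.05]))
print(round(delta, 4))
```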

Visual: What a Neural Network Looks Like

[Figure: a small feedforward value network]
Input layer (x₁, x₂, x₃): the raw state (position, velocity, ...)
Hidden layer 1 (h₁, h₂, h₃) and hidden layer 2 (h₄, h₅): learned features, increasingly abstract (automatic!)
Output layer: a single value estimate v̂(s)

Deep RL Revolution

Neural networks enable RL to work directly with raw inputs like images:

Advantages
  • Learn features automatically
  • Handle high-dimensional inputs
  • Universal function approximators
Challenges
  • No convergence guarantees
  • Sensitive to hyperparameters
  • Sample inefficient
Training with Backpropagation

The gradient ∇v̂(s, w) is computed using backpropagation:

  1. Forward pass: Compute v̂(s) by passing s through the network
  2. Compute error: δ = target − v̂(s)
  3. Backward pass: Propagate gradients backward through the layers
  4. Update weights: w ← w + α δ ∇v̂(s)

For the Curious Mind

DQN: The Breakthrough

DeepMind's DQN (2015) combined Q-learning with deep neural networks to play Atari games directly from pixels! Key innovations: experience replay (storing past transitions) and target networks (freezing the target for stability).

Beyond Feedforward Networks

Modern Deep RL uses many architectures: CNNs for images, RNNs/Transformers for sequential data, attention mechanisms for variable-length inputs. The principles from this chapter apply to all of them!

Section 9.8

Least-Squares TD

Finding the optimal weights directly.

Skip the iteration, solve directly!

Standard TD updates gradually. But for linear methods, we can compute the TD fixed point directly by solving a system of equations. This is Least-Squares TD (LSTD). It's more sample-efficient but also more computationally expensive.

LSTD Solution

LSTD computes the weight vector directly:

\mathbf{w}_{TD} = \hat{\mathbf{A}}^{-1} \hat{\mathbf{b}}

Where:

\hat{\mathbf{A}} = \sum_{t=0}^{T-1} \mathbf{x}_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\top + \varepsilon \mathbf{I}
\hat{\mathbf{b}} = \sum_{t=0}^{T-1} R_{t+1} \mathbf{x}_t

Here xt = x(St) and εI is a small regularization term to ensure invertibility.
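A batch LSTD solver is only a few lines. In this sketch, `transitions` is assumed to be a list of (xₜ, Rₜ₊₁, xₜ₊₁) feature/reward triples collected while following the policy; the toy data at the bottom is a made-up two-state cycle just to show the call:

```python
import numpy as np

def lstd(transitions, gamma, d, epsilon=1e-3):
    """Compute w = A_hat^{-1} b_hat from a batch of transitions.

    transitions: iterable of (x_t, r_next, x_next) with d-dimensional
    feature vectors; epsilon * I regularizes A_hat so it is invertible.
    """
    A = epsilon * np.eye(d)
    b = np.zeros(d)
    for x, r, x_next in transitions:
        x, x_next = np.asarray(x, float), np.asarray(x_next, float)
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)

# Toy check: two states with one-hot features; reward 1 only from state 0.
transitions = [([1, 0], 1.0, [0, 1]), ([0, 1], 0.0, [1, 0])] * 50
w = lstd(transitions, gamma=0.9, d=2)
print(np.round(w, 2))   # approximate values of the two states
```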

LSTD Tradeoffs

Advantages
  • Makes maximum use of data
  • No step-size parameter α
  • Directly computes fixed point
  • Better sample efficiency
Disadvantages
  • O(d²) computation per step
  • O(d²) memory for the Â matrix
  • O(d³) for matrix inversion
  • Only for linear approximation

For the Curious Mind

Incremental LSTD

We can maintain Â and b̂ incrementally and use the Sherman-Morrison formula to update Â⁻¹ directly, avoiding full matrix inversion. This gives O(d²) per step instead of O(d³), making LSTD practical for moderate-sized feature vectors.

Section 9.9

Memory-based Function Approximation

Lazy learning: just remember and look up.

Why learn when you can remember?

Memory-based (or lazy) methods don't learn a global function. Instead, they store all training examples and, when asked about a new state, look up the most similar stored examples and combine their values.

Nearest Neighbor and Local Averaging

Given a query state s, memory-based methods:

  1. Find neighbors: Retrieve the k stored examples closest to s
  2. Combine values: Average (or weighted average) their stored values (see the sketch below)
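A distance-weighted k-nearest-neighbor predictor along these lines might look like this (a sketch; `memory` is assumed to hold (state, value) pairs stored from earlier experience, and the sample data is made up):

```python
import numpy as np

def knn_value(query, memory, k=3):
    """Estimate v(query) by distance-weighted averaging of the k nearest
    stored examples. memory is a list of (state_vector, value) pairs."""
    states = np.array([s for s, _ in memory], float)
    values = np.array([v for _, v in memory], float)
    dists = np.linalg.norm(states - np.asarray(query, float), axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-8)        # closer = more weight
    return np.average(values[nearest], weights=weights)

memory = [([0.0, 0.0], 1.0), ([1.0, 0.0], 2.0),
          ([0.0, 1.0], 3.0), ([1.0, 1.0], 4.0)]
print(round(knn_value([0.2, 0.1], memory, k=2), 2))
```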

Local vs. Global Learning

Global (Parametric)

Learn fixed parameters that work for all states. Every update affects the whole function.

Local (Memory-based)

Compute value on-the-fly from nearby examples. Each prediction is independent.

When Memory-based Works Well
  • Naturally handles nonstationary problems (old data can be forgotten)
  • No training required — just store examples
  • Can represent complex, locally varying functions
  • Requires fast nearest-neighbor lookup for efficiency (k-d trees, etc.)
Section 9.10

Kernel-based Function Approximation

Elegant mathematics for similarity-based learning.

Measuring similarity between states!

A kernel k(s, s') measures how similar two states are. Similar states should have similar values! Kernel methods generalize memory-based approaches with elegant mathematical foundations.

Kernel-based Value Approximation

Given stored examples (si, vi), the value of a query state s is:

\hat{v}(s) = \sum_{i} k(s, s_i) v_i

where k(s, si) is the kernel measuring similarity between s and si. Common choice: Gaussian kernel k(s, s') = exp(-||s-s'||²/2σ²).
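A direct implementation of this weighted sum with a Gaussian kernel (a sketch; the stored examples and σ below are arbitrary illustrative values):

```python
import numpy as np

def gaussian_kernel(s, s_prime, sigma=0.5):
    """k(s, s') = exp(-||s - s'||^2 / (2 sigma^2))."""
    diff = np.asarray(s, float) - np.asarray(s_prime, float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def kernel_value(query, examples, sigma=0.5):
    """v_hat(s) = sum_i k(s, s_i) * v_i over stored (s_i, v_i) pairs."""
    return sum(gaussian_kernel(query, s_i, sigma) * v_i
               for s_i, v_i in examples)

examples = [([0.0, 0.0], 1.0), ([1.0, 1.0], 4.0)]
print(round(kernel_value([0.1, 0.1], examples), 3))
```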

The Kernel Trick

Kernels implicitly define a feature space — potentially infinite-dimensional! The kernel computes the inner product in this feature space without ever explicitly constructing the features. This gives the power of complex feature representations with the simplicity of linear methods.

Section 9.11

Interest and Emphasis

Focusing learning where it matters.

Not all states are equally important!

In some problems, we care more about accuracy in certain states. Interest Iₜ lets us specify how much we care about each state. Emphasis Mₜ carries that importance through time, so we also learn accurate values for the states that the important states' estimates bootstrap from.

Interest and Emphasis

Interest It ∈ [0, 1]: How much do we care about accuracy at state St?

Emphasis Mt: Accumulates interest over time:

M_t = I_t + \gamma M_{t-1}

The update is then weighted by Mt:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha M_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)
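Folded into a linear TD(0) step, the emphasis bookkeeping is just two extra lines. A sketch: `interest` is whatever number in [0, 1] you choose to assign to the current state, and the two example calls use made-up features and rewards:

```python
import numpy as np

def emphatic_td0_step(w, x_s, x_s_next, reward, interest, M_prev,
                      gamma=0.9, alpha=0.1):
    """One emphatic semi-gradient TD(0) step for linear v_hat = w . x."""
    M = interest + gamma * M_prev            # emphasis accumulates interest
    td_error = reward + gamma * np.dot(w, x_s_next) - np.dot(w, x_s)
    w = w + alpha * M * td_error * x_s       # update scaled by emphasis
    return w, M

w, M = np.zeros(2), 0.0
# A state we care about fully (interest 1) followed by one we ignore (0).
w, M = emphatic_td0_step(w, np.array([1., 0.]), np.array([0., 1.]),
                         reward=2.0, interest=1.0, M_prev=M)
w, M = emphatic_td0_step(w, np.array([0., 1.]), np.array([1., 0.]),
                         reward=0.0, interest=0.0, M_prev=M)
print(w, M)   # the second state is still updated a little: emphasis carried over
```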

When to Use Interest

Interest is useful when:

  • Some states are more decision-relevant than others
  • You want to focus learning on a subset of the state space
  • Different parts of the state space have different value to you

Quick Visual Summary: The Big Ideas

📊 Function Approximation: Use a function with learnable weights instead of a huge table
⬇️ Gradient Descent: Roll downhill to find weights that minimize error
🧩 Features Matter: Good features (tile coding, RBF) make learning easy
🧠 Generalization: Learning from one state helps predict values for similar states

The Core Update (What Happens Every Step)

w_new = w_old + α × error × ∇v̂
(new weights = current weights + step size × "how wrong?" × "which way?")

That's it! Every method in this chapter is just a variation of this simple idea.

Why This Actually Matters

This Is How Modern AI Works!

AlphaGo, ChatGPT, self-driving cars — all of them use function approximation. Neural networks (covered in 9.7) are the foundation of modern AI. When you hear "deep learning," it's just fancy function approximation with lots of layers! The gradient descent techniques in this chapter are literally how every major AI system is trained.

Real-World Applications

  • Video Games: DQN (2015) used neural nets to play Atari games from raw pixels — function approximation on 210×160×3 dimensional images!
  • Robotics: Robots learn smooth continuous movements using tile coding and neural nets — no lookup tables possible for infinite joint positions!
  • Recommendation Systems: Netflix, YouTube use learned value functions to predict which video you'll enjoy next.

Skills You'll Use

  • Gradient descent: The same technique trains ChatGPT. Understanding it here helps you understand all of deep learning!
  • Feature engineering: Choosing good representations is still crucial even with neural nets. This skill transfers to any ML job.
  • Convergence intuition: Understanding why some methods work and others diverge helps you debug real ML systems.

Bottom Line: Without function approximation, RL only works on tiny problems with a few hundred states. With function approximation, RL scales to solve problems with millions of states (like Go) or continuous states (like robot control). This chapter is the bridge from toy problems to real-world AI.

Section 9.12

Summary

Key takeaways from Chapter 9.

Quick Reference Glossary

Function Approximation

Representing v̂(s,w) with fewer parameters than states

V̄E (Mean Squared Value Error)

The objective: weighted sum of squared errors

SGD (Stochastic Gradient Descent)

Update w in direction of negative gradient

Semi-gradient

Ignoring gradient through bootstrapped target

Linear Methods

v̂(s,w) = w⊤x(s) — simple, convergent

Tile Coding

Multiple overlapping grids for features

RBF (Radial Basis Functions)

Gaussian bumps as continuous features

LSTD (Least-Squares TD)

Directly compute TD fixed point: w = A⁻¹b

Formula Cheat Sheet

Mean Squared Value Error (Objective):

\overline{VE}(\mathbf{w}) = \sum_{s} \mu(s) [v_\pi(s) - \hat{v}(s, \mathbf{w})]^2

SGD Update (General):

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha [U_t - \hat{v}(S_t, \mathbf{w}_t)] \nabla \hat{v}(S_t, \mathbf{w}_t)

Semi-gradient TD(0) Update:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha [R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})] \nabla \hat{v}(S_t, \mathbf{w})

Linear Value Function:

\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_i w_i x_i(s)

TD Fixed Point Error Bound:

\overline{VE}(\mathbf{w}_{TD}) \leq \frac{1}{1-\gamma} \min_\mathbf{w} \overline{VE}(\mathbf{w})

LSTD Solution:

\mathbf{w} = \hat{\mathbf{A}}^{-1} \hat{\mathbf{b}}

Common Mistakes (Avoid These!)

Wrong: "Semi-gradient methods converge to the global optimum"

Right: Semi-gradient TD converges to the TD fixed point, which is not the minimum of V̄E. It's within 1/(1-γ) of optimal for linear methods.

Wrong: "More features always means better approximation"

Right: More features increases representational power but can lead to overfitting and slower learning. Need balance between expressiveness and generalization.

Wrong: "Neural networks always work better than linear methods"

Right: Neural networks are powerful but have no convergence guarantees with bootstrapping. Linear methods with good features are often competitive and more stable!

What We Learned in This Chapter

To handle large or continuous state spaces, we use function approximation instead of tables. This lets us generalize from limited experience!

  • v̂(s,w) — Value is a function of parameters w
  • ∇v̂ — Gradient tells us how to improve
  • Linear methods — Simple, guaranteed convergence
  • Tile coding/RBFs — Practical feature construction
  • Neural networks — Learn features automatically (Deep RL!)

Function approximation enables RL to scale to large/continuous state spaces

The objective V̄E weights errors by how often states are visited (μ)

SGD updates weights in the direction that reduces error

Semi-gradient methods ignore the gradient through bootstrapped targets

Linear methods are simple, interpretable, and guaranteed to converge (on-policy)

The TD fixed point is within 1/(1-γ) of optimal for linear approximation

Tile coding is a practical and effective feature construction method

Neural networks can learn features automatically but lack convergence guarantees

LSTD directly computes the solution but has O(d²) cost

The Key Insight

Generalization is the power and the peril of approximation. By using fewer parameters than states, we force the function to generalize — learning about one state automatically affects similar states. This lets us handle huge state spaces, but also means we can never get the values exactly right everywhere. The art is in choosing representations that generalize in the right way!

Looking Ahead

This chapter covered on-policy prediction — estimating vπ while following π. Next, we'll extend these ideas to on-policy control (learning optimal policies with approximation), off-policy learning (the deadly triad!), and eligibility traces for more efficient credit assignment.