Chapter 10

On-policy Control with Approximation

Learn to find optimal policies in large state spaces using function approximation. From prediction to control!

From knowing values to taking action!

Chapter 9 taught us to estimate values with function approximation. Now we take the next step: control — actually learning what actions to take! We'll learn semi-gradient Sarsa, tackle the famous Mountain Car problem, and discover why we need a whole new way of thinking (average reward) for continuing tasks.

Chapter topics:

  • Mountain Car: classic control benchmark
  • Semi-gradient Sarsa: control with approximation
  • Average Reward: beyond discounting

The Big Picture: From Prediction to Control

Chapter 9: Prediction

Estimate vπ(s) — the value of states under a fixed policy π.

Input: state → Output: value

Chapter 10: Control

Learn q̂(s, a, w) ≈ q*(s, a) — find the optimal policy!

Input: state + action → Output: value

Key insight: We need action values q(s,a) instead of state values v(s), because we need to compare actions to choose the best one!

Section 10.1

Episodic Semi-gradient Control

Semi-gradient Sarsa for action-value approximation.

Same idea, new target!

Remember semi-gradient TD from Chapter 9? We estimated v(s) by updating towards R + γv̂(S'). For control, we estimate q(s,a) by updating towards R + γq̂(S', A'). This is semi-gradient Sarsa with function approximation — the natural extension to control!

Semi-gradient Sarsa Update

The one-step semi-gradient Sarsa update for action-value approximation:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \big[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)

Term Breakdown:

  • q̂(S, A, w) = approximate action-value for state S, action A
  • Rt+1 = reward received after taking action At
  • At+1 = next action chosen (e.g., ε-greedy from q̂)
  • ∇q̂ = gradient of q̂ with respect to the weights w
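To make the update concrete, here is a minimal Python sketch for the common linear case, where q̂(s, a, w) = w · x(s, a) and the gradient ∇q̂ is simply the feature vector x(s, a). The function name and arguments are illustrative, not from the book:

```python
import numpy as np

def semi_gradient_sarsa_update(w, x_sa, reward, x_next_sa, alpha, gamma, terminal=False):
    """One semi-gradient Sarsa update for a linear approximator.

    With q̂(s, a, w) = w · x(s, a), the gradient ∇q̂ is just the
    feature vector x(s, a), so the update is a scaled feature step.
    """
    q_sa = w @ x_sa
    q_next = 0.0 if terminal else w @ x_next_sa   # no bootstrapping at terminal states
    td_error = reward + gamma * q_next - q_sa
    return w + alpha * td_error * x_sa            # ∇q̂(S, A, w) = x(S, A) for linear q̂
```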

Generalized Policy Iteration (GPI) with Approximation

The same GPI pattern from tabular methods applies here:

Policy Evaluation

Update q̂(s, a, w) towards the target using semi-gradient Sarsa

Policy Improvement

Act ε-greedy with respect to q̂(s, ·, w)

Example: Mountain Car Task

A classic control problem that showcases the power of function approximation:

The Problem:

  • Underpowered car in a valley
  • Goal: reach the top of the hill
  • Can't go directly up — must build momentum!
  • Reward: -1 per step until goal

State & Actions:

  • State: (position, velocity) — continuous!
  • Actions: forward, reverse, coast
  • Uses tile coding for features
  • 8 tilings, linear function approximation

Key insight: The agent must first go away from the goal (up the left slope) to build enough momentum to reach the goal. This requires learning that sometimes things must get worse before they get better!

Episodic Semi-gradient Sarsa Algorithm

Initialize q̂(s, a, w) for all s, a (e.g., w = 0)
Loop for each episode:
    S, A ← initial state and action (ε-greedy)
    Loop for each step:
        Take A, observe R, S'
        If S' is terminal:
            w ← w + α[R - q̂(S,A,w)]∇q̂(S,A,w)
            Go to next episode
        A' ← ε-greedy from q̂(S', ·, w)
        w ← w + α[R + γq̂(S',A',w) - q̂(S,A,w)]∇q̂(S,A,w)
        S ← S', A ← A'
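Below is a rough Python translation of the boxed algorithm for the linear case. It assumes an episodic environment exposing env.reset() and env.step(a) returning (next_state, reward, done), a feature function x(s, a), and a discrete action set; these names describe an assumed interface for the sketch, not a specific library API:

```python
import numpy as np

def epsilon_greedy(w, s, actions, x, epsilon, rng):
    """ε-greedy action selection from the linear action-value q̂(s, a, w) = w · x(s, a)."""
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    q_values = [w @ x(s, a) for a in actions]
    return actions[int(np.argmax(q_values))]

def episodic_semi_gradient_sarsa(env, x, n_features, actions,
                                 alpha=0.1, gamma=1.0, epsilon=0.1,
                                 n_episodes=500, seed=0):
    """Episodic semi-gradient Sarsa with a linear approximator (sketch)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)                       # zero init is optimistic on Mountain Car (true values < 0)
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(w, s, actions, x, epsilon, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:                               # terminal step: target is just R
                w += alpha * (r - w @ x(s, a)) * x(s, a)
                break
            a_next = epsilon_greedy(w, s_next, actions, x, epsilon, rng)
            td_error = r + gamma * (w @ x(s_next, a_next)) - w @ x(s, a)
            w += alpha * td_error * x(s, a)        # ∇q̂ = x(s, a) for linear q̂
            s, a = s_next, a_next
    return w
```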

For the Curious Mind

Optimistic Initialization Helps!

In Mountain Car, all true values are negative (reward is -1 per step). Initializing q̂ to 0 is optimistic — the agent thinks unexplored states are great, which drives exploration even with ε=0! This is like the optimistic initial values trick from Chapter 2.

Continuous Actions?

Mountain Car has discrete actions, but what about continuous actions (like steering angle or throttle)? That's harder — we'd need to maximize q̂(s, a, w) over continuous a. Policy gradient methods (Chapter 13) handle this elegantly!

Section 10.2

Semi-gradient n-step Sarsa

Multi-step bootstrapping for control.

Look further ahead!

Just like n-step TD for prediction, we can use n-step returns for control. Instead of updating towards R + γq̂(S', A'), we use the sum of n rewards plus a bootstrapped estimate. This often learns faster than one-step!

n-step Return for Sarsa

The n-step return for action-value approximation:

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w})

This holds for t + n < T; if t + n ≥ T, we use Gt:t+n = Gt (the full return).

The n-step Semi-gradient Sarsa Update

\mathbf{w}_{t+n} = \mathbf{w}_{t+n-1} + \alpha \big[ G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}) \big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})

This update happens at time t+n for the estimate made at time t. We wait n steps to collect n rewards before updating.
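As a small illustration, here is one way to compute Gt:t+n from a buffer of stored rewards for a linear q̂; the helper function and its arguments are hypothetical:

```python
import numpy as np

def n_step_return(rewards, gamma, w, x_bootstrap=None):
    """n-step return G_{t:t+n} for Sarsa with a linear q̂.

    `rewards` holds [R_{t+1}, ..., R_{t+n}] (fewer if the episode ended early),
    and `x_bootstrap` is the feature vector x(S_{t+n}, A_{t+n}), or None if the
    episode terminated before t + n (in which case G_{t:t+n} = G_t).
    """
    g = sum(gamma**i * r for i, r in enumerate(rewards))
    if x_bootstrap is not None:
        g += gamma**len(rewards) * (w @ x_bootstrap)   # γ^n q̂(S_{t+n}, A_{t+n}, w)
    return g
```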

Performance on Mountain Car

n-step methods often outperform one-step methods:

n = 1: ~200 steps per episode (after learning)

n = 8: ~130 steps per episode (much better!)

As in the tabular case, an intermediate n (like 4 or 8) often works best — balancing bias (from bootstrapping) and variance (from sampling).

For the Curious Mind

Why does n=8 work better than n=1?

In Mountain Car, the reward signal (-1 every step) doesn't distinguish good from bad actions until you reach the goal. With n=1, information about reaching the goal propagates backward only one step per update. With n=8, it propagates back eight steps at a time. But too large an n increases variance.

Section 10.3

Average Reward: A New Problem Setting

Beyond discounting for continuing tasks.

What if there's no end?

For continuing tasks (tasks that go on forever without episodes), the discounted return can be problematic. The average reward setting offers an alternative: instead of maximizing discounted sum, we maximize the average reward per time step. No discounting needed!

Average Reward

The quality of a policy π is defined as its average reward rate:

r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}[R_t \mid S_0, A_{0:t-1} \sim \pi]

Term Breakdown:

  • r(π) = average reward per step when following policy π
  • h = time horizon (taken to infinity)
  • (1/h) Σ = the average over all time steps up to h

Goal: Find π that maximizes r(π). All policies with maximal r(π) are considered optimal.

Differential Returns

In the average reward setting, we define returns relative to the average:

G_t = (R_{t+1} - r(\pi)) + (R_{t+2} - r(\pi)) + (R_{t+3} - r(\pi)) + \cdots

This is the differential return — rewards measured relative to what we expect on average. Positive means better than average; negative means worse.

Differential Value Functions

Differential value functions use differential returns:

vπ(s) = Eπ[Gt | St = s] = expected differential return from state s

qπ(s, a) = Eπ[Gt | St = s, At = a] = expected differential return from (s, a)

These tell us how much better (or worse) a state or action is compared to average.

Differential TD Error

The TD error for action values in the average reward setting:

\delta_t = R_{t+1} - \bar{R} + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w})

Here R̄ is our estimate of the average reward r(π). Notice: no discount factor γ! In the average reward setting, we care equally about all future rewards.
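A minimal sketch of one differential Sarsa step for a linear q̂, updating both the weights and the average-reward estimate R̄ from the same TD error (names are illustrative):

```python
import numpy as np

def differential_sarsa_step(w, r_bar, x_sa, reward, x_next_sa, alpha, beta):
    """One differential semi-gradient Sarsa update (linear q̂, no discounting).

    Returns the updated weights and average-reward estimate R̄.
    """
    delta = reward - r_bar + w @ x_next_sa - w @ x_sa   # differential TD error
    r_bar = r_bar + beta * delta                        # track the average reward
    w = w + alpha * delta * x_sa                        # ∇q̂ = x(S, A) for linear q̂
    return w, r_bar
```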

Example: Access Control Queuing

A continuing task that illustrates the average reward setting:

  • 10 servers that can handle customer requests
  • 4 priority levels of customers (pay 1, 2, 4, or 8)
  • Decision: Accept or reject each customer?
  • Servers become free randomly (p = 0.06 per step)
  • Goal: Maximize long-term average reward

The learned policy: reject low-priority customers when few servers are free, saving capacity for high-priority customers!
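For concreteness, here is a rough sketch of the task's step dynamics using the parameters listed above; the class name, the interface, and the detail that each busy server frees up independently per step are modeling assumptions made for illustration:

```python
import numpy as np

class AccessControlQueue:
    """Sketch of the access-control queuing task (continuing, no terminal state)."""

    def __init__(self, n_servers=10, priorities=(1, 2, 4, 8), free_prob=0.06, seed=0):
        self.n_servers = n_servers
        self.priorities = priorities
        self.free_prob = free_prob
        self.rng = np.random.default_rng(seed)
        self.free_servers = n_servers
        self.customer = self.rng.choice(self.priorities)   # priority of customer at head of queue

    def step(self, accept):
        """Accept (True) or reject (False) the customer at the head of the queue."""
        reward = 0
        if accept and self.free_servers > 0:
            reward = self.customer                 # payment equals the customer's priority
            self.free_servers -= 1
        # each busy server becomes free independently with probability free_prob
        busy = self.n_servers - self.free_servers
        self.free_servers += self.rng.binomial(busy, self.free_prob)
        self.customer = self.rng.choice(self.priorities)
        return (self.free_servers, self.customer), reward  # state = (free servers, next priority)
```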

For the Curious Mind

Ergodic MDPs

The average reward setting works for ergodic MDPs — where the long-run state distribution doesn't depend on the starting state. In ergodic MDPs, early decisions have only temporary effects. Most practical continuing tasks satisfy this property.

Section 10.4

Deprecating the Discounted Setting

Why discounting fails with approximation.

A surprising result!

Here's a shocking fact: for continuing tasks with function approximation, the discount rate γ doesn't actually matter for ranking policies! Whether γ = 0.9 or γ = 0.99 or γ = 0 (!) — you'd end up preferring the same policies. The average of discounted returns is always proportional to the average reward.

The Futility of Discounting

For any policy π, the average of discounted returns equals:

J(\pi) = \sum_s \mu_\pi(s)\, v_\pi^\gamma(s) = \frac{r(\pi)}{1 - \gamma}

This means the discount rate γ doesn't affect the ordering of policies! All values of γ give the same ranking — the one based on average reward r(π).

Intuition: Why Does This Happen?

In a continuing task with no start or end:

  1. Every time step is the same as every other (by symmetry)
  2. Every reward appears in every position in some discounted return
  3. Total weight on any reward: 1 + γ + γ² + ... = 1/(1-γ)
  4. So average discounted return = (1/(1-γ)) × average reward (see the short derivation below)
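For readers who want the algebra, the same argument can be written out as a short derivation, assuming μπ is the steady-state distribution under π and using the Bellman equation for vπγ:

\begin{aligned}
J(\pi) &= \sum_s \mu_\pi(s)\, v_\pi^\gamma(s)
        = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma\, v_\pi^\gamma(s') \big] \\
       &= r(\pi) + \gamma \sum_{s'} \mu_\pi(s')\, v_\pi^\gamma(s')
        = r(\pi) + \gamma\, J(\pi)
\quad\Longrightarrow\quad
J(\pi) = \frac{r(\pi)}{1 - \gamma}
\end{aligned}

The regrouping of the γ terms uses the fact that μπ is stationary under π, which is exactly where the ergodicity assumption enters.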

Implication: For continuing control with approximation, γ becomes a solution method parameter (affecting learning dynamics), not a problem parameter (defining the objective).

The Deeper Problem: Loss of Policy Improvement

With function approximation, we've lost something fundamental:

The Policy Improvement Theorem No Longer Holds!

In tabular methods, improving the value of one state guarantees overall improvement. With approximation, improving one state might hurt another (they share parameters!). We can no longer guarantee that greedy policy improvement helps.

This is a fundamental challenge. Policy gradient methods (Chapter 13) offer one solution with their own "policy gradient theorem."

For the Curious Mind

Then why do we still use γ in practice?

Great question! Even though γ doesn't define the objective, it affects learning dynamics. Higher γ means longer-horizon credit assignment, which can be harder to learn but captures more long-term dependencies. Lower γ learns faster but may miss long-term consequences. It's now a hyperparameter to tune!

Section 10.5

Differential Semi-gradient n-step Sarsa

Putting it all together for continuing tasks.

The complete algorithm!

Now we combine everything: n-step returns + average reward + differential values. This gives us a principled algorithm for continuing control with function approximation that doesn't rely on the problematic discounted setting.

Differential n-step Return

The n-step differential return (with function approximation):

G_{t:t+n} = (R_{t+1} - \bar{R}) + \cdots + (R_{t+n} - \bar{R}) + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w})

Where R̄ is our current estimate of the average reward r(π). Notice: no γ terms — all rewards weighted equally!

Differential Semi-gradient n-step Sarsa

Initialize w, R̄ = 0
Initialize S₀, A₀
Loop for each step t = 0, 1, 2, ...:
    Take At, observe Rt+1, St+1
    Select At+1 (ε-greedy from q̂)
    τ ← t - n + 1
    If τ ≥ 0:
        δ ← Σi=τ+1..τ+n (Ri - R̄) + q̂(Sτ+n, Aτ+n, w) - q̂(Sτ, Aτ, w)
        R̄ ← R̄ + βδ
        w ← w + αδ∇q̂(Sτ, Aτ, w)

Key addition: We update R̄ along with w. The step size β for R̄ should be small for stable average reward estimation.
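Putting the pieces together, here is a compact Python sketch of the boxed algorithm for a continuing task with a linear q̂. It assumes env.reset() -> state and env.step(a) -> (next_state, reward) with no terminal states; those names, and the unbounded history lists (a circular buffer of length n+1 would suffice), are simplifications for the sketch:

```python
import numpy as np

def differential_n_step_sarsa(env, x, n_features, actions, n=8,
                              alpha=0.1, beta=0.01, epsilon=0.1,
                              n_steps=100_000, seed=0):
    """Differential semi-gradient n-step Sarsa for a continuing task (sketch)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)
    r_bar = 0.0                                    # average-reward estimate R̄

    def greedy(s):
        """ε-greedy action selection from the current linear q̂."""
        if rng.random() < epsilon:
            return actions[rng.integers(len(actions))]
        return actions[int(np.argmax([w @ x(s, a) for a in actions]))]

    states, acts, rewards = [env.reset()], [], [0.0]   # rewards[0] is a placeholder so rewards[i] = R_i
    acts.append(greedy(states[0]))
    for t in range(n_steps):
        s_next, r = env.step(acts[t])              # take A_t, observe R_{t+1}, S_{t+1}
        states.append(s_next)
        rewards.append(r)
        acts.append(greedy(s_next))                # select A_{t+1}
        tau = t - n + 1                            # time whose estimate is updated
        if tau >= 0:
            delta = sum(rewards[i] - r_bar for i in range(tau + 1, tau + n + 1))
            delta += w @ x(states[tau + n], acts[tau + n]) - w @ x(states[tau], acts[tau])
            r_bar += beta * delta                  # update the average-reward estimate
            w += alpha * delta * x(states[tau], acts[tau])   # ∇q̂ = x for linear q̂
    return w, r_bar
```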

Two Step Sizes: α and β

α (for weights w)

Controls how fast we update action-value estimates. Similar to usual learning rate.

β (for average R̄)

Controls how fast we update average reward estimate. Usually small for stability.

Section 10.6

Summary

Key takeaways from Chapter 10.

Quick Reference Glossary

Semi-gradient Sarsa

TD control with function approximation

Action-value Approximation

q̂(s, a, w) ≈ q*(s, a)

Average Reward r(π)

Expected reward per step under policy π

Differential Return

Sum of (R - r(π)) — rewards relative to average

Differential Value

Expected differential return — how much better than average

Ergodic MDP

Long-run distribution independent of start state

Formula Cheat Sheet

Semi-gradient Sarsa Update:

\mathbf{w} \leftarrow \mathbf{w} + \alpha \big[ R + \gamma \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w}) \big] \nabla \hat{q}(S, A, \mathbf{w})

n-step Return:

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w})

Average Reward:

r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}[R_t]

Differential TD Error:

\delta_t = R_{t+1} - \bar{R} + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w})

Futility of Discounting:

\sum_s \mu_\pi(s)\, v_\pi^\gamma(s) = \frac{r(\pi)}{1 - \gamma}

Common Mistakes (Avoid These!)

Wrong: "Higher γ always means better long-term planning"

Right: In continuing tasks with approximation, γ doesn't affect which policy is optimal! It only affects learning dynamics. Higher γ may make learning slower and harder.

Wrong: "Semi-gradient methods are true gradient methods"

Right: They're called "semi"-gradient because we ignore the gradient through the bootstrapped target. This is an approximation that works well in practice but has different convergence properties.

Wrong: "Policy improvement always improves with function approximation"

Right: The policy improvement theorem doesn't hold with function approximation! Improving one state might hurt others. This is a fundamental limitation addressed by policy gradient methods.

What We Learned in This Chapter

We extended function approximation from prediction to control — learning what actions to take in large or continuous state spaces!

  • q̂(s, a, w) — Approximate action values for control
  • Semi-gradient Sarsa — TD control with function approximation
  • Mountain Car — Classic benchmark solved with tile coding
  • Average reward — Better objective for continuing tasks
  • Differential values — Rewards relative to the average

Semi-gradient Sarsa extends TD control to function approximation

For control we need action values q(s,a), not just state values v(s)

n-step methods often outperform one-step (n=8 beats n=1 on Mountain Car)

For continuing tasks, discounting doesn't affect the optimal policy ranking

Average reward r(π) provides a meaningful objective for continuing tasks

Differential returns measure rewards relative to the average

The policy improvement theorem is lost with function approximation

γ becomes a solution method parameter, not a problem parameter

The Key Insight

Control with approximation requires rethinking our foundations. The jump from tabular to approximate methods isn't just a technical change — it forces us to abandon discounting for continuing tasks and accept that we've lost the policy improvement theorem. Despite these challenges, semi-gradient methods work remarkably well in practice, forming the backbone of modern deep RL!

Looking Ahead

This chapter covered on-policy control — learning while following the policy we're improving. But what if we want to learn from data generated by a different policy? That's off-policy learning, which brings its own challenges (the "deadly triad"). Chapter 11 tackles this important and tricky topic!