Chapter 13

Policy Gradient Methods

Learning Parameterized Policies Directly

Until now, we've learned value functions and derived policies from them. Policy gradient methods take a fundamentally different approach: they learn the policy directly. This makes it possible to handle continuous actions and to learn stochastic policies, and it often leads to smoother, more stable learning.

Section 13.1

Policy Approximation and its Advantages

Instead of learning action values and deriving a policy (like ε-greedy), we can directly learn a parameterized policy π(a|s,θ) that maps states to action probabilities. The parameter vector θ is what we learn.

Policy Gradient Methods

Methods that learn a parameterized policy π(a|s,θ) by adjusting θ in the direction of the gradient of some performance measure J(θ). The policy can select actions without consulting a value function, though value functions are often used to help learn the policy.

The Basic Update

Policy gradient methods perform gradient ascent on the performance measure:

\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}

where the estimate approximates the true gradient of performance with respect to θ.

Soft-max Policy Parameterization

For discrete actions, a common approach is to learn action preferences h(s,a,θ) and convert them to probabilities using a soft-max:

\pi(a|s,\theta) = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}

The preferences can be linear: h(s,a,θ) = θ⊤x(s,a), or computed by a neural network.
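
To make this concrete, here is a minimal NumPy sketch of a soft-max policy with linear preferences. The feature function x(s, a), the action set, and the parameter values are made up for illustration.

import numpy as np

def softmax_policy(theta, x, s, actions):
    """Action probabilities pi(a|s,theta) from linear preferences h(s,a,theta) = theta . x(s,a)."""
    h = np.array([theta @ x(s, a) for a in actions])   # action preferences
    h -= h.max()                                       # subtract max for numerical stability
    e = np.exp(h)
    return e / e.sum()

# Hypothetical usage: 2 actions, 3-dimensional features (all made up for illustration)
actions = [0, 1]
x = lambda s, a: np.array([1.0, s, float(a == 1)])     # feature vector x(s, a)
theta = np.zeros(3)
probs = softmax_policy(theta, x, s=0.5, actions=actions)
a = np.random.choice(actions, p=probs)                 # sample an action from pi(.|s, theta)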

Advantages Over Value-Based Methods

Approach Determinism

Soft-max policies can approach deterministic behavior, unlike ε-greedy, which always selects a random action with probability ε.

Stochastic Optimal Policies

Can learn optimal stochastic policies (important in games of imperfect information, such as poker, where bluffing requires randomness).

Simpler Policy

Sometimes the policy is a simpler function to approximate than the value function. In those cases, policy-based methods learn faster.

Continuous Actions

Naturally handle continuous action spaces by learning parameters of a distribution (like mean and variance of a Gaussian).

Short Corridor Example

Consider a short corridor in which all states look identical under function approximation. An ε-greedy policy can only be near-deterministic: mostly go right or mostly go left. But the optimal policy might be "go right with probability 0.59." Only a method that learns action probabilities directly, such as a policy gradient method, can represent and learn this specific probability.

Section 13.2

The Policy Gradient Theorem

A fundamental challenge: performance depends on both action selections AND the state distribution, both of which change when we update θ. How can we compute the gradient when the state distribution is unknown?

The Policy Gradient Theorem

The policy gradient theorem provides an analytical expression for ∇J(θ) that doesn't require the derivative of the state distribution. This is the theoretical foundation for all policy gradient methods.

Performance Measure

For episodic tasks, performance is the value of the start state:

J(\theta) = v_{\pi_\theta}(s_0)

The Theorem

The policy gradient theorem establishes:

\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s,a) \nabla \pi(a|s,\theta)

Breaking Down the Theorem

  • ∇J(θ): the gradient of performance—what we want to estimate
  • μ(s): the on-policy state distribution—how often we visit each state
  • q_π(s,a): the action value—how good action a is in state s
  • ∇π(a|s,θ): the policy gradient—how to change θ to make action a more likely
Why This is Remarkable

The gradient involves the state distribution μ(s), but we don't need to know its derivative! If we follow the policy π, we naturally visit states according to μ. So we can sample from this expression without ever computing the distribution explicitly.

Smooth Updates

Policy gradient methods have better convergence properties than value-based methods because:

  • Action probabilities change smoothly as θ changes
  • No sudden policy changes from small value estimate changes
  • Guaranteed improvement for small enough step sizes
Section 13.3

REINFORCE: Monte Carlo Policy Gradient

REINFORCE is the simplest policy gradient algorithm. It uses complete episode returns to estimate the gradient, making it a Monte Carlo method for policy learning.

REINFORCE

A Monte Carlo policy gradient algorithm that updates the policy parameter after each episode using the actual returns. Named by Williams (1992), it stands for "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility."

Deriving the Update

Starting from the policy gradient theorem, we can derive a sample-based update:

\nabla J(\theta) \propto \mathbb{E}_\pi \left[ G_t \frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)} \right]

This leads to the REINFORCE update:

\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla \pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)}

Understanding the Update

  • G_t: the return from time t—tells us how good this trajectory was
  • ∇π/π: the eligibility vector ∇ln π—the direction to increase the probability of the action taken

Actions that led to high returns get reinforced (made more likely). Actions that led to low returns get discouraged.

The Eligibility Vector

The fraction ∇π(a|s,θ)/π(a|s,θ) is called the eligibility vector. It equals ∇ln π(a|s,θ). For soft-max policies with linear preferences:

\nabla \ln \pi(a|s,\theta) = x(s,a) - \sum_b \pi(b|s,\theta)\, x(s,b)

This is the feature vector of the action taken minus the policy-weighted average of the feature vectors over all actions.
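
A small, self-contained sketch of this eligibility vector for the linear soft-max case (the feature function and parameter values are hypothetical):

import numpy as np

def grad_ln_pi_softmax(theta, x, s, a, actions):
    """Eligibility vector for a soft-max policy with linear preferences:
    x(s,a) minus the policy-weighted average of x(s,b) over all actions b."""
    h = np.array([theta @ x(s, b) for b in actions])
    h -= h.max()                                       # numerical stability
    p = np.exp(h) / np.exp(h).sum()
    avg_x = sum(pb * x(s, b) for pb, b in zip(p, actions))
    return x(s, a) - avg_x

# Hypothetical features and parameters, for illustration only
x = lambda s, a: np.array([1.0, s, float(a == 1)])
theta = np.zeros(3)
g = grad_ln_pi_softmax(theta, x, s=0.5, a=1, actions=[0, 1])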

REINFORCE Algorithm

Algorithm: REINFORCE (Monte Carlo Policy Gradient)

Input: differentiable policy parameterization π(a|s,θ)
Parameters: step size α > 0
Initialize: θ arbitrarily

Loop forever (for each episode):
    Generate episode S₀, A₀, R₁, ..., S_{T-1}, A_{T-1}, R_T
    following π(·|·,θ)

    Loop for each step t = 0, 1, ..., T-1:
        G ← Σ_{k=t+1}^T γ^{k-t-1} R_k    # Return from step t
        θ ← θ + α γ^t G ∇ln π(A_t|S_t,θ)
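
The pseudocode above maps almost line-for-line onto the following Python sketch. It assumes a linear soft-max policy and a hypothetical env object with reset() and step(a) -> (next_state, reward, done) methods; the environment interface, feature function x, and step size are assumptions, not part of the algorithm itself.

import numpy as np

def reinforce(env, x, n_actions, n_features, alpha=2**-12, gamma=1.0, n_episodes=1000):
    """REINFORCE with a linear soft-max policy (sketch)."""
    theta = np.zeros(n_features)

    def pi(s):
        h = np.array([theta @ x(s, a) for a in range(n_actions)])
        h -= h.max()                               # numerical stability
        e = np.exp(h)
        return e / e.sum()

    def grad_ln_pi(s, a):
        p = pi(s)
        return x(s, a) - sum(p[b] * x(s, b) for b in range(n_actions))

    for _ in range(n_episodes):
        # Generate an episode S0, A0, R1, ... following pi(.|., theta)
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(n_actions, p=pi(s))
            s_next, r, done = env.step(a)          # hypothetical env interface
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next

        # Compute returns G_t backwards, then update theta for each step t
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for t, (s, a) in enumerate(zip(states, actions)):
            theta += alpha * gamma**t * returns[t] * grad_ln_pi(s, a)

    return theta
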
High Variance

REINFORCE has good theoretical properties but high variance because it uses complete returns. This can lead to slow learning. The baseline technique (next section) addresses this issue.

Section 13.4

REINFORCE with Baseline

We can reduce REINFORCE's variance by subtracting a baseline from the return. As long as the baseline doesn't depend on the action, it doesn't change the expected update but can dramatically reduce variance.

Baseline

A function b(s) subtracted from the return to reduce variance. The most common choice is the estimated state value v̂(s,w), which centers the returns around zero on average.

Why Baselines Work

Adding a baseline b(s) doesn't change the expected gradient:

\sum_a b(s) \nabla \pi(a|s,\theta) = b(s) \nabla \sum_a \pi(a|s,\theta) = b(s) \nabla 1 = 0

Since action probabilities sum to 1, the gradient of their sum is zero. So adding any state-dependent baseline doesn't bias the update.
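
This zero-bias property is easy to verify numerically. The sketch below samples actions from an arbitrary three-action soft-max policy with tabular preferences (the numbers are made up for the check) and confirms that the average of b(s) ∇ln π(A|s,θ) is approximately zero.

import numpy as np

rng = np.random.default_rng(0)

# An arbitrary 3-action soft-max policy at one state, with tabular preferences h(a) = theta[a]
theta = rng.normal(size=3)
pi = np.exp(theta - theta.max())
pi /= pi.sum()

def grad_ln_pi(a):
    g = -pi.copy()          # grad of ln pi(a) w.r.t. the preferences: one-hot(a) - pi
    g[a] += 1.0
    return g

b = 7.3                     # any baseline that does not depend on the action
samples = [b * grad_ln_pi(rng.choice(3, p=pi)) for _ in range(100_000)]
print(np.mean(samples, axis=0))   # approximately [0, 0, 0]: the baseline adds no bias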

REINFORCE with Baseline Update

\theta_{t+1} = \theta_t + \alpha \left(G_t - b(S_t)\right) \frac{\nabla \pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)}

Intuition

  • If G_t > b(S_t): action was better than average → increase probability
  • If G_t < b(S_t): action was worse than average → decrease probability
  • The baseline tells us what return to "expect" from each state

Complete Algorithm

Algorithm: REINFORCE with Baseline

Input: differentiable policy π(a|s,θ)
Input: differentiable state-value function v̂(s,w)
Parameters: step sizes α^θ > 0, α^w > 0
Initialize: θ, w arbitrarily

Loop forever (for each episode):
    Generate episode following π(·|·,θ)

    Loop for each step t = 0, 1, ..., T-1:
        G ← Σ_{k=t+1}^T γ^{k-t-1} R_k
        δ ← G - v̂(S_t,w)                    # Advantage estimate
        w ← w + α^w δ ∇v̂(S_t,w)             # Update critic
        θ ← θ + α^θ γ^t δ ∇ln π(A_t|S_t,θ)  # Update actor
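
As a sketch, the per-episode update might look as follows under linear function approximation. The helper callables x_v (baseline features) and grad_ln_pi (eligibility vector), the episode format, and the step sizes are hypothetical, not part of the pseudocode above.

import numpy as np

def reinforce_with_baseline_episode(theta, w, episode, x_v, grad_ln_pi,
                                    alpha_theta=2**-9, alpha_w=2**-6, gamma=1.0):
    """Apply one episode of REINFORCE-with-baseline updates (sketch).

    episode       list of (S_t, A_t, R_{t+1}) tuples
    x_v(s)        features of a linear baseline v_hat(s, w) = w . x_v(s)   (hypothetical)
    grad_ln_pi    callable (theta, s, a) -> eligibility vector             (hypothetical)
    """
    T = len(episode)
    G, returns = 0.0, np.zeros(T)
    for t in reversed(range(T)):                # returns G_t computed backwards
        G = episode[t][2] + gamma * G
        returns[t] = G

    for t, (s, a, _) in enumerate(episode):
        delta = returns[t] - w @ x_v(s)                        # advantage estimate
        w = w + alpha_w * delta * x_v(s)                       # critic update (grad of linear v_hat)
        theta = theta + alpha_theta * gamma**t * delta * grad_ln_pi(theta, s, a)   # actor update
    return theta, w
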
Variance Reduction

Adding a baseline can dramatically speed up learning; in practice, REINFORCE with baseline often learns several times faster than plain REINFORCE. The state value is a natural, near-optimal baseline for reducing variance, but other baselines can also work well.

Section 13.5

Actor-Critic Methods

REINFORCE uses the full return, which requires waiting until the end of the episode. Actor-critic methods use TD learning to update on every step, dramatically reducing variance at the cost of some bias.

Actor-Critic

A policy gradient method with two components: the actor (the learned policy π(a|s,θ)) and the critic (the learned value function v̂(s,w) that evaluates the actor's choices). The critic bootstraps to provide lower-variance updates.

One-Step Actor-Critic

Instead of the full return G_t, use the one-step TD target:

\theta_{t+1} = \theta_t + \alpha \delta_t \frac{\nabla \pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)}

where the TD error is:

\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1},w) - \hat{v}(S_t,w)

Why This Works

The TD error δ_t estimates the advantage of action A_t:

  • δ_t > 0: action was better than expected → reinforce
  • δ_t < 0: action was worse than expected → discourage

One-Step Actor-Critic Algorithm

Algorithm: One-Step Actor-Critic

Input: differentiable policy π(a|s,θ) and value function v̂(s,w)
Parameters: step sizes α^θ > 0, α^w > 0
Initialize: θ, w arbitrarily

Loop forever (for each episode):
    Initialize S, I ← 1

    Loop while S is not terminal:
        A ~ π(·|S,θ)
        Take action A, observe R, S'

        δ ← R + γv̂(S',w) - v̂(S,w)   # TD error
        w ← w + α^w δ ∇v̂(S,w)        # Update critic
        θ ← θ + α^θ I δ ∇ln π(A|S,θ) # Update actor

        I ← γI
        S ← S'
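
A Python sketch of this algorithm with a linear soft-max actor and a linear critic. The gym-style env interface, the feature helpers x_pi and x_v, and the hyperparameters are assumptions made for illustration; v̂ of a terminal state is taken to be 0.

import numpy as np

def one_step_actor_critic(env, x_pi, x_v, n_actions, n_theta, n_w,
                          alpha_theta=2**-9, alpha_w=2**-6, gamma=0.99, n_episodes=500):
    """One-step actor-critic with a linear soft-max actor and a linear critic (sketch)."""
    theta, w = np.zeros(n_theta), np.zeros(n_w)

    def pi(s):
        h = np.array([theta @ x_pi(s, a) for a in range(n_actions)])
        h -= h.max()
        e = np.exp(h)
        return e / e.sum()

    for _ in range(n_episodes):
        s, done, I = env.reset(), False, 1.0
        while not done:
            p = pi(s)
            a = np.random.choice(n_actions, p=p)
            s_next, r, done = env.step(a)               # hypothetical env interface

            v_next = 0.0 if done else w @ x_v(s_next)   # v_hat of a terminal state is 0
            delta = r + gamma * v_next - w @ x_v(s)     # TD error
            w += alpha_w * delta * x_v(s)               # critic update (grad of linear v_hat)
            elig = x_pi(s, a) - sum(p[b] * x_pi(s, b) for b in range(n_actions))
            theta += alpha_theta * I * delta * elig     # actor update

            I *= gamma
            s = s_next

    return theta, w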

Actor-Critic with Eligibility Traces

We can add eligibility traces to both actor and critic for n-step or λ-return methods:

Actor-Critic with Traces

Eligibility trace updates:
    z^w ← γλ^w z^w + ∇v̂(S,w)           # Critic trace
    z^θ ← γλ^θ z^θ + I ∇ln π(A|S,θ)    # Actor trace

Weight updates:
    w ← w + α^w δ z^w                   # Update critic
    θ ← θ + α^θ δ z^θ                   # Update actor

Actor

  • The learned policy π(a|s,θ)
  • Decides which actions to take
  • Updated toward better actions

Critic

  • The learned value function v̂(s,w)
  • Evaluates states and actions
  • Provides TD error for actor updates
Section 13.6

Policy Gradient for Continuing Problems

For continuing problems without episode boundaries, we define performance as the average reward rate rather than the start state value.

Average Reward Setting

Performance is the average reward per time step:

J(\theta) = r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}[R_t \mid S_0, A_{0:t-1} \sim \pi]

This equals the expected immediate reward under the stationary distribution μ.

Differential Returns

Values are defined with respect to differential returns (rewards minus the average):

G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots
Policy Gradient Theorem Still Holds

Remarkably, the policy gradient theorem remains the same in the continuing case. The only changes are: (1) performance is average reward, and (2) values use differential returns. The gradient formula is unchanged.

Continuing Actor-Critic

Algorithm: Actor-Critic for Continuing Tasks

Initialize: R̄ ← 0, w, θ, z^w ← 0, z^θ ← 0

Loop forever (for each time step):
    A ~ π(·|S,θ)
    Take action A, observe R, S'

    δ ← R - R̄ + v̂(S',w) - v̂(S,w)    # Differential TD error
    R̄ ← R̄ + α^R̄ δ                    # Update average reward

    z^w ← λ^w z^w + ∇v̂(S,w)
    z^θ ← λ^θ z^θ + ∇ln π(A|S,θ)

    w ← w + α^w δ z^w
    θ ← θ + α^θ δ z^θ

    S ← S'
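
A sketch of a single time step of this algorithm under linear function approximation; the helpers x_v and grad_ln_pi and all step-size values are hypothetical, as in the earlier sketches.

import numpy as np

def continuing_actor_critic_step(S, A, R, S_next, theta, w, R_bar, z_theta, z_w,
                                 x_v, grad_ln_pi, alpha_theta, alpha_w, alpha_rbar,
                                 lam_theta=0.8, lam_w=0.8):
    """One time step of the continuing (average-reward) actor-critic with traces (sketch)."""
    delta = R - R_bar + w @ x_v(S_next) - w @ x_v(S)         # differential TD error
    R_bar = R_bar + alpha_rbar * delta                       # update average-reward estimate

    z_w = lam_w * z_w + x_v(S)                               # critic trace (grad of linear v_hat)
    z_theta = lam_theta * z_theta + grad_ln_pi(theta, S, A)  # actor trace

    w = w + alpha_w * delta * z_w
    theta = theta + alpha_theta * delta * z_theta
    return theta, w, R_bar, z_theta, z_w
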
Section 13.7

Policy Parameterization for Continuous Actions

One of the biggest advantages of policy gradient methods is their natural handling of continuous action spaces. Instead of learning probabilities for each action, we learn the parameters of a probability distribution.

Gaussian Policy

For continuous actions, a common approach is to output a Gaussian (normal) distribution. The policy network outputs the mean μ(s,θ) and standard deviation σ(s,θ), and actions are sampled from N(μ, σ²).

Gaussian Policy

The policy is defined as a normal distribution over actions:

\pi(a|s,\theta) = \frac{1}{\sigma(s,\theta)\sqrt{2\pi}} \exp\!\left(-\frac{(a - \mu(s,\theta))^2}{2\sigma(s,\theta)^2}\right)

Gaussian Policy Components

  • μ(s,θ): the mean of the distribution—the "preferred" action
  • σ(s,θ): the standard deviation—the amount of exploration (usually σ = exp(θ_σ⊤x_σ(s)))

Linear Parameterization

A simple linear form:

\mu(s,\theta) = \theta_\mu^\top x_\mu(s) \quad \text{and} \quad \sigma(s,\theta) = \exp\!\left(\theta_\sigma^\top x_\sigma(s)\right)

The exponential ensures σ is always positive.

Eligibility Vectors for Gaussian Policy

The eligibility vectors (∇ln π) for the Gaussian policy:

\nabla_{\theta_\mu} \ln \pi(a|s,\theta) = \frac{a - \mu(s,\theta)}{\sigma(s,\theta)^2} \, x_\mu(s)

\nabla_{\theta_\sigma} \ln \pi(a|s,\theta) = \left(\frac{(a - \mu(s,\theta))^2}{\sigma(s,\theta)^2} - 1\right) x_\sigma(s)
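
The following sketch samples an action from such a linearly parameterized Gaussian policy and computes both eligibility vectors; the feature functions x_mu and x_sigma and the parameter values are made up for illustration.

import numpy as np

def gaussian_policy_step(theta_mu, theta_sigma, x_mu, x_sigma, s, rng):
    """Sample an action from a linear Gaussian policy and return the eligibility
    vectors for the mean and standard-deviation parameters (sketch)."""
    mu = theta_mu @ x_mu(s)
    sigma = np.exp(theta_sigma @ x_sigma(s))                  # exp keeps sigma positive
    a = rng.normal(mu, sigma)                                 # a ~ N(mu, sigma^2)

    grad_mu = (a - mu) / sigma**2 * x_mu(s)                   # grad w.r.t. theta_mu of ln pi
    grad_sigma = ((a - mu)**2 / sigma**2 - 1.0) * x_sigma(s)  # grad w.r.t. theta_sigma of ln pi
    return a, grad_mu, grad_sigma

# Hypothetical usage with two-dimensional features (made up for illustration)
rng = np.random.default_rng(0)
x_mu = x_sigma = lambda s: np.array([1.0, s])
a, g_mu, g_sigma = gaussian_policy_step(np.zeros(2), np.zeros(2), x_mu, x_sigma, s=0.3, rng=rng)
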
Applications

Gaussian policies are used extensively in robotics and control. Examples include: robotic arm control, autonomous driving, game playing with continuous controls, and any domain where actions are real-valued (torques, velocities, positions).

Beyond Gaussians

While Gaussian policies are common, other distributions can be used depending on the action space constraints:

  • Beta distribution: For bounded actions [0, 1]
  • Categorical: For discrete actions (soft-max)
  • Mixture models: For multi-modal action distributions
  • Normalizing flows: For complex distributions (deep RL)
Section 13.8

Summary

Policy gradient methods represent a fundamental shift from value-based to policy-based learning. They offer unique advantages and have become central to modern reinforcement learning, especially in deep RL.

Key Takeaways

  • Policy gradient methods learn a parameterized policy directly
  • The policy gradient theorem enables gradient estimation without knowing state distribution derivatives
  • REINFORCE is a simple Monte Carlo policy gradient with high variance
  • Baselines reduce variance without introducing bias
  • Actor-critic methods combine policy gradients with TD learning
  • Continuous action spaces are naturally handled via Gaussian policies

Method Comparison

Value-Based (Q-learning, etc.)

  • Learn Q(s,a), then derive the policy
  • Typically limited to discrete actions
  • Small value changes can cause abrupt policy changes
  • Off-policy learning is possible

Policy Gradient

  • Learn π(a|s,θ) directly
  • Handle continuous actions naturally
  • Smooth policy updates
  • Stronger convergence guarantees

Glossary

Policy Gradient
The gradient of performance J(θ) with respect to policy parameters θ. Used to update the policy in the direction of improvement.
REINFORCE
A Monte Carlo policy gradient algorithm that uses complete episode returns. Simple but high variance.
Baseline
A function subtracted from the return to reduce variance without changing the expected gradient. Often the state value function.
Actor
The learned policy π(a|s,θ) in actor-critic methods. Makes action decisions.
Critic
The learned value function v̂(s,w) in actor-critic methods. Evaluates states to guide actor updates.
Eligibility Vector
The vector ∇ln π(a|s,θ), indicating the direction to change θ to make action a more likely in state s.
Soft-max Policy
A policy that converts action preferences to probabilities using the soft-max function. Common for discrete actions.
Gaussian Policy
A policy that outputs mean and standard deviation of a normal distribution for continuous actions.

Formula Cheat Sheet

Core Formulas

Policy Gradient Theorem:

\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s,a) \nabla \pi(a|s,\theta)

REINFORCE Update:

\theta \leftarrow \theta + \alpha G_t \nabla \ln \pi(A_t|S_t,\theta)

REINFORCE with Baseline:

\theta \leftarrow \theta + \alpha \left(G_t - b(S_t)\right) \nabla \ln \pi(A_t|S_t,\theta)

Actor-Critic Update:

\theta \leftarrow \theta + \alpha \delta_t \nabla \ln \pi(A_t|S_t,\theta)

Soft-max Policy:

\pi(a|s,\theta) = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}

Common Mistakes

Not using a baseline with REINFORCE

Plain REINFORCE has very high variance. Always use at least a simple baseline (like the average return) to speed up learning significantly.

Using too large a step size

Policy gradient methods can be unstable with large step sizes. Gradient ascent on J(θ) guarantees improvement only for sufficiently small α.

Forgetting the γ^t factor in episodic tasks

For discounted episodic tasks, the update should include γ^t to properly weight actions by when they occur. This is often overlooked in implementations.

Confusing the gradient direction

We do gradient ASCENT on J(θ), not descent. We want to maximize performance, so we add the gradient, not subtract it.

Using action-dependent baselines incorrectly

The baseline must not depend on the action a for the unbiasedness proof to hold. Using Q(s,a) as a baseline introduces bias (though it might still work in practice).

For the Curious Mind

Why "REINFORCE"?

Williams named the algorithm REINFORCE as an acronym: "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility." While the name is a bit contrived, it captures the key components: reward-weighted updates in the direction of the eligibility vector.

Connection to Supervised Learning

For soft-max policies, the eligibility vector ∇ln π is exactly the gradient used in supervised learning with a cross-entropy loss! Policy gradients can thus be seen as supervised learning in which the "labels" (which actions were good) come from the returns rather than from a dataset.

Modern Extensions

Modern policy gradient methods like PPO (Proximal Policy Optimization) and TRPO (Trust Region Policy Optimization) add constraints to prevent too-large policy updates. These methods dominate practical deep RL because they're more stable than vanilla policy gradients while keeping their advantages.

Natural Gradients

The "natural gradient" replaces the standard gradient with one that accounts for the geometry of probability distributions. It often leads to faster, more stable learning by taking steps of equal size in "policy space" rather than "parameter space."

Deterministic Policy Gradients

While this chapter focuses on stochastic policies, deterministic policies μ(s) (outputting a single action, not a distribution) also have a gradient theorem! This is the basis for algorithms like DDPG used in continuous control.