Policy Gradient Methods
Learning Parameterized Policies Directly
Until now, we've learned value functions and derived policies from them. Policy gradient methods take a fundamentally different approach: they learn the policy directly. This enables handling continuous actions, learning stochastic policies, and often leads to more stable learning.
Policy Approximation and its Advantages
Instead of learning action values and deriving a policy (like ε-greedy), we can directly learn a parameterized policy π(a|s,θ) that maps states to action probabilities. The parameter vector θ is what we learn.
Policy gradient methods are methods that learn a parameterized policy π(a|s,θ) by adjusting θ in the direction of the gradient of some performance measure J(θ). The policy can select actions without consulting a value function, though value functions are often used to help learn the policy.
The Basic Update
Policy gradient methods perform gradient ascent on the performance measure:
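θ_{t+1} = θ_t + α ∇̂J(θ_t)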
where ∇̂J(θ_t) is a stochastic estimate that approximates the true gradient of performance with respect to θ.
Soft-max Policy Parameterization
For discrete actions, a common approach is to learn action preferences h(s,a,θ) and convert them to probabilities using a soft-max:
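π(a|s,θ) = e^{h(s,a,θ)} / Σ_b e^{h(s,b,θ)}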
The preferences can be linear: h(s,a,θ) = θ⊤x(s,a), or computed by a neural network.
Advantages Over Value-Based Methods
Approach Determinism
Soft-max policies can approach deterministic behavior, unlike ε-greedy which always has ε probability of random action.
Stochastic Optimal Policies
Can learn optimal stochastic policies (important in games with imperfect information like Poker where bluffing requires randomness).
Simpler Policy
Sometimes the policy is simpler than the value function. In these cases, policy-based methods learn faster.
Continuous Actions
Naturally handle continuous action spaces by learning parameters of a distribution (like mean and variance of a Gaussian).
Consider a corridor where all states look identical under function approximation. An ε-greedy policy must commit to one action as greedy, selecting it with probability close to 1. But the optimal policy might be "go right with probability 0.59." Only a method that parameterizes the action probabilities directly, as policy gradient methods do, can learn and represent that specific probability.
The Policy Gradient Theorem
A fundamental challenge: performance depends on both action selections AND the state distribution, both of which change when we update θ. How can we compute the gradient when the state distribution is unknown?
The policy gradient theorem provides an analytical expression for ∇J(θ) that doesn't require the derivative of the state distribution. This is the theoretical foundation for all policy gradient methods.
Performance Measure
For episodic tasks, performance is the value of the start state:
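J(θ) ≐ v_{π_θ}(s_0)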
The Theorem
The policy gradient theorem establishes:
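∇J(θ) ∝ Σ_s μ(s) Σ_a q_π(s,a) ∇π(a|s,θ)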
Breaking Down the Theorem
- ∇J(θ): gradient of performance, the quantity we want to estimate
- μ(s): the on-policy state distribution, how often we visit each state under π
- q_π(s,a): the action value, how good action a is in state s
- ∇π(a|s,θ): the gradient of the action's probability, how to change θ to make action a more likely
The gradient involves the state distribution μ(s), but we don't need to know its derivative! If we follow the policy π, we naturally visit states according to μ. So we can sample from this expression without ever computing the distribution explicitly.
Smooth Updates
Policy gradient methods have better convergence properties than value-based methods because:
- Action probabilities change smoothly as θ changes
- No sudden policy changes from small value estimate changes
- Guaranteed improvement for small enough step sizes
REINFORCE: Monte Carlo Policy Gradient
REINFORCE is the simplest policy gradient algorithm. It uses complete episode returns to estimate the gradient, making it a Monte Carlo method for policy learning.
A Monte Carlo policy gradient algorithm that updates the policy parameter after each episode using the actual returns. Named by Williams (1992), it stands for "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility."
Deriving the Update
Starting from the policy gradient theorem, we can derive a sample-based update:
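∇J(θ) ∝ E_π[ Σ_a q_π(S_t,a) ∇π(a|S_t,θ) ]                       (states sampled from μ)
       = E_π[ q_π(S_t,A_t) ∇π(A_t|S_t,θ) / π(A_t|S_t,θ) ]        (actions sampled from π)
       = E_π[ G_t ∇ln π(A_t|S_t,θ) ]                              (since E_π[G_t | S_t,A_t] = q_π(S_t,A_t))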
This leads to the REINFORCE update:
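θ_{t+1} = θ_t + α G_t ∇ln π(A_t|S_t,θ_t)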
Understanding the Update
- G_t: the return from time t, which tells us how good this trajectory was
- ∇π/π: the eligibility vector ∇ln π, the direction that makes the taken action more likely
Actions that led to high returns get reinforced (made more likely). Actions that led to low returns get discouraged.
The Eligibility Vector
The fraction ∇π(a|s,θ)/π(a|s,θ) is called the eligibility vector. It equals ∇ln π(a|s,θ). For soft-max policies with linear preferences:
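∇ln π(a|s,θ) = x(s,a) − Σ_b π(b|s,θ) x(s,b)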
This is the feature vector for the taken action minus the average feature vector across all actions.
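As a small numerical illustration (not from the text), the NumPy sketch below computes soft-max probabilities from linear preferences and the eligibility vector above; the feature vectors and θ values are made up for the example.

import numpy as np

# Linear action preferences h(s,a,θ) = θ⊤x(s,a) for 3 actions in one state.
# The feature vectors x(s,a) below are invented purely for illustration.
x = np.array([[1.0, 0.0],    # x(s, a=0)
              [0.0, 1.0],    # x(s, a=1)
              [1.0, 1.0]])   # x(s, a=2)
theta = np.array([0.5, -0.2])

h = x @ theta                 # preferences h(s,a,θ)
pi = np.exp(h - h.max())      # soft-max (shifted for numerical stability)
pi /= pi.sum()

a = 2                         # suppose action 2 was taken
grad_ln_pi = x[a] - pi @ x    # ∇ln π(a|s,θ) = x(s,a) − Σ_b π(b|s,θ) x(s,b)
print(pi, grad_ln_pi)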
REINFORCE Algorithm
Algorithm: REINFORCE (Monte Carlo Policy Gradient)
Input: differentiable policy parameterization π(a|s,θ)
Parameters: step size α > 0
Initialize: θ arbitrarily
Loop forever (for each episode):
Generate episode S₀, A₀, R₁, ..., S_{T-1}, A_{T-1}, R_T
following π(·|·,θ)
Loop for each step t = 0, 1, ..., T-1:
G ← Σ_{k=t+1}^T γ^{k-t-1} R_k # Return from step t
θ ← θ + α γ^t G ∇ln π(A_t|S_t,θ)

REINFORCE has good theoretical properties but high variance because it uses complete returns. This can lead to slow learning. The baseline technique (next section) addresses this issue.
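A minimal Python sketch of the boxed algorithm, reusing the soft-max eligibility vector from earlier. It assumes a toy environment object with reset() and step(a) returning (next_state, reward, done), and a feature function x(s) returning one row of features per action; these names are illustrative, not part of the original algorithm.

import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def reinforce_episode(env, theta, x, alpha=0.01, gamma=0.99):
    """One REINFORCE update pass over a single episode.
    x(s) -> (n_actions, n_features) feature matrix; theta -> (n_features,)."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                              # generate an episode following π(·|·,θ)
        feats = x(s)
        pi = softmax(feats @ theta)
        a = np.random.choice(len(pi), p=pi)
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    G = 0.0
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):      # compute returns G_t backwards
        G = rewards[t] + gamma * G
        returns[t] = G
    for t in range(len(rewards)):                # θ ← θ + α γ^t G_t ∇ln π(A_t|S_t,θ)
        feats = x(states[t])
        pi = softmax(feats @ theta)
        grad_ln_pi = feats[actions[t]] - pi @ feats
        theta += alpha * (gamma ** t) * returns[t] * grad_ln_pi
    return theta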
REINFORCE with Baseline
We can reduce REINFORCE's variance by subtracting a baseline from the return. As long as the baseline doesn't depend on the action, it doesn't change the expected update but can dramatically reduce variance.
A function b(s) subtracted from the return to reduce variance. The most common choice is the estimated state value v̂(s,w), which centers the returns around zero on average.
Why Baselines Work
Adding a baseline b(s) doesn't change the expected gradient:
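Σ_a b(s) ∇π(a|s,θ) = b(s) ∇ Σ_a π(a|s,θ) = b(s) ∇1 = 0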
Since action probabilities sum to 1, the gradient of their sum is zero. So adding any state-dependent baseline doesn't bias the update.
REINFORCE with Baseline Update
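θ_{t+1} = θ_t + α (G_t − b(S_t)) ∇ln π(A_t|S_t,θ_t)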
Intuition
- • If G_t > b(S_t): action was better than average → increase probability
- • If G_t < b(S_t): action was worse than average → decrease probability
- • The baseline tells us what return to "expect" from each state
Complete Algorithm
Algorithm: REINFORCE with Baseline
Input: differentiable policy π(a|s,θ)
Input: differentiable state-value function v̂(s,w)
Parameters: step sizes α^θ > 0, α^w > 0
Initialize: θ, w arbitrarily
Loop forever (for each episode):
Generate episode following π(·|·,θ)
Loop for each step t = 0, 1, ..., T-1:
G ← Σ_{k=t+1}^T γ^{k-t-1} R_k
δ ← G - v̂(S_t,w) # Advantage estimate
w ← w + α^w δ ∇v̂(S_t,w) # Update critic
θ ← θ + α^θ γ^t δ ∇ln π(A_t|S_t,θ) # Update actor

Adding a baseline can dramatically speed up learning. In experiments, REINFORCE with baseline often learns 5-10x faster than plain REINFORCE. The estimated state value v̂(S_t,w) is the standard choice and comes close to minimizing variance, though other baselines can also work well.
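To make the change from plain REINFORCE concrete, here is a minimal sketch of the two inner-loop updates with a linear state-value baseline v̂(s,w) = w⊤x_v(s); the function name and arguments are illustrative only.

import numpy as np

def baseline_step(theta, w, G_t, gamma_t, x_v_s, grad_ln_pi,
                  alpha_theta=0.01, alpha_w=0.1):
    """One time-step of REINFORCE with a linear baseline v̂(s,w) = w⊤x_v(s)."""
    delta = G_t - w @ x_v_s                                      # δ = G_t − v̂(S_t,w)
    w = w + alpha_w * delta * x_v_s                              # critic: ∇v̂(S_t,w) = x_v(S_t)
    theta = theta + alpha_theta * gamma_t * delta * grad_ln_pi   # actor update
    return theta, w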
Actor-Critic Methods
REINFORCE uses the full return, which requires waiting until the episode ends. Actor-critic methods use TD learning to update on every step, dramatically reducing variance at the cost of some bias.
A policy gradient method with two components: the actor (the learned policy π(a|s,θ)) and the critic (the learned value function v̂(s,w) that evaluates the actor's choices). The critic bootstraps to provide lower-variance updates.
One-Step Actor-Critic
Instead of the full return G_t, use the one-step TD target:
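G_{t:t+1} = R_{t+1} + γ v̂(S_{t+1},w)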
where the TD error is:
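δ_t = R_{t+1} + γ v̂(S_{t+1},w) − v̂(S_t,w)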
Why This Works
The TD error δ_t estimates the advantage of action A_t:
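E_π[δ_t | S_t, A_t] ≈ q_π(S_t,A_t) − v_π(S_t)   (exact when v̂ = v_π)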
- • δ_t > 0: action was better than expected → reinforce
- • δ_t < 0: action was worse than expected → discourage
One-Step Actor-Critic Algorithm
Algorithm: One-Step Actor-Critic
Input: differentiable policy π(a|s,θ) and value function v̂(s,w)
Parameters: step sizes α^θ > 0, α^w > 0
Initialize: θ, w arbitrarily
Loop forever (for each episode):
Initialize S, I ← 1
Loop while S is not terminal:
A ~ π(·|S,θ)
Take action A, observe R, S'
δ ← R + γv̂(S',w) - v̂(S,w) # TD error
w ← w + α^w δ ∇v̂(S,w) # Update critic
θ ← θ + α^θ I δ ∇ln π(A|S,θ) # Update actor
I ← γI
S ← S'

Actor-Critic with Eligibility Traces
We can add eligibility traces to both actor and critic for n-step or λ-return methods:
Actor-Critic with Traces
Eligibility trace updates:
z^w ← γλ^w z^w + ∇v̂(S,w) # Critic trace
z^θ ← γλ^θ z^θ + I ∇ln π(A|S,θ) # Actor trace
Weight updates:
w ← w + α^w δ z^w # Update critic
θ ← θ + α^θ δ z^θ # Update actor

Actor
- • The learned policy π(a|s,θ)
- • Decides which actions to take
- • Updated toward better actions
Critic
- • The learned value function v̂(s,w)
- • Evaluates states and actions
- • Provides TD error for actor updates
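A minimal, illustrative Python version of the one-step actor-critic loop above, under the same assumptions as the earlier sketches (linear features, a toy environment with reset() and step(a) returning (next_state, reward, done)):

import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def one_step_actor_critic(env, x_pi, x_v, n_features, n_v_features,
                          alpha_theta=0.01, alpha_w=0.1, gamma=0.99,
                          n_episodes=500):
    """x_pi(s) -> (n_actions, n_features); x_v(s) -> (n_v_features,)."""
    theta = np.zeros(n_features)
    w = np.zeros(n_v_features)
    for _ in range(n_episodes):
        s, done, I = env.reset(), False, 1.0
        while not done:
            pi = softmax(x_pi(s) @ theta)
            a = np.random.choice(len(pi), p=pi)
            s_next, r, done = env.step(a)
            v_s = w @ x_v(s)
            v_next = 0.0 if done else w @ x_v(s_next)     # v̂(terminal) = 0
            delta = r + gamma * v_next - v_s              # TD error δ
            w += alpha_w * delta * x_v(s)                 # update critic
            grad_ln_pi = x_pi(s)[a] - pi @ x_pi(s)        # soft-max eligibility vector
            theta += alpha_theta * I * delta * grad_ln_pi # update actor
            I *= gamma
            s = s_next
    return theta, w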
Policy Gradient for Continuing Problems
For continuing problems without episode boundaries, we define performance as the average reward rate rather than the start state value.
Average Reward Setting
Performance is the average reward per time step:
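J(θ) ≐ r(π) ≐ lim_{h→∞} (1/h) Σ_{t=1}^{h} E[R_t] = Σ_s μ(s) Σ_a π(a|s,θ) Σ_{s',r} p(s',r|s,a) r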
This equals the expected immediate reward under the stationary distribution μ.
Differential Returns
Values are defined with respect to differential returns (rewards minus the average):
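G_t ≐ (R_{t+1} − r(π)) + (R_{t+2} − r(π)) + (R_{t+3} − r(π)) + ⋯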
Remarkably, the policy gradient theorem remains the same in the continuing case. The only changes are: (1) performance is average reward, and (2) values use differential returns. The gradient formula is unchanged.
Continuing Actor-Critic
Algorithm: Actor-Critic for Continuing Tasks
Initialize: R̄ ← 0, w, θ, z^w ← 0, z^θ ← 0
Loop forever (for each time step):
A ~ π(·|S,θ)
Take action A, observe R, S'
δ ← R - R̄ + v̂(S',w) - v̂(S,w) # Differential TD error
R̄ ← R̄ + α^R̄ δ # Update average reward
z^w ← λ^w z^w + ∇v̂(S,w)
z^θ ← λ^θ z^θ + ∇ln π(A|S,θ)
w ← w + α^w δ z^w
θ ← θ + α^θ δ z^θ
S ← S'

Policy Parameterization for Continuous Actions
One of the biggest advantages of policy gradient methods is their natural handling of continuous action spaces. Instead of learning probabilities for each action, we learn the parameters of a probability distribution.
For continuous actions, a common approach is to output a Gaussian (normal) distribution. The policy network outputs the mean μ(s,θ) and standard deviation σ(s,θ), and actions are sampled from N(μ, σ²).
Gaussian Policy
The policy is defined as a normal distribution over actions:
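π(a|s,θ) = (1 / (σ(s,θ) √(2π))) exp( −(a − μ(s,θ))² / (2 σ(s,θ)²) )

Here the π inside √(2π) is the numerical constant, not the policy.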
Gaussian Policy Components
- μ(s,θ): the mean of the distribution, the "preferred" action
- σ(s,θ): the standard deviation, which controls the amount of exploration (usually σ = exp(θ_σ⊤x_σ(s)))
Linear Parameterization
A simple linear form:
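μ(s,θ) = θ_μ⊤ x_μ(s),    σ(s,θ) = exp(θ_σ⊤ x_σ(s))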
The exponential ensures σ is always positive.
Eligibility Vectors for Gaussian Policy
The eligibility vectors (∇ln π) for the Gaussian policy:
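∇_{θ_μ} ln π(a|s,θ) = ((a − μ(s,θ)) / σ(s,θ)²) x_μ(s)
∇_{θ_σ} ln π(a|s,θ) = (((a − μ(s,θ))² / σ(s,θ)²) − 1) x_σ(s)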
Gaussian policies are used extensively in robotics and control. Examples include: robotic arm control, autonomous driving, game playing with continuous controls, and any domain where actions are real-valued (torques, velocities, positions).
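As a quick, illustrative check of these gradients, the sketch below samples an action from a linear Gaussian policy and evaluates both eligibility vectors; all feature and parameter values are made up for the example.

import numpy as np

# Invented state features and parameters for a linear Gaussian policy.
x_mu = np.array([1.0, 0.5])            # x_μ(s)
x_sigma = np.array([0.3, -0.1])        # x_σ(s)
theta_mu = np.array([0.2, -0.4])
theta_sigma = np.array([0.1, 0.0])

mu = theta_mu @ x_mu                   # μ(s,θ) = θ_μ⊤ x_μ(s)
sigma = np.exp(theta_sigma @ x_sigma)  # σ(s,θ) = exp(θ_σ⊤ x_σ(s))

a = np.random.normal(mu, sigma)        # sample a continuous action

grad_mu = ((a - mu) / sigma**2) * x_mu                  # ∇_{θ_μ} ln π(a|s,θ)
grad_sigma = (((a - mu)**2 / sigma**2) - 1) * x_sigma   # ∇_{θ_σ} ln π(a|s,θ)
print(a, grad_mu, grad_sigma)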
Beyond Gaussians
While Gaussian policies are common, other distributions can be used depending on the action space constraints:
- • Beta distribution: For bounded actions [0, 1]
- • Categorical: For discrete actions (soft-max)
- • Mixture models: For multi-modal action distributions
- • Normalizing flows: For complex distributions (deep RL)
Summary
Policy gradient methods represent a fundamental shift from value-based to policy-based learning. They offer unique advantages and have become central to modern reinforcement learning, especially in deep RL.
Key Takeaways
- Policy gradient methods learn a parameterized policy directly
- The policy gradient theorem enables gradient estimation without knowing state distribution derivatives
- REINFORCE is a simple Monte Carlo policy gradient with high variance
- Baselines reduce variance without introducing bias
- Actor-critic methods combine policy gradients with TD learning
- Continuous action spaces are naturally handled via Gaussian policies
Method Comparison
Value-Based (Q-learning, etc.)
- • Learn Q(s,a), derive policy
- • Discrete actions typically
- • Can have unstable policy changes
- • Off-policy learning possible
Policy Gradient
- • Learn π(a|s,θ) directly
- • Continuous actions natural
- • Smooth policy updates
- • Convergence guarantees (to a local optimum)
Glossary
- Policy Gradient
- The gradient of performance J(θ) with respect to policy parameters θ. Used to update the policy in the direction of improvement.
- REINFORCE
- A Monte Carlo policy gradient algorithm that uses complete episode returns. Simple but high variance.
- Baseline
- A function subtracted from the return to reduce variance without changing the expected gradient. Often the state value function.
- Actor
- The learned policy π(a|s,θ) in actor-critic methods. Makes action decisions.
- Critic
- The learned value function v̂(s,w) in actor-critic methods. Evaluates states to guide actor updates.
- Eligibility Vector
- The vector ∇ln π(a|s,θ), indicating the direction to change θ to make action a more likely in state s.
- Soft-max Policy
- A policy that converts action preferences to probabilities using the soft-max function. Common for discrete actions.
- Gaussian Policy
- A policy that outputs mean and standard deviation of a normal distribution for continuous actions.
Formula Cheat Sheet
Core Formulas
Policy Gradient Theorem: ∇J(θ) ∝ Σ_s μ(s) Σ_a q_π(s,a) ∇π(a|s,θ)
REINFORCE Update: θ ← θ + α γ^t G_t ∇ln π(A_t|S_t,θ)
REINFORCE with Baseline: θ ← θ + α γ^t (G_t − v̂(S_t,w)) ∇ln π(A_t|S_t,θ)
Actor-Critic Update: θ ← θ + α^θ I δ ∇ln π(A_t|S_t,θ),  with δ = R_{t+1} + γ v̂(S_{t+1},w) − v̂(S_t,w)
Soft-max Policy: π(a|s,θ) = e^{h(s,a,θ)} / Σ_b e^{h(s,b,θ)}
Common Mistakes
Not using a baseline with REINFORCE
Plain REINFORCE has very high variance. Always use at least a simple baseline (like the average return) to speed up learning significantly.
Using too large a step size
Policy gradient methods can be unstable with large step sizes. The policy gradient theorem guarantees improvement only for "sufficiently small" α.
Forgetting the γ^t factor in episodic tasks
For discounted episodic tasks, the update should include γ^t to properly weight actions by when they occur. This is often overlooked in implementations.
Confusing the gradient direction
We do gradient ASCENT on J(θ), not descent. We want to maximize performance, so we add the gradient, not subtract it.
Using action-dependent baselines incorrectly
The baseline must not depend on the action a for the unbiasedness proof to hold. Using Q(s,a) as a baseline introduces bias (though it might still work in practice).
For the Curious Mind
Why "REINFORCE"?
Williams named the algorithm REINFORCE as an acronym: "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility." While the name is a bit contrived, it captures the key components: reward-weighted updates in the direction of the eligibility vector.
Connection to Supervised Learning
The eligibility vector ∇ln π is exactly the gradient used in supervised learning with cross-entropy loss! Policy gradients can be seen as supervised learning where the "labels" (which actions were good) come from the returns rather than a dataset.
Modern Extensions
Modern policy gradient methods like PPO (Proximal Policy Optimization) and TRPO (Trust Region Policy Optimization) add constraints to prevent too-large policy updates. These methods dominate practical deep RL because they're more stable than vanilla policy gradients while keeping their advantages.
Natural Gradients
The "natural gradient" replaces the standard gradient with one that accounts for the geometry of probability distributions. It often leads to faster, more stable learning by taking steps of equal size in "policy space" rather than "parameter space."
Deterministic Policy Gradients
While this chapter focuses on stochastic policies, deterministic policies μ(s) (outputting a single action, not a distribution) also have a gradient theorem! This is the basis for algorithms like DDPG used in continuous control.