Eligibility Traces
Unifying TD and Monte Carlo Methods
Eligibility traces provide one of the most important unifying mechanisms in reinforcement learning. They elegantly bridge the gap between TD(0) and Monte Carlo methods, allowing you to smoothly interpolate between the two extremes and often achieve better performance than either.
The λ-return
Before we can understand eligibility traces, we need to understand the λ-return—a clever way to combine all possible n-step returns into a single, weighted average. This is the theoretical foundation that makes TD(λ) so powerful.
In n-step TD, we had to choose how many steps to look ahead. But what if we could use ALL possible n-step returns, from 1-step all the way to the full Monte Carlo return? That's exactly what the λ-return does—it's a weighted combination of all n-step returns.
The Compound Return
The λ-return combines all n-step returns using a clever weighting scheme. Each n-step return gets weight (1-λ)λⁿ⁻¹, which forms a geometric series that sums to 1:
G_t^λ = (1 - λ) Σ_{n=1}^∞ λⁿ⁻¹ G_{t:t+n}
Breaking Down the Math
- G_t^λ: The λ-return starting at time t, our combined estimate
- (1-λ): Normalization factor ensuring the weights sum to 1
- λⁿ⁻¹: Weight for the n-step return, decaying exponentially with n
- G_{t:t+n}: The n-step return (rewards plus a bootstrapped value)
The λ Parameter
The parameter λ ∈ [0, 1] controls how we blend short and long-term returns:
λ = 0: Pure TD(0)
All weight goes to the 1-step return. Fast updates, high bias, low variance.
λ = 1: Monte Carlo
All weight goes to the complete return. No bias, high variance, needs episode end.
Think of λ as a "decay rate for trust." With high λ, you trust longer returns more (looking further into the actual future). With low λ, you trust your current value estimates more (bootstrapping sooner). Most practical algorithms use λ between 0.8 and 0.95.
Weight Distribution
Here's how the weights distribute across n-step returns for different λ values:
| n-step | λ=0 | λ=0.5 | λ=0.9 |
|---|---|---|---|
| 1-step | 100% | 50% | 10% |
| 2-step | 0% | 25% | 9% |
| 3-step | 0% | 12.5% | 8.1% |
| n ≥ 4 (remaining weight, toward MC) | 0% | 12.5% | ~72.9% |
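A quick way to check these numbers (a minimal Python sketch; the values are just the geometric weights from the formula above):

# Weight on the n-step return in the λ-return: (1 - λ) * λ**(n - 1)
lam = 0.9
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 4)]
print([round(w, 3) for w in weights])   # [0.1, 0.09, 0.081]
print(round(1 - sum(weights), 3))       # 0.729 -> weight remaining on n >= 4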
Termination Handling
In episodic tasks, the λ-return needs special handling when an episode ends at time T. After termination, all remaining weight goes to the actual Monte Carlo return:
G_t^λ = (1-λ) Σ_{n=1}^{T-t-1} λⁿ⁻¹ G_{t:t+n} + λ^{T-t-1} G_t
The λ-return represents the "forward view"—to compute it, you need to look forward in time to see all future rewards. This isn't practical for online learning. The magic of eligibility traces is that they provide an equivalent "backward view" that can be computed incrementally!
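To make the forward view concrete, here is a small sketch (not from the text) that computes G_t^λ for every step of a finished episode from its rewards and the current value estimates, using the equivalent recursion G_t^λ = R_{t+1} + γ[(1-λ)v̂(S_{t+1}) + λG_{t+1}^λ]:

# Forward-view λ-return for one finished episode (illustrative sketch).
# rewards[t] is R_{t+1}; values[t] is v̂(S_t); values[T] = 0 at the terminal state.
def lambda_returns(rewards, values, gamma, lam):
    T = len(rewards)
    G = [0.0] * T
    g = 0.0                       # λ-return from beyond the episode end
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
        G[t] = g
    return G                      # G[t] is G_t^λ

Note that the whole episode must be available before any G_t^λ can be computed, which is exactly why the forward view is not suited to online learning.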
TD(λ)
TD(λ) is the algorithm that makes the λ-return practical for online learning. Instead of looking forward to compute returns, it looks backward using eligibility traces—a memory of which states contributed to the current situation.
An eligibility trace is a temporary record of which states (or state-action pairs) have been visited recently. When a TD error occurs, credit or blame is assigned to all states in proportion to their eligibility—how recently and frequently they were visited.
The Eligibility Trace Vector
In function approximation, we maintain an eligibility trace vector z that has the same dimension as our weight vector w. It accumulates and decays over time:
z_t = γλ z_{t-1} + ∇v̂(S_t, w_t)
Understanding the Trace Update
- γλz_{t-1}: Decay the old trace by γλ (discount × trace parameter)
- ∇v̂(S_t, w_t): Add the gradient of the value function at the current state
- z_0 = 0: Traces start at zero at the beginning of each episode
The TD(λ) Weight Update
The weight update uses the TD error δ and distributes it across all features according to their eligibility:
δ_t = R_{t+1} + γ v̂(S_{t+1}, w_t) - v̂(S_t, w_t)
w_{t+1} = w_t + α δ_t z_t
The eligibility trace acts like a "bridge to the past." When you get a reward, the trace tells you which states led to this moment. States visited recently have high eligibility and get more credit. States visited long ago have low eligibility (due to decay) and get less credit. This is exactly what the λ-return does, but computed efficiently!
Semi-gradient TD(λ) Algorithm
Algorithm: Semi-gradient TD(λ)
Input: policy π, feature function x(s)
Parameters: step size α ∈ (0, 1], trace decay λ ∈ [0, 1]
Initialize: w arbitrarily, z = 0
For each episode:
    Initialize S (first state of episode)
    z ← 0                              # Reset eligibility trace
    For each step of episode:
        A ← action given by π for S
        Take action A, observe R, S'
        δ ← R + γv̂(S', w) - v̂(S, w)   # TD error
        z ← γλz + ∇v̂(S, w)             # Update trace
        w ← w + αδz                     # Update weights
        S ← S'
    until S is terminal

Linear Function Approximation
For linear function approximation, where v̂(s, w) = w⊤x(s), the gradient is simply the feature vector: ∇v̂(s, w) = x(s), so the trace update reduces to z ← γλz + x(S) (see the sketch after the list below).
There are different types of eligibility traces:
- Accumulating traces: Add 1 (or x) each visit, as shown above
- Replacing traces: Reset to 1 on visit, don't accumulate
- Dutch traces: A more sophisticated variant we'll see later
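As a concrete companion to the pseudocode above, here is a minimal NumPy sketch of semi-gradient TD(λ) with linear features and accumulating traces. The env.reset()/env.step() interface, the policy(s) function, and the feature function x(s) are assumptions for illustration:

import numpy as np

def semi_gradient_td_lambda(env, policy, x, d, num_episodes,
                            alpha=0.1, gamma=0.99, lam=0.9):
    # Linear value function: v̂(s, w) = w @ x(s), so the gradient is just x(s).
    w = np.zeros(d)
    for _ in range(num_episodes):
        s = env.reset()
        z = np.zeros(d)                     # eligibility trace, reset each episode
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)   # assumed to return (state, reward, done)
            v = w @ x(s)
            v_next = 0.0 if done else w @ x(s_next)
            delta = r + gamma * v_next - v  # TD error
            z = gamma * lam * z + x(s)      # accumulating trace
            w = w + alpha * delta * z       # credit all recently visited features
            s = s_next
    return w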
n-step Truncated λ-return Methods
The offline λ-return algorithm requires waiting until the end of the episode to make updates. The truncated λ-return provides a middle ground—it uses a horizon of n steps and can make updates before the episode ends.
Instead of summing to infinity (or episode end), the truncated λ-return only looks ahead n steps. This allows for online updates while still capturing multi-step benefits.
The Truncated λ-return
With horizon h = t + n, the truncated λ-return is
G_{t:t+n}^λ = (1-λ) Σ_{k=1}^{n-1} λᵏ⁻¹ G_{t:t+k} + λⁿ⁻¹ G_{t:t+n}
The key insight is that this becomes the full λ-return when the horizon extends to the episode end, but it can be computed with only n steps of lookahead.
Trade-offs
Advantages
- Can update during episode
- Bounded memory and computation
- Practical for continuing tasks
Disadvantages
- Updates delayed by n steps
- May miss long-range dependencies
- Extra complexity vs TD(λ)
Truncated TD(λ)
The update at time t uses information up to time t+n:
w_{t+n} = w_{t+n-1} + α [G_{t:t+n}^λ - v̂(S_t, w_{t+n-1})] ∇v̂(S_t, w_{t+n-1})
The horizon n controls the trade-off between update delay and accuracy. Larger n means more accurate λ-return approximation but longer delays. In practice, n values of 3-10 often work well. The key is that n should be at least as large as the typical reward delay.
Redoing Updates: Online λ-return Algorithm
The offline λ-return algorithm waits until the end of an episode to make all updates. But what if we could get the same result while updating online? The online λ-return algorithm achieves this by "redoing" updates as new information arrives.
Instead of making one update per state at the end, make updates online but redo earlier updates whenever you get new information. This produces exactly the same final weights as the offline algorithm, but with online updates.
The Online λ-return Algorithm
At each time step h, we maintain a sequence of weight vectors w₀ʰ, w₁ʰ, ..., wₕʰ that represent what the weights would be if the episode had ended at time h:
w_{t+1}ʰ = w_tʰ + α [G_{t:h}^λ - v̂(S_t, w_tʰ)] ∇v̂(S_t, w_tʰ),   0 ≤ t < h ≤ T
How Redoing Works
Think of it as maintaining a "sliding window" of weight histories:
1. At the new horizon h, start again from w₀ʰ, the weights the episode began with
2. Redo the updates for every earlier time step t < h, now using λ-returns truncated at horizon h
3. The final weight vector wₕʰ is used to act at the next step
The naive online λ-return algorithm redoes h updates at step h, so an episode of length T costs O(T²) updates in total, compared with O(T) for TD(λ). This makes it impractical for long episodes. Fortunately, True Online TD(λ) solves this problem elegantly.
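A rough sketch of the redoing scheme (illustrative only): truncated_return(t, h, w) stands in for computing G_{t:h}^λ with the current weights, and states/x are assumed to be available from the episode so far.

# Naive online λ-return: at each new horizon h, restart from the weights the
# episode began with and redo all earlier updates with returns truncated at h.
def online_lambda_return_step(w_start, alpha, h, states, x, truncated_return):
    w = w_start.copy()                       # w_0^h: weights from the episode start
    for t in range(h):                       # redo updates for t = 0, ..., h-1
        G = truncated_return(t, h, w)        # G_{t:h}^λ (assumed helper)
        w = w + alpha * (G - w @ x(states[t])) * x(states[t])
    return w                                 # this is w_h^h, used to act at step h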
Why Bother with Redoing?
The online λ-return algorithm is primarily of theoretical interest. It shows that:
- Online updates CAN produce the same result as offline λ-return
- It provides a bridge to understanding True Online TD(λ)
- It motivates the development of more efficient algorithms
True Online TD(λ)
True Online TD(λ) is one of the most important algorithms in this chapter. It achieves the same result as the online λ-return algorithm but with O(d) complexity per step (where d is the number of features)—the same as regular TD(λ)!
An efficient algorithm that produces exactly the same sequence of weight vectors as the online λ-return algorithm, but with linear-time updates. The key innovation is the Dutch trace, which modifies how eligibility traces accumulate.
The Dutch Trace
The magic of True Online TD(λ) lies in its modified eligibility trace:
z_t = γλ z_{t-1} + (1 - αγλ z_{t-1}⊤ x_t) x_t
What Makes It Different
Compare to the standard accumulating trace z = γλz + x:
- Standard: Just adds the feature vector x
- Dutch: Adds (1 - αγλz⊤x)x, a "corrected" feature vector
The correction term (1 - αγλz⊤x) ensures that repeated visits to similar states don't cause over-crediting.
Complete Algorithm
Algorithm: True Online TD(λ)
Input: policy π, linear features x(s) with d components
Parameters: α ∈ (0, 1], λ ∈ [0, 1]
Initialize: w ∈ ℝᵈ arbitrarily
For each episode:
    Initialize S, x ← x(S)
    z ← 0                  # Eligibility trace (d-dimensional)
    V_old ← 0              # Previous value estimate
    For each step of episode:
        A ← action given by π for S
        Take action A, observe R, S'
        x' ← x(S')
        V ← w⊤x            # Current value
        V' ← w⊤x'          # Next value
        δ ← R + γV' - V    # TD error
        # Dutch trace update
        z ← γλz + (1 - αγλz⊤x)x
        # Weight update with extra correction
        w ← w + α(δ + V - V_old)z - α(V - V_old)x
        V_old ← V'
        x ← x'
        S ← S'
    until S is terminal

Why "True Online"?
True Online
- Updates are online (every step)
- Result equals offline λ-return
- O(d) per step complexity
- Theoretically optimal

Regular TD(λ)
- Updates are online
- Result differs from λ-return
- O(d) per step complexity
- Good approximation
Use True Online TD(λ) when you want the theoretical guarantees of the λ-return with efficient online updates. It's particularly valuable in domains where the extra complexity is worth the improved convergence properties, like robotics or game AI.
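For reference, here is the inner-loop update from the pseudocode above written as a small NumPy step function; the signature and the terminal-state handling are illustrative choices, not part of the text:

import numpy as np

def true_online_td_step(w, z, v_old, x, x_next, r, done,
                        alpha=0.1, gamma=0.99, lam=0.9):
    # One step of True Online TD(λ) with linear features (Dutch trace).
    v = w @ x
    v_next = 0.0 if done else w @ x_next
    delta = r + gamma * v_next - v                                   # TD error
    z = gamma * lam * z + (1 - alpha * gamma * lam * (z @ x)) * x    # Dutch trace
    w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
    return w, z, v_next                                              # v_next becomes V_old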
Dutch Traces in Monte Carlo Learning
The Dutch trace idea can also improve Monte Carlo learning. When λ = 1, TD(λ) becomes equivalent to Monte Carlo, but the Dutch trace still provides benefits by preventing over-crediting when states are revisited within an episode.
In standard MC with function approximation, if you visit similar states multiple times in an episode, their feature vectors accumulate and can cause excessive updates. Dutch traces automatically correct for this overlap.
The Problem with Accumulating Traces
Consider visiting states with similar (or identical) feature vectors. With accumulating traces, the eligibility grows proportionally to visits. With Dutch traces, the overlap is automatically accounted for:
Example: Revisiting a State
Suppose x = [1, 0, 0], γ = λ = 1, and we visit this state on two consecutive steps with α = 0.1:
| Trace Type | After 1st visit | After 2nd visit |
|---|---|---|
| Accumulating | [1, 0, 0] | [2, 0, 0] |
| Dutch | [1, 0, 0] | [1.9, 0, 0] |
Dutch traces grow more slowly, preventing over-crediting for repeated visits.
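The numbers in the table can be reproduced with a few lines of NumPy (assuming, as above, γ = λ = 1 and two consecutive visits):

import numpy as np

x = np.array([1.0, 0.0, 0.0])
alpha, gamma_lam = 0.1, 1.0      # γλ = 1 here, as in the Monte Carlo (λ = 1) setting

z = np.zeros(3)
for visit in range(2):
    z = gamma_lam * z + (1 - alpha * gamma_lam * (z @ x)) * x
    print(z)                     # 1st visit: [1. 0. 0.], 2nd visit: [1.9 0. 0.]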
The name "Dutch trace" comes from Harm van Seijen, who developed this technique. It's also sometimes called the "correction trace" because it corrects for the overlap between successive feature vectors.
Sarsa(λ)
Sarsa(λ) extends eligibility traces to control problems. Instead of learning state values, we learn action values q(s, a), and the traces now track state-action pairs.
The action-value version of TD(λ). It uses eligibility traces to credit all recently visited state-action pairs when a TD error occurs, allowing faster learning of the action-value function.
The Sarsa(λ) Update
The key equations for Sarsa(λ) parallel those of TD(λ), but for action values:
δ_t = R_{t+1} + γ q̂(S_{t+1}, A_{t+1}, w) - q̂(S_t, A_t, w)
z_t = γλ z_{t-1} + ∇q̂(S_t, A_t, w)
w ← w + α δ_t z_t
Algorithm
Algorithm: Semi-gradient Sarsa(λ)
Input: feature function x(s, a)
Parameters: α ∈ (0, 1], λ ∈ [0, 1], ε > 0
Initialize: w ∈ ℝᵈ arbitrarily
For each episode:
    Initialize S
    A ← ε-greedy action for S based on q̂(S, ·, w)
    z ← 0
    For each step of episode:
        Take action A, observe R, S'
        if S' is terminal:
            δ ← R - q̂(S, A, w)
            w ← w + αδz
            break
        A' ← ε-greedy action for S' based on q̂(S', ·, w)
        δ ← R + γq̂(S', A', w) - q̂(S, A, w)
        z ← γλz + ∇q̂(S, A, w)
        w ← w + αδz
        S ← S', A ← A'

True Online Sarsa(λ)
Just as with TD(λ), there's a "true online" version of Sarsa(λ) that uses Dutch traces, with the state-action feature vector x = x(S, A) playing the role that x(S) plays in True Online TD(λ).
Sarsa(λ) often outperforms Sarsa(0) because it can assign credit to actions taken several steps before a reward. This is especially important in sparse reward problems where rewards come infrequently. Values of λ between 0.8 and 0.95 typically work well.
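A compact sketch of the same inner loop with linear action features and an ε-greedy policy; the feature function x(s, a), the action list, and the helper names are assumptions for illustration:

import numpy as np

def epsilon_greedy(w, x, s, actions, eps=0.1):
    # Pick a random action with probability eps, otherwise the greedy one.
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    q = [w @ x(s, a) for a in actions]
    return actions[int(np.argmax(q))]

def sarsa_lambda_step(w, z, s, a, r, s_next, a_next, done, x,
                      alpha=0.1, gamma=0.99, lam=0.9):
    q = w @ x(s, a)
    q_next = 0.0 if done else w @ x(s_next, a_next)
    delta = r + gamma * q_next - q
    z = gamma * lam * z + x(s, a)     # accumulating trace over state-action features
    w = w + alpha * delta * z
    return w, z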
Mountain Car Example
In the Mountain Car problem, the agent must build momentum to reach the goal. With Sarsa(0), learning is slow because only the final action before reaching the goal gets credit. With Sarsa(λ), all the actions that built momentum receive credit, leading to much faster learning.
Experiments show Sarsa(λ=0.9) can solve Mountain Car in ~200 episodes, while Sarsa(0) may take ~1000+ episodes.
Variable λ and γ
So far we've treated λ and γ as fixed constants. But in some cases, it's useful to let them vary based on state or time. This leads to more general algorithms with state-dependent bootstrapping and termination.
Instead of fixed λ and γ, we can use functions λ(s) and γ(s) that depend on the current state. This allows the algorithm to bootstrap more in some states and less in others, adapting to the structure of the problem.
Generalized λ-return
With state-dependent parameters, the λ-return is most naturally written recursively:
G_t^λ = R_{t+1} + γ(S_{t+1}) [ (1 - λ(S_{t+1})) v̂(S_{t+1}, w) + λ(S_{t+1}) G_{t+1}^λ ]
Understanding Variable Parameters
- λ(s): Controls bootstrapping; low λ bootstraps more, high λ relies more on actual returns
- γ(s): Controls termination; γ(s) = 0 acts like an episode end at s, γ(s) = 1 means no discounting there
Use Cases for Variable Parameters
Variable γ
Model continuing tasks with "soft" termination, handle safety-critical states differently, or implement options/temporally abstract actions.
Variable λ
Bootstrap more in well-understood states, use longer traces when value estimates are uncertain, adapt to state-dependent noise.
Modified Trace Update
With variable parameters, the eligibility trace update becomes:
z_t = γ(S_t) λ(S_t) z_{t-1} + ∇v̂(S_t, w_t)
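In code, the only change from the fixed-parameter case is that the decay factor is looked up per state; gamma_fn and lambda_fn below are illustrative stand-ins for γ(s) and λ(s):

def variable_trace_update(z, grad_v, s, gamma_fn, lambda_fn):
    # State-dependent decay: γ(S_t)λ(S_t) replaces the constant γλ.
    return gamma_fn(s) * lambda_fn(s) * z + grad_v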
Variable parameters add complexity but can significantly improve performance in some domains. They're most useful when you have domain knowledge about which states need more or less bootstrapping. Without such knowledge, fixed parameters often work well.
Off-policy Traces with Control Variates
Extending eligibility traces to off-policy learning is challenging. The behavior policy (what we actually do) differs from the target policy (what we want to learn about). Importance sampling with control variates provides a solution.
In off-policy learning, actions come from behavior policy b but we want to learn the value of target policy π. Naive importance sampling can have very high variance. Control variates reduce this variance by subtracting terms that have zero expected value.
Per-decision Importance Sampling
The importance sampling ratio for a single action is:
ρ_t = π(A_t | S_t) / b(A_t | S_t)
This ratio reweights each step to account for the policy difference.
Off-policy Eligibility Traces
The eligibility trace for off-policy learning incorporates importance sampling:
z_t = ρ_t (γλ z_{t-1} + ∇v̂(S_t, w_t))
The Variance Problem
With importance sampling, if ρ is large (target much more likely than behavior), the trace can explode. If ρ is small (target unlikely), the trace quickly decays to zero.
This leads to high variance—some updates are huge, others are negligible.
Control Variates
Control variates reduce variance by adding terms that have zero expected value but are correlated with the high-variance terms. The key insight is that E_b[ρ_t] = 1, so any term of the form (1 - ρ_t) v̂(S_t, w) has zero expected value under the behavior policy.
By subtracting v̂(S_t, w) inside the importance-weighted term and adding it back via (1 - ρ_t) v̂(S_t, w), we can reduce variance while maintaining the same expected update:
G_t = ρ_t (R_{t+1} + γ G_{t+1}) + (1 - ρ_t) v̂(S_t, w)
Think of control variates as "centering" the updates. Instead of updating by ρ × (full term), we update by ρ × (term - baseline) + baseline. The baseline reduces variance because it's not multiplied by the potentially large ρ.
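A sketch of the importance-weighted trace update; the pi(a, s) and b(a, s) probability functions are assumed, and the optional ratio clipping is a common practical safeguard rather than part of the definition:

def off_policy_trace_update(z, grad_v, a, s, pi, b, gamma=0.99, lam=0.9,
                            rho_max=None):
    # Per-decision importance sampling ratio ρ = π(a|s) / b(a|s).
    rho = pi(a, s) / b(a, s)
    if rho_max is not None:
        rho = min(rho, rho_max)          # optional clipping to limit variance
    return rho * (gamma * lam * z + grad_v)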
Watkins's Q(λ) to Tree-Backup(λ)
There are several ways to extend Q-learning to use eligibility traces. Watkins's Q(λ) was the first attempt, but Tree-Backup(λ) provides a more principled solution that doesn't require importance sampling.
Q-learning is off-policy: we take exploratory actions but learn about the greedy policy. When using traces, we need to handle the mismatch between what we did (possibly exploratory) and what the greedy policy would have done.
Watkins's Q(λ)
The original Q(λ) by Watkins takes a simple approach: cut the trace whenever an exploratory (non-greedy) action is taken.
Problem with Watkins's Q(λ)
With ε-greedy exploration, exploratory actions happen with probability ε at each step. This means traces are cut frequently, limiting the effective trace length. With ε = 0.1, the average trace length before cutting is only 10 steps!
Peng's Q(λ)
Peng's Q(λ) attempts to fix this by not cutting traces, but it's not truly off-policy and doesn't converge to the optimal policy in general.
Tree-Backup(λ)
Tree-Backup(λ) provides a principled solution. Instead of cutting traces, it weights updates by the probability of taking actions under the target policy:
z_t = γλ π(A_t | S_t) z_{t-1} + ∇q̂(S_t, A_t, w_t)
The TD error also changes to account for the expected value over all actions:
δ_t = R_{t+1} + γ Σ_a π(a | S_{t+1}) q̂(S_{t+1}, a, w_t) - q̂(S_t, A_t, w_t)
Tree-Backup doesn't cut traces but gracefully decays them based on how likely the target policy was to take each action. Greedy actions (π(a|s) = 1) maintain the full trace. Exploratory actions (π(a|s) = 0 for non-greedy) still contribute but decay the trace for earlier states.
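A sketch of the two Tree-Backup(λ) quantities for linear action features; pi(a, s) is an assumed target-policy probability function and actions an assumed action list:

def tree_backup_trace(z, grad_q, a, s, pi, gamma=0.99, lam=0.9):
    # Decay the trace by the target policy's probability of the action actually taken.
    return gamma * lam * pi(a, s) * z + grad_q

def tree_backup_td_error(w, x, s, a, r, s_next, actions, pi, gamma=0.99):
    # Bootstrap on the expected action value under the target policy.
    expected_q = sum(pi(a2, s_next) * (w @ x(s_next, a2)) for a2 in actions)
    return r + gamma * expected_q - w @ x(s, a)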
Watkins's Q(λ)
- Binary trace cutting
- Limited effective trace length
- Simple implementation

Tree-Backup(λ)
- Smooth trace weighting
- Full use of traces
- No importance sampling needed
Stable Off-policy Methods with Traces
Combining off-policy learning, function approximation, and eligibility traces can lead to instability—the "deadly triad." This section presents algorithms designed to be stable even with all three elements.
The deadly triad (function approximation + bootstrapping + off-policy) is even more dangerous with eligibility traces. Traces can amplify instabilities by propagating errors across many steps. Stable algorithms are essential for practical use.
GTD(λ) - Gradient TD with Traces
GTD(λ) extends the stable Gradient-TD methods to use eligibility traces:
GTD(λ) Components
- w: Primary weights for value estimation
- v: Auxiliary weights for stability (estimates the expected TD error)
- z: Eligibility trace (with importance sampling for off-policy)
GQ(λ) - For Action Values
GQ(λ) is the action-value version, suitable for control problems:
HTD(λ) - Hybrid TD
HTD(λ) combines the benefits of GTD(λ) stability with TD(λ) efficiency when on-policy:
- GTD(λ): When you need guaranteed stability, don't mind the extra computation
- GQ(λ): For off-policy control problems with traces
- HTD(λ): When you want stability but expect mostly on-policy behavior
Emphatic TD(λ)
Emphatic TD(λ) uses emphasis to reweight updates, achieving stability through a different mechanism:
Emphatic TD(λ) allows you to specify which states you care about (interest) and automatically handles the off-policy correction. It's particularly useful when you have prior knowledge about which states matter most.
Implementation Issues
Implementing eligibility traces in practice requires attention to several details. This section covers common issues and best practices for efficient, correct implementations.
Sparse Traces
When using sparse features (like tile coding), most trace entries are zero. Don't store the full trace vector! Instead, keep track of only the non-zero entries and decay/remove them when they become negligible.
Sparse Trace Implementation
# Python pseudocode for sparse traces
trace = {}  # Dictionary: feature_index -> eligibility

def update_trace(active_features, gamma_lambda, threshold=1e-5):
    # Decay all existing traces
    for idx in list(trace.keys()):
        trace[idx] *= gamma_lambda
        if trace[idx] < threshold:
            del trace[idx]              # Remove negligible entries
    # Add/update active features
    for idx in active_features:
        trace[idx] = trace.get(idx, 0.0) + 1.0   # Accumulating trace
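A brief usage sketch showing how the sparse trace pairs with a sparse weight update; the weight vector, step size, TD error, and active feature indices below are illustrative placeholders:

# Example usage with the sparse trace above (illustrative values)
import numpy as np
w = np.zeros(4096)                  # one weight per tile/feature
alpha, delta = 0.1, 0.5             # step size and TD error for this step
active_features = [12, 87, 1024]    # tiles active in the current state

update_trace(active_features, gamma_lambda=0.99 * 0.9)
for idx, e in trace.items():
    w[idx] += alpha * delta * e     # update only features with non-zero eligibility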
Clearing Traces
When an episode ends, traces should be cleared. For continuing tasks, traces persist but naturally decay. There's also the question of whether to clear traces after exploratory actions:
Episodic Tasks
- Clear traces at episode start
- Or: let traces decay naturally
- Depends on whether episodes are related

Continuing Tasks
- Never explicitly clear traces
- Traces decay by γλ each step
- Natural forgetting of distant past
Numerical Stability
Several issues can arise in practice:
- Accumulating traces can grow large if the same state is visited many times (replacing traces avoid this)
- Importance sampling ratios can be very large, causing trace explosion
- Very small traces waste computation without contributing meaningfully
Practical Solutions
1. Trace clipping: Cap trace magnitude at some maximum value
2. Importance sampling clipping: Cap ρ at a maximum (e.g., 1 or 2)
3. Threshold pruning: Remove trace entries below a threshold
4. Use double precision: Float32 may not be enough for long traces
Memory Considerations
The eligibility trace requires the same memory as the weight vector. For large function approximators (like neural networks), this can be significant:
- A 10M parameter network needs 10M trace values (40MB for float32)
- GPU memory becomes a constraint for very large models
- Sparse representations can dramatically reduce memory
Eligibility traces are less commonly used in deep RL partly due to memory constraints. However, approximations like truncated backprop through time or n-step returns provide similar benefits with bounded memory.
Conclusions
Eligibility traces are one of the most important and unifying concepts in reinforcement learning. They elegantly bridge Monte Carlo and TD methods, enable efficient credit assignment, and remain an active area of research.
Key Takeaways
- The λ-return provides a principled way to blend n-step returns
- Eligibility traces implement λ-returns efficiently via the backward view
- True Online TD(λ) achieves optimal λ-return behavior with O(d) complexity
- Off-policy traces require careful handling (importance sampling, control variates)
- Stable algorithms (GTD(λ), GQ(λ), Emphatic TD(λ)) address the deadly triad
When to Use Traces
Good Use Cases
- Sparse rewards (need long-range credit)
- Linear function approximation
- When λ ≈ 0.9 outperforms λ = 0
- Tabular or small state spaces

Consider Alternatives
- Very large neural networks (memory)
- Dense, frequent rewards
- When n-step TD suffices
- Highly off-policy settings
The Unified View
Eligibility traces reveal that TD and Monte Carlo methods are not opposites but endpoints on a continuum. By adjusting λ, we can smoothly interpolate between one-step TD (λ = 0) and Monte Carlo (λ = 1), with intermediate values often performing best.
Start with λ = 0.9 as a reasonable default. If learning is unstable, try lower λ. If learning is slow despite long episodes, try higher λ. Always use True Online TD(λ) or Sarsa(λ) when possible—they're strictly better than the "approximate" versions.
Glossary
- λ-return (Lambda Return): A weighted average of all n-step returns, with weights (1-λ)λⁿ⁻¹. Provides a principled way to blend short and long-term targets.
- Eligibility Trace: A vector that tracks which states or state-action pairs are "eligible" for learning. Decays over time and accumulates with visits.
- Accumulating Trace: A trace that adds the feature vector on each visit: z ← γλz + x. Can grow without bound if states are revisited.
- Dutch Trace: A modified trace that corrects for feature overlap: z ← γλz + (1 - αγλz⊤x)x. Used in True Online TD(λ).
- Replacing Trace: A trace that resets to 1 (or the feature value) on each visit instead of accumulating. Prevents unbounded growth.
- Forward View: Computing the λ-return by looking forward in time at all future rewards and states. Requires episode completion.
- Backward View: Computing updates by looking backward at eligibility traces. Equivalent to the forward view but can be done online.
- True Online TD(λ): An algorithm that exactly matches the offline λ-return with O(d) online updates. Uses Dutch traces.
- Control Variate: A variance reduction technique that subtracts terms with zero expected value but non-zero correlation with the estimand.
- Tree-Backup(λ): An off-policy trace algorithm that weights traces by target policy probabilities instead of cutting them.
Formula Cheat Sheet
Core Formulas
λ-return: G_t^λ = (1-λ) Σ_{n=1}^∞ λⁿ⁻¹ G_{t:t+n}
Accumulating trace: z ← γλz + ∇v̂(S, w)
Dutch trace: z ← γλz + (1 - αγλz⊤x)x
TD(λ) weight update: w ← w + αδz, where δ = R + γv̂(S', w) - v̂(S, w)
Off-policy trace: z ← ρ(γλz + ∇v̂(S, w)), with ρ = π(A|S) / b(A|S)
Tree-Backup trace: z ← γλ π(A|S) z + ∇q̂(S, A, w)
Special Cases
λ = 0: Equivalent to TD(0); the trace contains only the current gradient
λ = 1: Equivalent to Monte Carlo (in episodic tasks)
γ = 0: Immediate rewards only; the trace resets each step
Linear features: ∇v̂(s, w) = x(s), which simplifies the trace update to z ← γλz + x
Common Mistakes
Not resetting traces at episode start
In episodic tasks, traces should be reset to zero at the beginning of each episode. Failing to do so can cause credit from one episode to leak into the next.
Using TD(λ) when True Online TD(λ) is available
Standard TD(λ) only approximates the λ-return. True Online TD(λ) achieves it exactly with the same computational cost. Always prefer True Online TD(λ).
Ignoring importance sampling in off-policy learning
Off-policy learning with traces requires importance sampling or algorithms like Tree-Backup(λ). Using standard traces off-policy leads to biased estimates.
Storing full traces with sparse features
With tile coding or other sparse representations, most trace entries are zero. Use a sparse data structure (dictionary, hash map) to save memory.
Using λ = 1 in continuing tasks
With λ = 1 and no episode boundaries, the λ-return becomes an infinite sum that may not converge. In continuing tasks, use λ < 1 or ensure γλ < 1.
Confusing trace decay γλ with just λ
The trace decays by γλ each step, not just λ. With γ = 0.99 and λ = 0.9, the effective decay is 0.891 per step, not 0.9.
For the Curious Mind
Why "Eligibility"?
The term "eligibility trace" comes from the idea that states become "eligible" for learning when they're visited. A recently visited state has high eligibility and receives more credit (or blame) when rewards arrive. The trace decays over time, reflecting our uncertainty about which past states were responsible for current outcomes.
Biological Inspiration
Eligibility traces have biological analogs. In neuroscience, synaptic eligibility traces are thought to mark synapses that were recently active, making them "eligible" for strengthening when a reward signal (like dopamine) arrives. This provides a biological mechanism for credit assignment across time delays.
Connection to Backpropagation Through Time
Eligibility traces are related to backpropagation through time (BPTT) in recurrent neural networks. Both propagate error signals backward through a sequence. BPTT does this exactly (but requires storing activations), while traces do it approximately (but with constant memory). This connection is closely related to ideas behind Real-Time Recurrent Learning, which also computes credit assignments forward in time as the sequence unfolds.
The Geometric Series Magic
The weights (1-λ)λⁿ⁻¹ form a geometric series that sums to 1. This is crucial—it means the λ-return is a proper weighted average, not an arbitrary combination.
Open Research Questions
Active research areas include: (1) How to best use traces with deep neural networks given memory constraints, (2) Adaptive λ selection that changes based on experience, (3) Combining traces with experience replay (which breaks temporal structure), and (4) Theoretical understanding of when traces help versus hurt performance.