Eligibility Traces
Unifying TD and Monte Carlo Methods
Eligibility traces provide one of the most important unifying mechanisms in reinforcement learning. They elegantly bridge the gap between TD(0) and Monte Carlo methods, allowing you to smoothly interpolate between the two extremes and often achieve better performance than either.
The λ-return
Before we can understand eligibility traces, we need to understand the λ-return—a clever way to combine all possible n-step returns into a single, weighted average. This is the theoretical foundation that makes TD(λ) so powerful.
In n-step TD, we had to choose how many steps to look ahead. But what if we could use ALL possible n-step returns, from 1-step all the way to the full Monte Carlo return? That's exactly what the λ-return does—it's a weighted combination of all n-step returns.
The Compound Return
The λ-return combines all n-step returns using a clever weighting scheme. Each n-step return gets weight (1-λ)λⁿ⁻¹, which forms a geometric series that sums to 1:
G_t^λ = (1 - λ) Σ_{n=1}^∞ λⁿ⁻¹ G_{t:t+n}
Breaking Down the Math
- G_t^λ: The λ-return starting at time t, our combined estimate
- (1-λ): Normalization factor ensuring the weights sum to 1
- λⁿ⁻¹: Weight for the n-step return, decaying exponentially with n
- G_{t:t+n}: The n-step return (rewards plus a bootstrapped value)
The λ Parameter
The parameter λ ∈ [0, 1] controls how we blend short and long-term returns:
λ = 0: Pure TD(0)
All weight goes to the 1-step return. Fast updates, high bias, low variance.
λ = 1: Monte Carlo
All weight goes to the complete return. No bias, high variance, needs episode end.
Think of λ as a "decay rate for trust." With high λ, you trust longer returns more (looking further into the actual future). With low λ, you trust your current value estimates more (bootstrapping sooner). Most practical algorithms use λ between 0.8 and 0.95.
Weight Distribution
Here's how the weights distribute across n-step returns for different λ values:
| n-step | λ=0 | λ=0.5 | λ=0.9 |
|---|---|---|---|
| 1-step | 100% | 50% | 10% |
| 2-step | 0% | 25% | 9% |
| 3-step | 0% | 12.5% | 8.1% |
| n ≥ 4 (remaining weight, toward MC) | 0% | 12.5% | ~72.9% |
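A quick way to check these numbers (a minimal Python sketch; the values are just the geometric weights from the formula above):

# Weight on the n-step return in the λ-return: (1 - λ) * λ**(n - 1)
lam = 0.9
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 4)]
print([round(w, 3) for w in weights])   # [0.1, 0.09, 0.081]
print(round(1 - sum(weights), 3))       # 0.729 -> weight remaining on n >= 4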
Termination Handling
In episodic tasks, the λ-return needs special handling when an episode ends at time T. After termination, all remaining weight goes to the actual Monte Carlo return:
G_t^λ = (1-λ) Σ_{n=1}^{T-t-1} λⁿ⁻¹ G_{t:t+n} + λ^{T-t-1} G_t
The λ-return represents the "forward view"—to compute it, you need to look forward in time to see all future rewards. This isn't practical for online learning. The magic of eligibility traces is that they provide an equivalent "backward view" that can be computed incrementally!
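To make the forward view concrete, here is a small sketch (not from the text) that computes G_t^λ for every step of a finished episode from its rewards and the current value estimates, using the equivalent recursion G_t^λ = R_{t+1} + γ[(1-λ)v̂(S_{t+1}) + λG_{t+1}^λ]:

# Forward-view λ-return for one finished episode (illustrative sketch).
# rewards[t] is R_{t+1}; values[t] is v̂(S_t); values[T] = 0 at the terminal state.
def lambda_returns(rewards, values, gamma, lam):
    T = len(rewards)
    G = [0.0] * T
    g = 0.0                       # λ-return from beyond the episode end
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
        G[t] = g
    return G                      # G[t] is G_t^λ

Note that the whole episode must be available before any G_t^λ can be computed, which is exactly why the forward view is not suited to online learning.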
TD(λ)
TD(λ) is the algorithm that makes the λ-return practical for online learning. Instead of looking forward to compute returns, it looks backward using eligibility traces—a memory of which states contributed to the current situation.
An eligibility trace is a temporary record of which states (or state-action pairs) have been visited recently. When a TD error occurs, credit or blame is assigned to all states in proportion to their eligibility—how recently and frequently they were visited.
The Eligibility Trace Vector
In function approximation, we maintain an eligibility trace vector z that has the same dimension as our weight vector w. It accumulates and decays over time:
z_t = γλ z_{t-1} + ∇v̂(S_t, w_t)
Understanding the Trace Update
- γλz_{t-1}: Decay the old trace by γλ (discount × trace parameter)
- ∇v̂(S_t, w_t): Add the gradient of the value function at the current state
- z_0 = 0: Traces start at zero at the beginning of each episode
The TD(λ) Weight Update
The weight update uses the TD error δ and distributes it across all features according to their eligibility:
δ_t = R_{t+1} + γ v̂(S_{t+1}, w_t) - v̂(S_t, w_t)
w_{t+1} = w_t + α δ_t z_t
The eligibility trace acts like a "bridge to the past." When you get a reward, the trace tells you which states led to this moment. States visited recently have high eligibility and get more credit. States visited long ago have low eligibility (due to decay) and get less credit. This is exactly what the λ-return does, but computed efficiently!
Semi-gradient TD(λ) Algorithm
Algorithm: Semi-gradient TD(λ)
Input: policy π, feature function x(s)
Parameters: step size α ∈ (0, 1], trace decay λ ∈ [0, 1]
Initialize: w arbitrarily, z = 0
For each episode:
    Initialize S (first state of episode)
    z ← 0                              # Reset eligibility trace
    For each step of episode:
        A ← action given by π for S
        Take action A, observe R, S'
        δ ← R + γv̂(S', w) - v̂(S, w)   # TD error
        z ← γλz + ∇v̂(S, w)             # Update trace
        w ← w + αδz                     # Update weights
        S ← S'
    until S is terminal

Linear Function Approximation
For linear function approximation, where v̂(s, w) = w⊤x(s), the gradient is simply the feature vector: ∇v̂(s, w) = x(s), so the trace update reduces to z ← γλz + x(S) (see the sketch after the list below).
There are different types of eligibility traces:
- Accumulating traces: Add 1 (or x) each visit, as shown above
- Replacing traces: Reset to 1 on visit, don't accumulate
- Dutch traces: A more sophisticated variant we'll see later
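As a concrete companion to the pseudocode above, here is a minimal NumPy sketch of semi-gradient TD(λ) with linear features and accumulating traces. The env.reset()/env.step() interface, the policy(s) function, and the feature function x(s) are assumptions for illustration:

import numpy as np

def semi_gradient_td_lambda(env, policy, x, d, num_episodes,
                            alpha=0.1, gamma=0.99, lam=0.9):
    # Linear value function: v̂(s, w) = w @ x(s), so the gradient is just x(s).
    w = np.zeros(d)
    for _ in range(num_episodes):
        s = env.reset()
        z = np.zeros(d)                     # eligibility trace, reset each episode
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)   # assumed to return (state, reward, done)
            v = w @ x(s)
            v_next = 0.0 if done else w @ x(s_next)
            delta = r + gamma * v_next - v  # TD error
            z = gamma * lam * z + x(s)      # accumulating trace
            w = w + alpha * delta * z       # credit all recently visited features
            s = s_next
    return w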
n-step Truncated λ-return Methods
The offline λ-return algorithm requires waiting until the end of the episode to make updates. The truncated λ-return provides a middle ground—it uses a horizon of n steps and can make updates before the episode ends.
Instead of summing to infinity (or episode end), the truncated λ-return only looks ahead n steps. This allows for online updates while still capturing multi-step benefits.
The Truncated λ-return
With horizon h = t + n, the truncated λ-return is
G_{t:t+n}^λ = (1-λ) Σ_{k=1}^{n-1} λᵏ⁻¹ G_{t:t+k} + λⁿ⁻¹ G_{t:t+n}
The key insight is that this becomes the full λ-return when the horizon extends to the episode end, but it can be computed with only n steps of lookahead.
Trade-offs
Advantages
- Can update during episode
- Bounded memory and computation
- Practical for continuing tasks
Disadvantages
- Updates delayed by n steps
- May miss long-range dependencies
- Extra complexity vs TD(λ)
Truncated TD(λ)
The update at time t uses information up to time t+n:
w_{t+n} = w_{t+n-1} + α [G_{t:t+n}^λ - v̂(S_t, w_{t+n-1})] ∇v̂(S_t, w_{t+n-1})
The horizon n controls the trade-off between update delay and accuracy. Larger n means more accurate λ-return approximation but longer delays. In practice, n values of 3-10 often work well. The key is that n should be at least as large as the typical reward delay.
Redoing Updates: Online λ-return Algorithm
The offline λ-return algorithm waits until the end of an episode to make all updates. But what if we could get the same result while updating online? The online λ-return algorithm achieves this by "redoing" updates as new information arrives.
Instead of making one update per state at the end, make updates online but redo earlier updates whenever you get new information. This produces exactly the same final weights as the offline algorithm, but with online updates.
The Online λ-return Algorithm
At each time step h, we maintain a sequence of weight vectors w₀ʰ, w₁ʰ, ..., wₕʰ that represent what the weights would be if the episode had ended at time h:
w_{t+1}ʰ = w_tʰ + α [G_{t:h}^λ - v̂(S_t, w_tʰ)] ∇v̂(S_t, w_tʰ),   0 ≤ t < h ≤ T
How Redoing Works
Think of it as maintaining a "sliding window" of weight histories:
1. At the new horizon h, start again from w₀ʰ, the weights the episode began with
2. Redo the updates for every earlier time step t < h, now using λ-returns truncated at horizon h
3. The final weight vector wₕʰ is used to act at the next step
The naive online λ-return algorithm redoes h updates at step h, so an episode of length T costs O(T²) updates in total, compared with O(T) for TD(λ). This makes it impractical for long episodes. Fortunately, True Online TD(λ) solves this problem elegantly.
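A rough sketch of the redoing scheme (illustrative only): truncated_return(t, h, w) stands in for computing G_{t:h}^λ with the current weights, and states/x are assumed to be available from the episode so far.

# Naive online λ-return: at each new horizon h, restart from the weights the
# episode began with and redo all earlier updates with returns truncated at h.
def online_lambda_return_step(w_start, alpha, h, states, x, truncated_return):
    w = w_start.copy()                       # w_0^h: weights from the episode start
    for t in range(h):                       # redo updates for t = 0, ..., h-1
        G = truncated_return(t, h, w)        # G_{t:h}^λ (assumed helper)
        w = w + alpha * (G - w @ x(states[t])) * x(states[t])
    return w                                 # this is w_h^h, used to act at step h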
Why Bother with Redoing?
The online λ-return algorithm is primarily of theoretical interest. It shows that:
- Online updates CAN produce the same result as offline λ-return
- It provides a bridge to understanding True Online TD(λ)
- It motivates the development of more efficient algorithms
True Online TD(λ)
True Online TD(λ) is one of the most important algorithms in this chapter. It achieves the same result as the online λ-return algorithm but with O(d) complexity per step (where d is the number of features)—the same as regular TD(λ)!
An efficient algorithm that produces exactly the same sequence of weight vectors as the online λ-return algorithm, but with linear-time updates. The key innovation is the Dutch trace, which modifies how eligibility traces accumulate.
The Dutch Trace
The magic of True Online TD(λ) lies in its modified eligibility trace:
z_t = γλ z_{t-1} + (1 - αγλ z_{t-1}⊤ x_t) x_t
What Makes It Different
Compare to the standard accumulating trace z = γλz + x:
- Standard: Just adds the feature vector x
- Dutch: Adds (1 - αγλz⊤x)x, a "corrected" feature vector
The correction term (1 - αγλz⊤x) ensures that repeated visits to similar states don't cause over-crediting.
Complete Algorithm
Algorithm: True Online TD(λ)
Input: policy π, linear features x(s) with d components
Parameters: α ∈ (0, 1], λ ∈ [0, 1]
Initialize: w ∈ ℝᵈ arbitrarily
For each episode:
    Initialize S, x ← x(S)
    z ← 0                  # Eligibility trace (d-dimensional)
    V_old ← 0              # Previous value estimate
    For each step of episode:
        A ← action given by π for S
        Take action A, observe R, S'
        x' ← x(S')
        V ← w⊤x            # Current value
        V' ← w⊤x'          # Next value
        δ ← R + γV' - V    # TD error
        # Dutch trace update
        z ← γλz + (1 - αγλz⊤x)x
        # Weight update with extra correction
        w ← w + α(δ + V - V_old)z - α(V - V_old)x
        V_old ← V'
        x ← x'
        S ← S'
    until S is terminal

Why "True Online"?
True Online
- Updates are online (every step)
- Result equals offline λ-return
- O(d) per step complexity
- Theoretically optimal

Regular TD(λ)
- Updates are online
- Result differs from λ-return
- O(d) per step complexity
- Good approximation
Use True Online TD(λ) when you want the theoretical guarantees of the λ-return with efficient online updates. It's particularly valuable in domains where the extra complexity is worth the improved convergence properties, like robotics or game AI.
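For reference, here is the inner-loop update from the pseudocode above written as a small NumPy step function; the signature and the terminal-state handling are illustrative choices, not part of the text:

import numpy as np

def true_online_td_step(w, z, v_old, x, x_next, r, done,
                        alpha=0.1, gamma=0.99, lam=0.9):
    # One step of True Online TD(λ) with linear features (Dutch trace).
    v = w @ x
    v_next = 0.0 if done else w @ x_next
    delta = r + gamma * v_next - v                                   # TD error
    z = gamma * lam * z + (1 - alpha * gamma * lam * (z @ x)) * x    # Dutch trace
    w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
    return w, z, v_next                                              # v_next becomes V_old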
Dutch Traces in Monte Carlo Learning
The Dutch trace idea can also improve Monte Carlo learning. When λ = 1, TD(λ) becomes equivalent to Monte Carlo, but the Dutch trace still provides benefits by preventing over-crediting when states are revisited within an episode.
In standard MC with function approximation, if you visit similar states multiple times in an episode, their feature vectors accumulate and can cause excessive updates. Dutch traces automatically correct for this overlap.
The Problem with Accumulating Traces
Consider visiting states with similar (or identical) feature vectors. With accumulating traces, the eligibility grows proportionally to visits. With Dutch traces, the overlap is automatically accounted for:
Example: Revisiting a State
Suppose x = [1, 0, 0], γ = λ = 1, and we visit this state on two consecutive steps with α = 0.1:
| Trace Type | After 1st visit | After 2nd visit |
|---|---|---|
| Accumulating | [1, 0, 0] | [2, 0, 0] |
| Dutch | [1, 0, 0] | [1.9, 0, 0] |
Dutch traces grow more slowly, preventing over-crediting for repeated visits.
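The numbers in the table can be reproduced with a few lines of NumPy (assuming, as above, γ = λ = 1 and two consecutive visits):

import numpy as np

x = np.array([1.0, 0.0, 0.0])
alpha, gamma_lam = 0.1, 1.0      # γλ = 1 here, as in the Monte Carlo (λ = 1) setting

z = np.zeros(3)
for visit in range(2):
    z = gamma_lam * z + (1 - alpha * gamma_lam * (z @ x)) * x
    print(z)                     # 1st visit: [1. 0. 0.], 2nd visit: [1.9 0. 0.]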
The name "Dutch trace" comes from Harm van Seijen, who developed this technique. It's also sometimes called the "correction trace" because it corrects for the overlap between successive feature vectors.
Sarsa(λ)
Sarsa(λ) extends eligibility traces to control problems. Instead of learning state values, we learn action values q(s, a), and the traces now track state-action pairs.
The action-value version of TD(λ). It uses eligibility traces to credit all recently visited state-action pairs when a TD error occurs, allowing faster learning of the action-value function.
The Sarsa(λ) Update
The key equations for Sarsa(λ) parallel those of TD(λ), but for action values:
δ_t = R_{t+1} + γ q̂(S_{t+1}, A_{t+1}, w) - q̂(S_t, A_t, w)
z_t = γλ z_{t-1} + ∇q̂(S_t, A_t, w)
w ← w + α δ_t z_t
Algorithm
Algorithm: Semi-gradient Sarsa(λ)
Input: feature function x(s, a)
Parameters: α ∈ (0, 1], λ ∈ [0, 1], ε > 0
Initialize: w ∈ ℝᵈ arbitrarily
For each episode:
    Initialize S
    A ← ε-greedy action for S based on q̂(S, ·, w)
    z ← 0
    For each step of episode:
        Take action A, observe R, S'
        if S' is terminal:
            δ ← R - q̂(S, A, w)
            w ← w + αδz
            break
        A' ← ε-greedy action for S' based on q̂(S', ·, w)
        δ ← R + γq̂(S', A', w) - q̂(S, A, w)
        z ← γλz + ∇q̂(S, A, w)
        w ← w + αδz
        S ← S', A ← A'

True Online Sarsa(λ)
Just as with TD(λ), there's a "true online" version of Sarsa(λ) that uses Dutch traces, with the state-action feature vector x = x(S, A) playing the role that x(S) plays in True Online TD(λ).
Sarsa(λ) often outperforms Sarsa(0) because it can assign credit to actions taken several steps before a reward. This is especially important in sparse reward problems where rewards come infrequently. Values of λ between 0.8 and 0.95 typically work well.
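A compact sketch of the same inner loop with linear action features and an ε-greedy policy; the feature function x(s, a), the action list, and the helper names are assumptions for illustration:

import numpy as np

def epsilon_greedy(w, x, s, actions, eps=0.1):
    # Pick a random action with probability eps, otherwise the greedy one.
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    q = [w @ x(s, a) for a in actions]
    return actions[int(np.argmax(q))]

def sarsa_lambda_step(w, z, s, a, r, s_next, a_next, done, x,
                      alpha=0.1, gamma=0.99, lam=0.9):
    q = w @ x(s, a)
    q_next = 0.0 if done else w @ x(s_next, a_next)
    delta = r + gamma * q_next - q
    z = gamma * lam * z + x(s, a)     # accumulating trace over state-action features
    w = w + alpha * delta * z
    return w, z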
Mountain Car Example
In the Mountain Car problem, the agent must build momentum to reach the goal. With Sarsa(0), learning is slow because only the final action before reaching the goal gets credit. With Sarsa(λ), all the actions that built momentum receive credit, leading to much faster learning.
Experiments show Sarsa(λ=0.9) can solve Mountain Car in ~200 episodes, while Sarsa(0) may take ~1000+ episodes.
Variable λ and γ
So far we've treated λ and γ as fixed constants. But in some cases, it's useful to let them vary based on state or time. This leads to more general algorithms with state-dependent bootstrapping and termination.
Instead of fixed λ and γ, we can use functions λ(s) and γ(s) that depend on the current state. This allows the algorithm to bootstrap more in some states and less in others, adapting to the structure of the problem.
Generalized λ-return
With state-dependent parameters, the λ-return is most naturally written recursively:
G_t^λ = R_{t+1} + γ(S_{t+1}) [ (1 - λ(S_{t+1})) v̂(S_{t+1}, w) + λ(S_{t+1}) G_{t+1}^λ ]
Understanding Variable Parameters
- λ(s): Controls bootstrapping; low λ bootstraps more, high λ relies more on actual returns
- γ(s): Controls termination; γ(s) = 0 acts like an episode end at s, γ(s) = 1 means no discounting there
Use Cases for Variable Parameters
Variable γ
Model continuing tasks with "soft" termination, handle safety-critical states differently, or implement options/temporally abstract actions.
Variable λ
Bootstrap more in well-understood states, use longer traces when value estimates are uncertain, adapt to state-dependent noise.
Modified Trace Update
With variable parameters, the eligibility trace update becomes:
z_t = γ(S_t) λ(S_t) z_{t-1} + ∇v̂(S_t, w_t)
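In code, the only change from the fixed-parameter case is that the decay factor is looked up per state; gamma_fn and lambda_fn below are illustrative stand-ins for γ(s) and λ(s):

def variable_trace_update(z, grad_v, s, gamma_fn, lambda_fn):
    # State-dependent decay: γ(S_t)λ(S_t) replaces the constant γλ.
    return gamma_fn(s) * lambda_fn(s) * z + grad_v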
Variable parameters add complexity but can significantly improve performance in some domains. They're most useful when you have domain knowledge about which states need more or less bootstrapping. Without such knowledge, fixed parameters often work well.
Off-policy Traces with Control Variates
Extending eligibility traces to off-policy learning is challenging. The behavior policy (what we actually do) differs from the target policy (what we want to learn about). Importance sampling with control variates provides a solution.
In off-policy learning, actions come from behavior policy b but we want to learn the value of target policy π. Naive importance sampling can have very high variance. Control variates reduce this variance by subtracting terms that have zero expected value.
Per-decision Importance Sampling
The importance sampling ratio for a single action is:
ρ_t = π(A_t | S_t) / b(A_t | S_t)
This ratio reweights each step to account for the policy difference.
Off-policy Eligibility Traces
The eligibility trace for off-policy learning incorporates importance sampling:
z_t = ρ_t (γλ z_{t-1} + ∇v̂(S_t, w_t))
The Variance Problem
With importance sampling, if ρ is large (target much more likely than behavior), the trace can explode. If ρ is small (target unlikely), the trace quickly decays to zero.
This leads to high variance—some updates are huge, others are negligible.
Control Variates
Control variates reduce variance by adding terms that have zero expected value but are correlated with the high-variance terms. The key insight is that E_b[ρ_t] = 1, so any term of the form (1 - ρ_t) v̂(S_t, w) has zero expected value under the behavior policy.
By subtracting v̂(S_t, w) inside the importance-weighted term and adding it back via (1 - ρ_t) v̂(S_t, w), we can reduce variance while maintaining the same expected update:
G_t = ρ_t (R_{t+1} + γ G_{t+1}) + (1 - ρ_t) v̂(S_t, w)
Think of control variates as "centering" the updates. Instead of updating by ρ × (full term), we update by ρ × (term - baseline) + baseline. The baseline reduces variance because it's not multiplied by the potentially large ρ.
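A sketch of the importance-weighted trace update; the pi(a, s) and b(a, s) probability functions are assumed, and the optional ratio clipping is a common practical safeguard rather than part of the definition:

def off_policy_trace_update(z, grad_v, a, s, pi, b, gamma=0.99, lam=0.9,
                            rho_max=None):
    # Per-decision importance sampling ratio ρ = π(a|s) / b(a|s).
    rho = pi(a, s) / b(a, s)
    if rho_max is not None:
        rho = min(rho, rho_max)          # optional clipping to limit variance
    return rho * (gamma * lam * z + grad_v)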
Watkins's Q(λ) to Tree-Backup(λ)
There are several ways to extend Q-learning to use eligibility traces. Watkins's Q(λ) was the first attempt, but Tree-Backup(λ) provides a more principled solution that doesn't require importance sampling.
Q-learning is off-policy: we take exploratory actions but learn about the greedy policy. When using traces, we need to handle the mismatch between what we did (possibly exploratory) and what the greedy policy would have done.
Watkins's Q(λ)
The original Q(λ) by Watkins takes a simple approach: cut the trace whenever an exploratory (non-greedy) action is taken.
Problem with Watkins's Q(λ)
With ε-greedy exploration, exploratory actions happen with probability ε at each step. This means traces are cut frequently, limiting the effective trace length. With ε = 0.1, the average trace length before cutting is only 10 steps!
Peng's Q(λ)
Peng's Q(λ) attempts to fix this by not cutting traces, but it's not truly off-policy and doesn't converge to the optimal policy in general.
Tree-Backup(λ)
Tree-Backup(λ) provides a principled solution. Instead of cutting traces, it weights updates by the probability of taking actions under the target policy:
z_t = γλ π(A_t | S_t) z_{t-1} + ∇q̂(S_t, A_t, w_t)
The TD error also changes to account for the expected value over all actions:
δ_t = R_{t+1} + γ Σ_a π(a | S_{t+1}) q̂(S_{t+1}, a, w_t) - q̂(S_t, A_t, w_t)
Tree-Backup doesn't cut traces but gracefully decays them based on how likely the target policy was to take each action. Greedy actions (π(a|s) = 1) maintain the full trace. Exploratory actions (π(a|s) = 0 for non-greedy) still contribute but decay the trace for earlier states.
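A sketch of the two Tree-Backup(λ) quantities for linear action features; pi(a, s) is an assumed target-policy probability function and actions an assumed action list:

def tree_backup_trace(z, grad_q, a, s, pi, gamma=0.99, lam=0.9):
    # Decay the trace by the target policy's probability of the action actually taken.
    return gamma * lam * pi(a, s) * z + grad_q

def tree_backup_td_error(w, x, s, a, r, s_next, actions, pi, gamma=0.99):
    # Bootstrap on the expected action value under the target policy.
    expected_q = sum(pi(a2, s_next) * (w @ x(s_next, a2)) for a2 in actions)
    return r + gamma * expected_q - w @ x(s, a)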
Watkins's Q(λ)
- Binary trace cutting
- Limited effective trace length
- Simple implementation

Tree-Backup(λ)
- Smooth trace weighting
- Full use of traces
- No importance sampling needed
Stable Off-policy Methods with Traces
Combining off-policy learning, function approximation, and eligibility traces can lead to instability—the "deadly triad." This section presents algorithms designed to be stable even with all three elements.
The deadly triad (function approximation + bootstrapping + off-policy) is even more dangerous with eligibility traces. Traces can amplify instabilities by propagating errors across many steps. Stable algorithms are essential for practical use.
GTD(λ) - Gradient TD with Traces
GTD(λ) extends the stable Gradient-TD methods to use eligibility traces:
GTD(λ) Components
- w: Primary weights for value estimation
- v: Auxiliary weights for stability (estimates the expected TD error)
- z: Eligibility trace (with importance sampling for off-policy)
GQ(λ) - For Action Values
GQ(λ) is the action-value version, suitable for control problems:
HTD(λ) - Hybrid TD
HTD(λ) combines the benefits of GTD(λ) stability with TD(λ) efficiency when on-policy:
- GTD(λ): When you need guaranteed stability, don't mind the extra computation
- GQ(λ): For off-policy control problems with traces
- HTD(λ): When you want stability but expect mostly on-policy behavior
Emphatic TD(λ)
Emphatic TD(λ) uses emphasis to reweight updates, achieving stability through a different mechanism:
Emphatic TD(λ) allows you to specify which states you care about (interest) and automatically handles the off-policy correction. It's particularly useful when you have prior knowledge about which states matter most.
Implementation Issues
Implementing eligibility traces in practice requires attention to several details. This section covers common issues and best practices for efficient, correct implementations.
Sparse Traces
When using sparse features (like tile coding), most trace entries are zero. Don't store the full trace vector! Instead, keep track of only the non-zero entries and decay/remove them when they become negligible.
Sparse Trace Implementation
# Python pseudocode for sparse traces
trace = {}  # Dictionary: feature_index -> eligibility

def update_trace(active_features, gamma_lambda, threshold=1e-5):
    # Decay all existing traces
    for idx in list(trace.keys()):
        trace[idx] *= gamma_lambda
        if trace[idx] < threshold:
            del trace[idx]              # Remove negligible entries
    # Add/update active features
    for idx in active_features:
        trace[idx] = trace.get(idx, 0.0) + 1.0   # Accumulating trace
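A brief usage sketch showing how the sparse trace pairs with a sparse weight update; the weight vector, step size, TD error, and active feature indices below are illustrative placeholders:

# Example usage with the sparse trace above (illustrative values)
import numpy as np
w = np.zeros(4096)                  # one weight per tile/feature
alpha, delta = 0.1, 0.5             # step size and TD error for this step
active_features = [12, 87, 1024]    # tiles active in the current state

update_trace(active_features, gamma_lambda=0.99 * 0.9)
for idx, e in trace.items():
    w[idx] += alpha * delta * e     # update only features with non-zero eligibility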
Clearing Traces
When an episode ends, traces should be cleared. For continuing tasks, traces persist but naturally decay. There's also the question of whether to clear traces after exploratory actions:
Episodic Tasks
- Clear traces at episode start
- Or: let traces decay naturally
- Depends on whether episodes are related

Continuing Tasks
- Never explicitly clear traces
- Traces decay by γλ each step
- Natural forgetting of distant past
Numerical Stability
Several issues can arise in practice:
- Accumulating traces can grow large if the same state is visited many times (replacing traces avoid this)
- Importance sampling ratios can be very large, causing trace explosion
- Very small traces waste computation without contributing meaningfully
Practical Solutions
1. Trace clipping: Cap trace magnitude at some maximum value
2. Importance sampling clipping: Cap ρ at a maximum (e.g., 1 or 2)
3. Threshold pruning: Remove trace entries below a threshold
4. Use double precision: Float32 may not be enough for long traces
Memory Considerations
The eligibility trace requires the same memory as the weight vector. For large function approximators (like neural networks), this can be significant:
- A 10M parameter network needs 10M trace values (40MB for float32)
- GPU memory becomes a constraint for very large models
- Sparse representations can dramatically reduce memory
Eligibility traces are less commonly used in deep RL partly due to memory constraints. However, approximations like truncated backprop through time or n-step returns provide similar benefits with bounded memory.
Conclusions
Eligibility traces are one of the most important and unifying concepts in reinforcement learning. They elegantly bridge Monte Carlo and TD methods, enable efficient credit assignment, and remain an active area of research.
Key Takeaways
- The λ-return provides a principled way to blend n-step returns
- Eligibility traces implement λ-returns efficiently via the backward view
- True Online TD(λ) achieves optimal λ-return behavior with O(d) complexity
- Off-policy traces require careful handling (importance sampling, control variates)
- Stable algorithms (GTD(λ), GQ(λ), Emphatic TD(λ)) address the deadly triad
When to Use Traces
Good Use Cases
- Sparse rewards (need long-range credit)
- Linear function approximation
- When λ ≈ 0.9 outperforms λ = 0
- Tabular or small state spaces

Consider Alternatives
- Very large neural networks (memory)
- Dense, frequent rewards
- When n-step TD suffices
- Highly off-policy settings
The Unified View
Eligibility traces reveal that TD and Monte Carlo methods are not opposites but endpoints on a continuum. By adjusting λ, we can smoothly interpolate between one-step TD (λ = 0) and Monte Carlo (λ = 1), with intermediate values often performing best.
Start with λ = 0.9 as a reasonable default. If learning is unstable, try lower λ. If learning is slow despite long episodes, try higher λ. Always use True Online TD(λ) or Sarsa(λ) when possible—they're strictly better than the "approximate" versions.
Glossary
- λ-return (Lambda Return): A weighted average of all n-step returns, with weights (1-λ)λⁿ⁻¹. Provides a principled way to blend short and long-term targets.
- Eligibility Trace: A vector that tracks which states or state-action pairs are "eligible" for learning. Decays over time and accumulates with visits.
- Accumulating Trace: A trace that adds the feature vector on each visit: z ← γλz + x. Can grow without bound if states are revisited.
- Dutch Trace: A modified trace that corrects for feature overlap: z ← γλz + (1 - αγλz⊤x)x. Used in True Online TD(λ).
- Replacing Trace: A trace that resets to 1 (or the feature value) on each visit instead of accumulating. Prevents unbounded growth.
- Forward View: Computing the λ-return by looking forward in time at all future rewards and states. Requires episode completion.
- Backward View: Computing updates by looking backward at eligibility traces. Equivalent to the forward view but can be done online.
- True Online TD(λ): An algorithm that exactly matches the offline λ-return with O(d) online updates. Uses Dutch traces.
- Control Variate: A variance reduction technique that subtracts terms with zero expected value but non-zero correlation with the estimand.
- Tree-Backup(λ): An off-policy trace algorithm that weights traces by target policy probabilities instead of cutting them.
Formula Cheat Sheet
Core Formulas
λ-return: G_t^λ = (1-λ) Σ_{n=1}^∞ λⁿ⁻¹ G_{t:t+n}
Accumulating trace: z ← γλz + ∇v̂(S, w)
Dutch trace: z ← γλz + (1 - αγλz⊤x)x
TD(λ) weight update: w ← w + αδz, where δ = R + γv̂(S', w) - v̂(S, w)
Off-policy trace: z ← ρ(γλz + ∇v̂(S, w)), with ρ = π(A|S) / b(A|S)
Tree-Backup trace: z ← γλ π(A|S) z + ∇q̂(S, A, w)
Special Cases
λ = 0: Equivalent to TD(0); the trace contains only the current gradient
λ = 1: Equivalent to Monte Carlo (in episodic tasks)
γ = 0: Immediate rewards only; the trace resets each step
Linear features: ∇v̂(s, w) = x(s), which simplifies the trace update to z ← γλz + x
Common Mistakes
Not resetting traces at episode start
In episodic tasks, traces should be reset to zero at the beginning of each episode. Failing to do so can cause credit from one episode to leak into the next.
Using TD(λ) when True Online TD(λ) is available
Standard TD(λ) only approximates the λ-return. True Online TD(λ) achieves it exactly with the same computational cost. Always prefer True Online TD(λ).
Ignoring importance sampling in off-policy learning
Off-policy learning with traces requires importance sampling or algorithms like Tree-Backup(λ). Using standard traces off-policy leads to biased estimates.
Storing full traces with sparse features
With tile coding or other sparse representations, most trace entries are zero. Use a sparse data structure (dictionary, hash map) to save memory.
Using λ = 1 in continuing tasks
With λ = 1 and no episode boundaries, the λ-return becomes an infinite sum that may not converge. In continuing tasks, use λ < 1 or ensure γλ < 1.
Confusing trace decay γλ with just λ
The trace decays by γλ each step, not just λ. With γ = 0.99 and λ = 0.9, the effective decay is 0.891 per step, not 0.9.
For the Curious Mind
Why "Eligibility"?
The term "eligibility trace" comes from the idea that states become "eligible" for learning when they're visited. A recently visited state has high eligibility and receives more credit (or blame) when rewards arrive. The trace decays over time, reflecting our uncertainty about which past states were responsible for current outcomes.
Biological Inspiration
Eligibility traces have biological analogs. In neuroscience, synaptic eligibility traces are thought to mark synapses that were recently active, making them "eligible" for strengthening when a reward signal (like dopamine) arrives. This provides a biological mechanism for credit assignment across time delays.
Connection to Backpropagation Through Time
Eligibility traces are related to backpropagation through time (BPTT) in recurrent neural networks. Both propagate error signals backward through a sequence. BPTT does this exactly (but requires storing activations), while traces do it approximately (but with constant memory). This connection is closely related to ideas behind Real-Time Recurrent Learning, which also computes credit assignments forward in time as the sequence unfolds.
The Geometric Series Magic
The weights (1-λ)λⁿ⁻¹ form a geometric series that sums to 1. This is crucial—it means the λ-return is a proper weighted average, not an arbitrary combination.
Open Research Questions
Active research areas include: (1) How to best use traces with deep neural networks given memory constraints, (2) Adaptive λ selection that changes based on experience, (3) Combining traces with experience replay (which breaks temporal structure), and (4) Theoretical understanding of when traces help versus hurt performance.