On-policy Prediction with Approximation
Scale RL to large (even continuous) state spaces using function approximation. Learn to generalize from limited experience.
Problem: Tables don't work when you have millions of states.
Solution: Use a function v̂(s, w) with learnable weights w.
How: Gradient descent — take small steps to reduce error.
Options: Linear functions (simple, with convergence guarantees) or neural nets (powerful but finicky).
Key Win: Learning about one state helps us predict values for similar states we've never seen!
What if there are too many states to remember them all?
So far, we've stored a separate value for each state — a table. But what if you have millions of states? Or continuous states like positions and velocities? The solution: function approximation. Instead of memorizing every value, we learn a function (like a neural network) that can estimate values for ANY state — even ones we've never seen before!
Think of it Like This: GPS vs. Memorizing Every Road
Tabular RL is like memorizing the exact driving time from every address to every other address in your city. For a small town with 100 locations, that's 10,000 entries — doable! But for a big city with 1 million locations? That's 1 trillion entries. Impossible!
Function Approximation is like having a GPS that learned general rules: "highways are faster," "rush hour is slow," "shorter distance usually means less time." The GPS has never been to YOUR exact location before, but it can estimate your drive time using these learned patterns!
Key insight: Instead of memorizing every answer, we learn the patterns that let us guess good answers for new situations.
Generalization
Learn once, apply everywhere
Neural Networks
Deep RL foundations
Feature Engineering
Tile coding, RBFs, & more
The Big Picture: From Tables to Functions
Tabular Methods (Part I)
Store V(s) or Q(s,a) in a table. One entry per state (or state-action pair).
Problem: Doesn't scale to large/continuous spaces!
Function Approximation (Part II)
Learn v̂(s, w) where w is a vector of parameters (weights).
Benefit: Generalize to unseen states!
This chapter's focus: Prediction — estimating vπ(s) using function approximation. Same policy throughout (on-policy).
See the Difference: Table vs. Function
Tabular Method
Every state has its own box
25 states = 25 values to store
1 million states = 1 million values!
∞ states = impossible!
Function Approximation
Learn a formula with just a few numbers
Only 3 weights to learn!
3 weights work for ALL states
Even states we've never seen!
Works with continuous states too!
The Magic: Instead of storing millions of values, we learn a small set of weights that can compute the value for any state!
How Does the Learning Actually Work?
See State
Agent observes current state s
Make Guess
Compute v̂(s) using current weights
Get Feedback
See actual reward & next state
Find Error
Compare guess vs. reality
Adjust Weights
Nudge weights to reduce error
Repeat thousands of times → weights get better and better!
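Here is a minimal sketch of this loop in code, using a linear value estimate (features times weights) as the "guess". The `env`, `policy`, and `features` helpers are hypothetical stand-ins, not part of the text:

```python
import numpy as np

def predict_values(env, policy, features, num_features, num_steps,
                   alpha=0.1, gamma=0.9):
    w = np.zeros(num_features)                 # arbitrary starting weights
    s = env.reset()                            # 1. see state
    for _ in range(num_steps):
        x = features(s)                        # describe the state with numbers
        v_hat = w @ x                          # 2. make a guess with current weights
        s_next, r, done = env.step(policy(s))  # 3. get feedback (reward, next state)
        target = r + (0.0 if done else gamma * (w @ features(s_next)))
        error = target - v_hat                 # 4. compare guess vs. reality
        w += alpha * error * x                 # 5. nudge weights to reduce the error
        s = env.reset() if done else s_next
    return w
```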
Gradient Descent: Rolling Downhill
Finding the best weights is like finding the lowest point in a valley
The Error Landscape
[Diagram: error plotted against weight values. The curve has hills (high error, bad weights) and a valley bottom (low error, good weights), with a ball rolling down toward the bottom.]
Each point represents different weight values.
Height = how wrong our predictions are.
How We Find the Bottom
Start Somewhere
Random weights = probably high on a hill
Feel the Slope
Gradient tells us which way is "down"
Take a Small Step
Move weights slightly downhill (α controls step size)
Repeat!
Eventually reach the valley bottom = best weights
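To make "rolling downhill" concrete, here is a tiny sketch on a made-up one-dimensional error landscape, error(w) = (w − 3)²:

```python
def error(w):
    return (w - 3.0) ** 2        # made-up error landscape, lowest at w = 3

def slope(w):
    return 2.0 * (w - 3.0)       # the gradient: which way is uphill

w = -5.0                         # 1. start somewhere (high on a hill)
alpha = 0.1                      # step size
for _ in range(50):
    w -= alpha * slope(w)        # 2-3. feel the slope, step downhill
print(round(w, 3))               # ~3.0: the valley bottom (best weight)
```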
Key Terms You'll Need to Know
What is a "Weight" or "Parameter"?
A weight (written as w) is a number that our algorithm learns and adjusts. Think of it like a "dial" or "knob" that controls how the function behaves.
Example: If predicting house prices, one weight might control "how much does an extra bedroom add to price?" If w₁ = 50,000, then each bedroom adds $50,000 to our estimate.
What is a "Feature"?
A feature (written as x) is a number that describes something about a state. It's the "input" to our function that we use to make predictions.
Example: In a video game, features might be: distance to enemy (23 pixels), health remaining (75%), ammo count (12 bullets). These numbers describe the current state.
What is a "Gradient"?
The gradient (written as ∇) tells you which direction is "downhill" — the direction that reduces your error the fastest.
Analogy: Imagine you're blindfolded on a hilly landscape trying to find the lowest point. The gradient is like feeling which way the ground slopes under your feet — it points downhill!
What is a "Dot Product"?
A dot product (w⊤x) multiplies matching pairs and adds them up: w₁×x₁ + w₂×x₂ + ... = single number.
Example: w = [2, 3] and x = [4, 5]. Dot product = 2×4 + 3×5 = 8 + 15 = 23
Value-function Approximation
Why tables aren't enough, and what to do about it.
The curse of large state spaces!
Imagine trying to play backgammon. There are about 10²⁰ possible positions. Even if you visited a million states per second, it would take longer than the age of the universe to see them all once! We need a way to generalize — learn something useful from one state and apply it to similar states we've never seen.
Instead of storing a separate value for each state, we learn an approximate value function:
v̂(s, w) ≈ vπ(s)
Here s is the state, w is the weight (parameter) vector, and v̂ is our approximation of the true value function vπ.
Why Function Approximation?
Scalability
- • Handle millions or infinite states
- • Work with continuous state spaces
- • Memory grows with parameters, not states
Generalization
- • Learn from one state, apply to similar states
- • Make reasonable predictions for unseen states
- • This is the key power of approximation!
We can view each update as a training example for supervised learning:
Input: State s
Target: Some estimate of vπ(s) (like a return or TD target)
But there's a crucial difference from typical supervised learning: the target values are nonstationary (they change as we learn) and we don't have the luxury of multiple passes over a fixed training set.
For the Curious Mind
This is the Foundation of Deep RL!
When you hear about "Deep RL" or "DQN" — they're using neural networks as function approximators. The concepts in this chapter (especially gradient descent on value functions) are exactly what powers AlphaGo, OpenAI Five, and modern game-playing AI!
The Deadly Triad Warning
Combining function approximation + bootstrapping + off-policy learning can cause divergence — the weights explode to infinity! This chapter focuses on on-policy learning, which is safer. We'll address the challenges in later chapters.
The Prediction Objective (V̄E)
What exactly are we trying to minimize?
You can't be perfect everywhere!
With function approximation, we usually have fewer parameters than states. This means we can't get the value exactly right for every state — we have to make tradeoffs. Some states will have more error than others. The question is: which states should we prioritize?
Our objective is to minimize the weighted mean squared error between our approximation and the true value:
V̄E(w) = Σs μ(s) [vπ(s) − v̂(s, w)]²
Here μ(s) weights how much we care about the error at each state, vπ(s) is the true value, and v̂(s, w) is our approximation.
What is μ(s)?
The weighting μ(s) represents how much we care about accuracy at each state. A natural choice: the on-policy distribution — the fraction of time spent in each state while following policy π.
Intuition: We visit some states more often than others. It makes sense to be more accurate in states we actually encounter frequently!
Note: For continuing tasks, μ is the stationary distribution. For episodic tasks, it's the distribution of states encountered in episodes.
In general, we cannot find w that gives zero error everywhere. We must make tradeoffs:
- •Reducing error in one state might increase error in another
- •The objective V̄E tells us which tradeoffs to make
- •States with higher μ(s) get more attention
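A small sketch of how V̄E weighs errors by visitation, with made-up numbers for μ, vπ, and v̂:

```python
import numpy as np

mu    = np.array([0.70, 0.20, 0.10])   # on-policy distribution mu(s)
v_pi  = np.array([1.00, 2.00, 3.00])   # true values v_pi(s)
v_hat = np.array([1.10, 1.50, 0.00])   # our approximation v_hat(s, w)

ve_bar = np.sum(mu * (v_pi - v_hat) ** 2)
# 0.7*0.01 + 0.2*0.25 + 0.1*9.0 = 0.007 + 0.05 + 0.9 = 0.957
# The rarely visited third state has a huge error, but its small weight
# mu(s) = 0.1 limits how much it contributes to the objective.
print(ve_bar)
```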
For the Curious Mind
Why squared error and not absolute error?
Squared error is differentiable everywhere (important for gradient methods) and penalizes large errors more heavily. But other objectives like Huber loss are sometimes used in practice, especially when dealing with outliers!
Stochastic Gradient and Semi-gradient Methods
The core algorithm for learning with approximation.
Gradient descent: roll downhill!
Imagine the error V̄E as a landscape of hills and valleys, where each point represents different weight values. We want to find the lowest point (minimum error). Gradient descent works like rolling a ball downhill — we compute which direction is "down" (the gradient) and take a small step that way. Do this repeatedly and we reach a minimum!
Given a training example (state St with target value Ut), update weights by stochastic gradient descent:
wt+1 = wt + α [Ut − v̂(St, wt)] ∇v̂(St, wt)
Here α is the step size, Ut − v̂(St, wt) is the error in our guess, and ∇v̂(St, wt) is the gradient with respect to the weights.
Understanding the Gradient
The gradient ∇v̂(s, w) is a vector that tells us how sensitive the value estimate is to each weight. If we increase weight wi by a tiny amount, how much does v̂(s, w) change?
Intuition: The gradient points in the direction that most increases v̂. To decrease the error, we move in the direction of error × gradient.
When the target Ut = Gt (the actual return), we get gradient Monte Carlo:
wt+1 = wt + α [Gt − v̂(St, wt)] ∇v̂(St, wt)
This is a true gradient method — it converges to a local minimum of V̄E. The return Gt is an unbiased estimate of vπ(St).
Worked Example: One SGD Update Step-by-Step
Let's trace through exactly what happens in ONE update. Suppose we have a simple linear function with 2 weights.
Initial weights: w = [2, 1]
State features: x(s) = [3, 4]
Target (true return): G = 15
Step size: α = 0.1
Prediction: v̂(s) = 2×3 + 1×4 = 10. Error: 15 − 10 = 5. Gradient (linear case): ∇v̂ = x(s) = [3, 4]. Update: w ← [2, 1] + 0.1 × 5 × [3, 4] = [3.5, 3.0].
Result: New weights are [3.5, 3.0]. Now v̂(s) = 3.5×3 + 3.0×4 = 10.5 + 12 = 22.5. We overshot 15! That's okay — with a smaller step size, we'd be more cautious and eventually converge.
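A quick check of the arithmetic above in code, as a minimal NumPy sketch:

```python
import numpy as np

w, x, G, alpha = np.array([2.0, 1.0]), np.array([3.0, 4.0]), 15.0, 0.1

v_hat = w @ x                     # prediction: 2*3 + 1*4 = 10
error = G - v_hat                 # error: 15 - 10 = 5
w = w + alpha * error * x         # for a linear function, the gradient is just x
print(w)                          # [3.5, 3.0]
print(w @ x)                      # new prediction: 22.5 (we overshot the target)
```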
Semi-gradient TD(0)
When we use a bootstrapped target (like TD), we get a semi-gradient method:
wt+1 = wt + α [Rt+1 + γv̂(St+1, wt) − v̂(St, wt)] ∇v̂(St, wt)
Why "semi-gradient"?
The target Rt+1 + γv̂(St+1, w) also depends on w, but we ignore this dependence! We don't take the gradient through the target. This makes it not a true gradient of any objective function, but it works well in practice.
Worked Example: Semi-gradient TD(0) Update
Let's see how TD learning works step-by-step. The agent is in state S, takes an action, gets reward R, and arrives in state S'.
Current state features: x(S) = [1, 0]
Next state features: x(S') = [0, 1]
Reward received: R = 5
Discount factor: γ = 0.9, Step size: α = 0.1
With these features, v̂(S) = w₁ and v̂(S') = w₂, so the TD error is δ = 5 + 0.9·w₂ − w₁ and the update adds α × δ × x(S) = [0.1δ, 0] to the weights.
Notice: Only w₁ changed! That's because x(S) = [1, 0] — only feature 1 was "active" in state S. This is why linear methods with sparse features are efficient: we only update the weights for features that were present.
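The same update traced in code. The starting weights [2.0, 7.0] are an assumption made for illustration; the example above leaves them unspecified:

```python
import numpy as np

w      = np.array([2.0, 7.0])     # assumed starting weights (hypothetical)
x_s    = np.array([1.0, 0.0])     # features of current state S
x_next = np.array([0.0, 1.0])     # features of next state S'
R, gamma, alpha = 5.0, 0.9, 0.1

td_error = R + gamma * (w @ x_next) - (w @ x_s)   # 5 + 0.9*7 - 2 = 9.3
w = w + alpha * td_error * x_s                    # only feature 1 is active
print(w)                                          # [2.93, 7.0]: w2 untouched
```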
MC vs. TD with Approximation
Gradient MC
- • True gradient method
- • Converges to local minimum
- • Unbiased but high variance
- • Must wait for episode end
Semi-gradient TD
- • Not true gradient (semi)
- • Converges for linear (on-policy)
- • Biased but lower variance
- • Learns online, step-by-step
For the Curious Mind
The State Aggregation Special Case
State aggregation (grouping similar states together) is actually a special case of function approximation! Each group has one weight, and all states in that group share the same value. The gradient is 1 for the active group and 0 otherwise.
Linear Methods
The simplest and most well-understood approximation.
The value is a weighted sum of features!
In linear function approximation, we represent each state as a vector of features, and the value is just the dot product of features with weights. Simple, interpretable, and guaranteed to converge (with TD on-policy)!
The approximate value is a linear combination of features:
v̂(s, w) = w⊤x(s) = Σi wi xi(s)
Here x(s) is the feature vector describing state s and w is the weight vector we learn.
Visual: How Linear Value Function Works
Example with features x(s) = [3, 2, 1] and weights w = [2.0, 1.5, 0.5]. Dot product: 3×2.0 = 6.0, 2×1.5 = 3.0, 1×0.5 = 0.5, so the sum is 9.5 and v̂(s) = 9.5.
The value is simply: (feature₁ × weight₁) + (feature₂ × weight₂) + (feature₃ × weight₃)
Why Linear is Special
For linear approximation, the gradient is simply the feature vector: ∇v̂(s, w) = x(s).
This makes the update rule beautifully simple: wt+1 = wt + α δt x(St)
Where δt = Rt+1 + γv̂(St+1, w) - v̂(St, w) is the TD error. We simply add the TD error times each feature to the corresponding weight!
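Putting the pieces together, here is a minimal end-to-end sketch of linear semi-gradient TD(0) on a small random-walk task. The task, one-hot features, and episode count are illustrative choices, not from the text:

```python
import numpy as np

n_states, gamma, alpha = 5, 1.0, 0.05
w = np.zeros(n_states)                     # one weight per state (one-hot features)
x = np.eye(n_states)                       # x[s] is the feature vector of state s
rng = np.random.default_rng(0)

for _ in range(5000):                      # episodes
    s = n_states // 2                      # start in the middle
    while True:
        s_next = s + rng.choice([-1, 1])   # random walk: step left or right
        if s_next < 0:                     # fell off the left end: reward 0
            target = 0.0
        elif s_next >= n_states:           # fell off the right end: reward 1
            target = 1.0
        else:                              # otherwise bootstrap from the next state
            target = gamma * (w @ x[s_next])
        delta = target - w @ x[s]          # TD error
        w += alpha * delta * x[s]          # add delta times the active feature
        if s_next < 0 or s_next >= n_states:
            break
        s = s_next

print(np.round(w, 2))   # roughly [0.17, 0.33, 0.5, 0.67, 0.83], the true values
```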
For on-policy TD(0) with linear approximation, the weights converge to a unique solution, the "TD fixed point":
wTD = A⁻¹b, where A = E[xt(xt − γxt+1)⊤] and b = E[Rt+1xt]
with the expectations taken over the on-policy distribution. This TD fixed point is guaranteed to exist and be unique (under standard conditions).
Error Bound for Linear TD
The TD fixed point isn't optimal, but it's guaranteed to be close:
V̄E(wTD) ≤ (1/(1−γ)) minw V̄E(w)
The TD solution is at most 1/(1-γ) times worse than the best possible. For γ = 0.9, that's at most 10× the minimum error — not bad for getting updates online!
For the Curious Mind
Linear is Surprisingly Powerful!
With good feature engineering (like tile coding), linear methods can approximate very complex value functions. They're also interpretable — you can look at the weights and understand what features matter most! Many practical RL systems still use linear approximation for its simplicity and convergence guarantees.
Feature Construction for Linear Methods
How to represent states as feature vectors.
Features are the foundation!
Linear methods can only learn what we give them to work with. If we use bad features, no amount of learning will help. If we use good features, learning is easy! This section explores different ways to construct useful features.
Think of Features Like a Questionnaire
Imagine you want to predict someone's health. You could ask different types of questions:
Raw: "What's your age?" (just the number: 35)
Binned: "Are you 20-30? 30-40? 40-50?" (yes/no for each bin)
Polynomial: "Age?" + "Age²?" (captures non-linear effects)
Same state (a 35-year-old), different feature representations: raw → [35]; binned → [0, 1, 0] (only the 30-40 bin is on); polynomial → [35, 1225].
Different features let the linear model learn different patterns!
Visual Guide: Feature Construction Methods
- •Polynomial: x, x², x³, ... (smooth curves)
- •Fourier: sin and cos waves (periodic patterns)
- •Tile Coding: grid regions (local + fast)
- •RBF: Gaussian bumps (smooth and local)
Most popular for RL: Tile coding — simple, fast, and works great!
Polynomials
Use polynomial combinations of state variables: x, x², xy, x²y, etc.
Good for smooth functions, but number of features grows exponentially with dimension.
Fourier Basis
Use sine and cosine functions at different frequencies.
Great for periodic patterns. Can represent any function with enough terms.
Coarse Coding
Cover the state space with overlapping regions (receptive fields). Each feature indicates whether the state is in that region.
Key insight: The size and shape of regions controls generalization. Large regions → generalize broadly. Small regions → generalize locally.
Tile Coding (The Practical Workhorse)
Tile coding uses multiple overlapping grids (tilings). Each tiling partitions the state space into tiles. The feature vector indicates which tile is active in each tiling.
Example: With 8 tilings, each state activates exactly 8 features (one per tiling). The tilings are offset from each other, so nearby states share most tiles.
- ✓Constant memory: Fixed number of tilings × tiles per tiling
- ✓Fast updates: Only active tiles need updating
- ✓Good generalization: Nearby states share tiles
Pro tip: With n tilings, set α = α0/n so that the initial step size is α0 (since n features are active).
Worked Example: Tile Coding for Mountain Car
Imagine we're learning to drive a car up a mountain. The state has 2 dimensions: position (−1.2 to 0.5) and velocity (−0.07 to 0.07).
Setup: 2 Tilings, 4×4 tiles each
Tiling 1 (no offset)
State falls in tile 5
Tiling 2 (offset by half tile)
State falls in tile 10
Key Insight: Even though there are 32 weights (16 tiles × 2 tilings), only 2 are active at any time! Updates are fast: just add δ to 2 weights. And nearby states share tiles, so learning generalizes automatically.
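A minimal tile-coding sketch in this spirit: 2 tilings of 4×4 tiles over the Mountain Car ranges. The indexing scheme here is made up for illustration, so it won't reproduce the exact tile numbers (5 and 10) from the example:

```python
import numpy as np

LOW  = np.array([-1.2, -0.07])             # position and velocity lower bounds
HIGH = np.array([ 0.5,  0.07])             # position and velocity upper bounds
N_TILINGS, TILES_PER_DIM = 2, 4

def active_tiles(state):
    """Return the one active tile index per tiling for a (position, velocity) state."""
    scaled = (np.asarray(state) - LOW) / (HIGH - LOW)     # map state into [0, 1]^2
    indices = []
    for t in range(N_TILINGS):
        offset = t / (N_TILINGS * TILES_PER_DIM)          # tiling 2 shifted by half a tile
        coords = np.floor((scaled + offset) * TILES_PER_DIM).astype(int)
        coords = np.clip(coords, 0, TILES_PER_DIM - 1)
        tile = coords[0] * TILES_PER_DIM + coords[1]      # position within the 4x4 grid
        indices.append(t * TILES_PER_DIM ** 2 + tile)     # unique index across tilings
    return indices

w = np.zeros(N_TILINGS * TILES_PER_DIM ** 2)   # 32 weights total
tiles = active_tiles([-0.5, 0.01])             # exactly 2 active tiles for any state
v_hat = w[tiles].sum()                         # value = sum of the 2 active weights
```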
Radial Basis Functions (RBFs)
Instead of binary (0/1) features, RBFs give continuous values based on distance from centers:
xi(s) = exp(−‖s − ci‖² / (2σi²))
Each feature is a "bump" centered at ci with width σi. States near the center activate the feature strongly; distant states activate it weakly.
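A sketch of RBF features for a one-dimensional state, with made-up centers and width:

```python
import numpy as np

centers = np.linspace(0.0, 1.0, 5)   # c_i: 5 bump centers spread over [0, 1]
sigma = 0.15                          # width of each bump

def rbf_features(s):
    return np.exp(-((s - centers) ** 2) / (2 * sigma ** 2))

x = rbf_features(0.42)
print(np.round(x, 3))   # centers near 0.42 respond strongly; distant ones barely
```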
For the Curious Mind
The Kernel Trick Connection
RBFs are related to kernel methods in machine learning. The feature space implicitly defined by an RBF kernel is infinite-dimensional! This connects to the powerful ideas of kernel-based learning (covered later in this chapter).
Tile Coding in Practice
Rich Sutton provides free tile coding software at his website. It's been used successfully in robotics, game playing, and many other domains. Simple to implement and surprisingly effective!
Selecting Step-Size Parameters Manually
How to choose α for stable learning.
Finding the Goldilocks zone!
The step-size α is crucial. Too large → unstable learning, values oscillate wildly. Too small → painfully slow learning. We need to find the "just right" zone.
Visual: The Effect of Step Size (α)
α Too Small
Learns too slowly
α = 0.001
α Just Right ✓
Converges smoothly
α = 0.1
α Too Large
Oscillates / diverges!
α = 1.0
Rule of thumb: Start with α = 0.1 and adjust. If unstable, decrease. If too slow, increase.
For tile coding with n tilings, a good starting point is: α = 1/n
This ensures that one update brings the value exactly to the target (for the case where all n active features are new). In practice, you may need to tune from there.
General Guideline
For linear methods, the step-size should satisfy: α ≤ 1 / E[x⊤x]
where the expectation is over states sampled from the on-policy distribution. This is the expected squared norm of the feature vector.
- •If α is too large, values can oscillate wildly around the true values
- •With bootstrapping (TD), a too-large α can even make the weights diverge to infinity!
- •Start conservative and increase if learning is too slow
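A sketch of applying the guideline above: estimate E[x⊤x] from sample feature vectors (made up here) and derive a step size from it:

```python
import numpy as np

sample_features = np.array([          # made-up binary feature vectors,
    [1.0, 0.0, 1.0],                  # e.g. two active tiles per state
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
])
expected_sq_norm = np.mean(np.sum(sample_features ** 2, axis=1))   # E[x.x] = 2.0
alpha_max  = 1.0 / expected_sq_norm   # one update would land exactly on the target
alpha_safe = 0.1 / expected_sq_norm   # in practice, start well below that
print(alpha_max, alpha_safe)          # 0.5 0.05
```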
Nonlinear Function Approximation: Artificial Neural Networks
The power of deep learning for RL.
When linear isn't enough!
Linear methods require us to hand-craft good features. Neural networks can learn their own features! They can approximate almost any function, making them incredibly powerful for complex problems. This is the "deep" in Deep RL!
Think of it Like This: A Factory Assembly Line
Linear methods are like a single worker who looks at raw materials and directly estimates the final product's value. Simple, but they can only recognize obvious patterns.
Neural networks are like a factory with multiple assembly lines:
Raw input (pixels, positions) → first stage (finds edges) → second stage (finds shapes) → final output (value: 42.5)
Each layer transforms the input into something more useful. The network automatically learns what features matter — we don't have to design them by hand!
A neural network computes v̂(s, w) by passing the state through layers of neurons:
Input layer: Raw state features s
Hidden layers: Learn representations (features) automatically
Output layer: The value estimate v̂(s)
Each layer applies: y = σ(Wx + b) where σ is an activation function (ReLU, tanh, etc.)
Visual: What a Neural Network Looks Like
Input (raw state: position, velocity, ...) → Hidden 1 (learned features, automatic!) → Hidden 2 (abstract representations) → Output (a single value estimate)
Deep RL Revolution
Neural networks enable RL to work directly with raw inputs like images:
Advantages
- • Learn features automatically
- • Handle high-dimensional inputs
- • Universal function approximators
Challenges
- • No convergence guarantees
- • Sensitive to hyperparameters
- • Sample inefficient
The gradient ∇v̂(s, w) is computed using backpropagation:
- 1.Forward pass: Compute v̂(s) by passing s through the network
- 2.Compute error: δ = target - v̂(s)
- 3.Backward pass: Propagate gradients backward through layers
- 4.Update weights: w ← w + α δ ∇v̂(s)
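A minimal sketch of these steps with a small value network, assuming PyTorch is available. The architecture and hyperparameters are arbitrary. Minimizing the squared TD error with a frozen target is equivalent to the semi-gradient update (the factor of 2 is absorbed into the learning rate); note that the target is computed without gradients, which is exactly the "semi" part:

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(            # input: 4 raw state features (an assumption)
    nn.Linear(4, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),                 # output: a single value estimate
)
optimizer = torch.optim.SGD(value_net.parameters(), lr=1e-3)
gamma = 0.99

def td_update(s, r, s_next, done):
    """One semi-gradient TD(0) update from a single transition."""
    v = value_net(s)                                  # 1. forward pass: v_hat(s)
    with torch.no_grad():                             # freeze the bootstrapped target
        target = r + gamma * (0.0 if done else value_net(s_next))
    loss = ((target - v) ** 2).mean()                 # 2. squared TD error
    optimizer.zero_grad()
    loss.backward()                                   # 3. backward pass (through v only)
    optimizer.step()                                  # 4. update weights

td_update(torch.randn(4), 1.0, torch.randn(4), False)   # made-up transition
```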
For the Curious Mind
DQN: The Breakthrough
DeepMind's DQN (2015) combined Q-learning with deep neural networks to play Atari games directly from pixels! Key innovations: experience replay (storing past transitions) and target networks (freezing the target for stability).
Beyond Feedforward Networks
Modern Deep RL uses many architectures: CNNs for images, RNNs/Transformers for sequential data, attention mechanisms for variable-length inputs. The principles from this chapter apply to all of them!
Least-Squares TD
Finding the optimal weights directly.
Skip the iteration, solve directly!
Standard TD updates gradually. But for linear methods, we can compute the TD fixed point directly by solving a system of equations. This is Least-Squares TD (LSTD). It's more sample-efficient but also more computationally expensive.
LSTD computes the weight vector directly:
wt = Ât⁻¹ b̂t, where Ât = Σk xk(xk − γxk+1)⊤ + εI and b̂t = Σk Rk+1xk (sums over the transitions seen so far).
Here xt = x(St) and εI is a small regularization term to ensure invertibility.
LSTD Tradeoffs
Advantages
- • Makes maximum use of data
- • No step-size parameter α
- • Directly computes fixed point
- • Better sample efficiency
Disadvantages
- • O(d²) computation per step
- • O(d²) memory for the Â matrix
- • O(d³) for matrix inversion
- • Only for linear approximation
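A batch LSTD sketch following the definitions above: accumulate Â and b̂ over observed transitions, then solve the linear system. The transition format is an assumption for illustration:

```python
import numpy as np

def lstd(transitions, d, gamma=0.9, epsilon=1e-3):
    """transitions: iterable of (x, r, x_next, done) with feature vectors of length d."""
    A_hat = epsilon * np.eye(d)             # small ridge term ensures invertibility
    b_hat = np.zeros(d)
    for x, r, x_next, done in transitions:
        next_x = np.zeros(d) if done else x_next
        A_hat += np.outer(x, x - gamma * next_x)
        b_hat += r * x
    return np.linalg.solve(A_hat, b_hat)    # w = A_hat^{-1} b_hat (O(d^3), done once)
```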
For the Curious Mind
Incremental LSTD
We can maintain Â and b̂ incrementally and use the Sherman-Morrison formula to update Â⁻¹ directly, avoiding full matrix inversion. This gives O(d²) per step instead of O(d³), making LSTD practical for moderate-sized feature vectors.
Memory-based Function Approximation
Lazy learning: just remember and look up.
Why learn when you can remember?
Memory-based (or lazy) methods don't learn a global function. Instead, they store all training examples and, when asked about a new state, look up the most similar stored examples and combine their values.
Given a query state s, memory-based methods:
- 1.Find neighbors: Retrieve k stored examples closest to s
- 2.Combine values: Average (or weighted average) their stored values
Local vs. Global Learning
Global (Parametric)
Learn fixed parameters that work for all states. Every update affects the whole function.
Local (Memory-based)
Compute value on-the-fly from nearby examples. Each prediction is independent.
- ✓Naturally handles nonstationary problems (old data can be forgotten)
- ✓No training required — just store examples
- ✓Can represent complex, locally varying functions
- !Requires fast nearest-neighbor lookup for efficiency (kd-trees, etc.)
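A minimal sketch of a memory-based (k-nearest-neighbor) estimate with a few made-up stored examples; a real system would use a kd-tree for the lookup:

```python
import numpy as np

memory_states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
memory_values = np.array([1.0, 2.0, 3.0, 10.0])

def knn_value(query, k=2):
    dists = np.linalg.norm(memory_states - np.asarray(query), axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest stored examples
    return memory_values[nearest].mean()     # simple (unweighted) average of their values

print(knn_value([0.1, 0.1]))                 # averages the values of the two closest states
```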
Kernel-based Function Approximation
Elegant mathematics for similarity-based learning.
Measuring similarity between states!
A kernel k(s, s') measures how similar two states are. Similar states should have similar values! Kernel methods generalize memory-based approaches with elegant mathematical foundations.
Given stored examples (si, vi), the value of a query state s is:
v̂(s) = Σi k(s, si) vi
where k(s, si) is the kernel measuring similarity between s and si. Common choice: Gaussian kernel k(s, s') = exp(-||s-s'||²/2σ²).
The Kernel Trick
Kernels implicitly define a feature space — potentially infinite-dimensional! The kernel computes the inner product in this feature space without ever explicitly constructing the features. This gives the power of complex feature representations with the simplicity of linear methods.
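A sketch of the kernel-weighted estimate with a Gaussian kernel, reusing the same made-up stored examples as the memory-based sketch above:

```python
import numpy as np

stored_states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
stored_values = np.array([1.0, 2.0, 3.0, 10.0])

def gaussian_kernel(s, s_prime, sigma=0.5):
    return np.exp(-np.sum((s - s_prime) ** 2) / (2 * sigma ** 2))

def kernel_value(query, sigma=0.5):
    q = np.asarray(query)
    similarities = np.array([gaussian_kernel(q, s, sigma) for s in stored_states])
    return similarities @ stored_values      # sum_i k(s, s_i) * v_i

print(kernel_value([0.1, 0.1]))              # nearby examples dominate the estimate
```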
Interest and Emphasis
Focusing learning where it matters.
Not all states are equally important!
In some problems, we care more about accuracy in certain states. Interest It lets us specify how much we care about each state. Emphasis Mt propagates this importance back through time, so we learn accurate values for states that lead to important states.
Interest It ∈ [0, 1]: How much do we care about accuracy at state St?
Emphasis Mt: Accumulates interest over time: Mt = It + γ Mt−1 (for one-step methods)
The update is then weighted by Mt: wt+1 = wt + α Mt [Ut − v̂(St, wt)] ∇v̂(St, wt)
When to Use Interest
Interest is useful when:
- •Some states are more decision-relevant than others
- •You want to focus learning on a subset of the state space
- •Different parts of the state space have different value to you
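A sketch of an emphasis-weighted one-step update following the recursion above; the interest value passed in would come from a function you supply:

```python
import numpy as np

def emphatic_td_step(w, x, x_next, r, M_prev, interest,
                     alpha=0.1, gamma=0.9):
    M = interest + gamma * M_prev                    # M_t = I_t + gamma * M_{t-1}
    delta = r + gamma * (w @ x_next) - (w @ x)       # one-step TD error
    w = w + alpha * M * delta * x                    # update scaled by emphasis
    return w, M

w, M = np.zeros(2), 0.0
w, M = emphatic_td_step(w, np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                        r=5.0, M_prev=M, interest=1.0)
```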
Quick Visual Summary: The Big Ideas
Function Approximation
Use a function with learnable weights instead of a huge table
Gradient Descent
Roll downhill to find weights that minimize error
Features Matter
Good features (tile coding, RBF) make learning easy
Generalization
Learning from one state helps predict values for similar states
The Core Update (What Happens Every Step)
wnew = wold + α × error × ∇v̂
(new weights = current weights + step size × "how wrong?" × "which way?")
That's it! Every method in this chapter is just a variation of this simple idea.
Why This Actually Matters
This Is How Modern AI Works!
AlphaGo, ChatGPT, self-driving cars — all of them use function approximation. Neural networks (covered in 9.7) are the foundation of modern AI. When you hear "deep learning," it's just fancy function approximation with lots of layers! The gradient descent techniques in this chapter are literally how every major AI system is trained.
Real-World Applications
- →Video Games: DQN (2015) used neural nets to play Atari games from raw pixels — function approximation on 210×160×3 dimensional images!
- →Robotics: Robots learn smooth continuous movements using tile coding and neural nets — no lookup tables possible for infinite joint positions!
- →Recommendation Systems: Netflix, YouTube use learned value functions to predict which video you'll enjoy next.
Skills You'll Use
- ✓Gradient descent: The same technique trains ChatGPT. Understanding it here helps you understand all of deep learning!
- ✓Feature engineering: Choosing good representations is still crucial even with neural nets. This skill transfers to any ML job.
- ✓Convergence intuition: Understanding why some methods work and others diverge helps you debug real ML systems.
Bottom Line: Without function approximation, RL only works on tiny problems with a few hundred states. With function approximation, RL scales to solve problems with millions of states (like Go) or continuous states (like robot control). This chapter is the bridge from toy problems to real-world AI.
Summary
Key takeaways from Chapter 9.
Quick Reference Glossary
Function Approximation
Representing v̂(s,w) with fewer parameters than states
V̄E (Mean Squared Value Error)
The objective: weighted sum of squared errors
SGD (Stochastic Gradient Descent)
Update w in direction of negative gradient
Semi-gradient
Ignoring gradient through bootstrapped target
Linear Methods
v̂(s,w) = w⊤x(s) — simple, convergent
Tile Coding
Multiple overlapping grids for features
RBF (Radial Basis Functions)
Gaussian bumps as continuous features
LSTD (Least-Squares TD)
Directly compute TD fixed point: w = A⁻¹b
Formula Cheat Sheet
Mean Squared Value Error (Objective): V̄E(w) = Σs μ(s) [vπ(s) − v̂(s, w)]²
SGD Update (General): wt+1 = wt + α [Ut − v̂(St, wt)] ∇v̂(St, wt)
Semi-gradient TD(0) Update: wt+1 = wt + α [Rt+1 + γv̂(St+1, wt) − v̂(St, wt)] ∇v̂(St, wt)
Linear Value Function: v̂(s, w) = w⊤x(s)
TD Fixed Point Error Bound: V̄E(wTD) ≤ (1/(1−γ)) minw V̄E(w)
LSTD Solution: wt = Ât⁻¹ b̂t
Common Mistakes (Avoid These!)
Wrong: "Semi-gradient methods converge to the global optimum"
Right: Semi-gradient TD converges to the TD fixed point, which is not the minimum of V̄E. For linear methods, its error is at most 1/(1−γ) times the minimum V̄E.
Wrong: "More features always means better approximation"
Right: More features increases representational power but can lead to overfitting and slower learning. Need balance between expressiveness and generalization.
Wrong: "Neural networks always work better than linear methods"
Right: Neural networks are powerful but have no convergence guarantees with bootstrapping. Linear methods with good features are often competitive and more stable!
What We Learned in This Chapter
To handle large or continuous state spaces, we use function approximation instead of tables. This lets us generalize from limited experience!
- v̂(s,w) — Value is a function of parameters w
- ∇v̂ — Gradient tells us how to improve
- Linear methods — Simple, guaranteed convergence
- Tile coding/RBFs — Practical feature construction
- Neural networks — Learn features automatically (Deep RL!)
Function approximation enables RL to scale to large/continuous state spaces
The objective V̄E weights errors by how often states are visited (μ)
SGD updates weights in the direction that reduces error
Semi-gradient methods ignore the gradient through bootstrapped targets
Linear methods are simple, interpretable, and guaranteed to converge (on-policy)
The TD fixed point's error is within a factor of 1/(1-γ) of the minimum for linear approximation
Tile coding is a practical and effective feature construction method
Neural networks can learn features automatically but lack convergence guarantees
LSTD directly computes the solution but has O(d²) cost
The Key Insight
Generalization is the power and the peril of approximation. By using fewer parameters than states, we force the function to generalize — learning about one state automatically affects similar states. This lets us handle huge state spaces, but also means we can never get the values exactly right everywhere. The art is in choosing representations that generalize in the right way!
Looking Ahead
This chapter covered on-policy prediction — estimating vπ while following π. Next, we'll extend these ideas to on-policy control (learning optimal policies with approximation), off-policy learning (the deadly triad!), and eligibility traces for more efficient credit assignment.