Chapter 15

Neuroscience

RL and the Brain

Discover one of the most celebrated findings in computational neuroscience: dopamine neurons in the brain encode temporal-difference prediction errors. This chapter reveals how RL provides a computational framework for understanding the brain's reward system.

A Landmark Discovery

In 1997, Schultz, Dayan, and Montague made a remarkable discovery: the firing patterns of dopamine neurons in monkey brains closely match the temporal-difference (TD) error signal from RL theory. This finding established a direct link between computational learning algorithms and the biological mechanisms of reward learning in the brain.

Section 15.1

Neuroscience Basics

Before diving into the connections between RL and the brain, we need some basic neuroscience vocabulary. Don't worry—we'll keep it simple and focus only on what's needed to understand the reward system.

Neurons: The Brain's Computing Units

The brain contains roughly 86 billion neurons—specialized cells that process and transmit information through electrical and chemical signals.

Neuron Anatomy

Dendrites

Tree-like branches that receive signals from other neurons

Cell Body (Soma)

Integrates incoming signals; contains the nucleus

Axon

Long fiber that transmits signals to other neurons

Synapses

Junctions where neurons communicate via neurotransmitters

Action Potentials (Spikes)

Neurons communicate through brief electrical pulses called action potentials or "spikes." When enough input signals arrive, the neuron "fires"—sending a spike down its axon to communicate with other neurons. Information is encoded in the rate and timing of these spikes.

Synaptic Transmission

When a spike reaches the end of an axon, it triggers the release of chemical neurotransmitters into the synaptic cleft (gap between neurons). These chemicals bind to receptors on the receiving neuron, influencing whether it will fire.

Excitatory Synapses

Make the receiving neuron MORE likely to fire. Main neurotransmitter: Glutamate

Inhibitory Synapses

Make the receiving neuron LESS likely to fire. Main neurotransmitter: GABA

Neuromodulators: Broadcasting Signals

Beyond fast synaptic transmission, the brain uses neuromodulators—chemicals that diffuse more broadly to affect many neurons simultaneously. The most important for our story is dopamine.

Key Neuromodulators

Dopamine: Reward, motivation, learning (our main focus)
Serotonin: Mood, impulse control, possibly punishment signals
Norepinephrine: Arousal, attention, possibly uncertainty signals
Acetylcholine: Attention, memory formation

Why Neuromodulation Matters for RL

Neuromodulators can broadcast a single signal (like reward prediction error) to many brain regions simultaneously. This is exactly what's needed for a learning signal—it must reach all the synapses that need to be updated!

Section 15.2

Reward Signals, Reinforcement Signals, Values, and Prediction Errors

To understand how RL concepts map onto the brain, we must carefully distinguish between different types of signals. The terminology can be confusing because "reward" means different things in different contexts.

Types of Signals

Reward Signal

The immediate hedonic value or "goodness" of an outcome. In RL, this is R_t.

Example: The pleasantness of eating food, the pain of a burn

Reinforcement Signal

The signal that drives synaptic plasticity (learning). NOT the same as reward!

Key insight: The reinforcement signal should be the prediction error, not raw reward

Value Signal

Prediction of future cumulative reward. In RL, this is V(s) or Q(s,a).

Example: The expected long-term pleasure from visiting a favorite restaurant

Prediction Error Signal

The difference between what was expected and what actually happened.

\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

This is the TD error—and dopamine neurons seem to encode exactly this!
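
To make the arithmetic concrete, here is a minimal sketch in Python of how a TD error would be computed for a single transition; the reward and value numbers are invented for illustration.

```python
def td_error(reward, value_next, value_current, gamma=0.9):
    """Temporal-difference error for one transition:
    delta = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)."""
    return reward + gamma * value_next - value_current

# Hypothetical numbers: a reward of 1.0 arrives in a state that was only
# weakly predicted (V(S_t) = 0.2), and the next state predicts nothing further.
delta = td_error(reward=1.0, value_next=0.0, value_current=0.2)
print(delta)  # 0.8 -- a positive, "better than expected" error
```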

A Crucial Distinction

Many early theories assumed reward itself was the learning signal. But the blocking effect (Chapter 14) showed this can't be right—animals don't learn when reward is fully predicted. The brain must compute something like a prediction error.

Why Prediction Error?

Think about what makes an effective learning signal:

Bad: Raw Reward

"I got food!" But if you expected food, there's nothing new to learn. Strengthening already-strong associations wastes resources.

Good: Prediction Error

"I got MORE food than expected!" This is informative—update your predictions. Zero error means predictions are accurate, no update needed.

This is why TD learning works so well, and why evolution seems to have implemented something similar in the brain's reward system.

Section 15.3

The Reward Prediction Error Hypothesis

The Reward Prediction Error (RPE) Hypothesis makes a bold claim: the phasic activity of dopamine neurons encodes a reward prediction error signal that serves as a teaching signal for reward-based learning.

The RPE Hypothesis

Dopamine neurons signal the difference between received and predicted reward:

\text{Dopamine Response} \approx R_{\text{actual}} - R_{\text{predicted}}

Positive error (unexpected reward) → Dopamine burst

Zero error (predicted reward) → Baseline firing

Negative error (omitted reward) → Dopamine pause
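
As a toy illustration of this mapping, the sketch below turns a prediction error into a burst, baseline, or pause; the baseline and gain numbers are invented, not measured firing rates.

```python
def dopamine_response(delta, baseline_hz=4.0, gain=20.0):
    """Toy mapping from prediction error to a phasic firing rate:
    positive delta -> burst above baseline, zero -> baseline,
    negative -> pause (rates cannot fall below zero)."""
    return max(0.0, baseline_hz + gain * delta)

print(dopamine_response(+0.8))   # burst, well above baseline
print(dopamine_response(0.0))    # baseline firing
print(dopamine_response(-0.5))   # pause, floored at zero
```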

Historical Development

The hypothesis emerged from two converging lines of research:

Computational Theory (1980s-90s)

Sutton and Barto developed TD learning and showed prediction errors drive efficient learning. The TD model of classical conditioning successfully explained psychological phenomena like blocking.

Neurophysiology (1980s-90s)

Wolfram Schultz recorded from dopamine neurons in monkeys during conditioning tasks, discovering their responses had strange properties that didn't fit simple "reward neuron" interpretations.

The Convergence

In 1996, Montague, Dayan, and Sejnowski proposed the TD framework for dopamine, and in 1997 Schultz, Dayan, and Montague showed that Schultz's dopamine data matched TD error predictions remarkably well. These papers helped launch computational neuroscience into the mainstream!

Why It Matters

The RPE hypothesis is significant because:

  • It provides a computational interpretation of dopamine function
  • It explains many otherwise puzzling features of dopamine neuron activity
  • It connects RL algorithms to biological learning mechanisms
  • It makes testable predictions about neural activity
  • It suggests evolution discovered TD learning

Section 15.4

Dopamine

Dopamine is a neuromodulator produced by neurons in two small midbrain structures: the substantia nigra pars compacta (SNc) and the ventral tegmental area (VTA). Despite these regions being tiny, their axons project throughout the brain.

Dopamine Pathways

Mesolimbic Pathway (VTA → Nucleus Accumbens)

Associated with reward, motivation, and reinforcement learning

Nigrostriatal Pathway (SNc → Dorsal Striatum)

Associated with motor control and habit formation

Mesocortical Pathway (VTA → Prefrontal Cortex)

Associated with cognition, working memory, and goal-directed behavior

Two Modes of Dopamine Activity

Tonic Activity

Steady, low-frequency background firing (~3-5 Hz). Maintains baseline dopamine levels.

May relate to average reward rate, motivation, or general responsiveness.

Phasic Activity

Brief bursts or pauses in response to events. This is what encodes prediction errors!

Bursts: 15-30 Hz for ~200ms. Pauses: Near-complete suppression for ~200ms.

Phasic Dopamine Response

A transient change in dopamine neuron firing that lasts about 100-500 milliseconds. Bursts (increased firing) occur in response to better-than-expected outcomes; pauses (decreased firing) occur in response to worse-than-expected outcomes. This phasic activity is what corresponds to TD error.

Dopamine and Synaptic Plasticity

How does dopamine actually cause learning? Through effects on synaptic plasticity:

  • Dopamine modulates Long-Term Potentiation (LTP)—strengthening synapses
  • Dopamine modulates Long-Term Depression (LTD)—weakening synapses
  • The direction depends on timing and dopamine receptor types
  • D1 receptors generally promote LTP; D2 receptors may promote LTD

This means dopamine bursts can strengthen synapses that were recently active (reinforcing the actions/associations that led to reward), while pauses can weaken them.

Section 15.5

Experimental Support for the RPE Hypothesis

The RPE hypothesis has been tested extensively in both monkeys and rodents. The evidence is remarkably consistent with TD-like prediction errors.

Schultz's Classic Experiments

Wolfram Schultz trained monkeys on classical conditioning tasks while recording from dopamine neurons. Here's what he found:

The Three Key Findings

1
Unexpected Reward

When reward comes unexpectedly, dopamine neurons fire a burst of spikes.

δ = R - 0 = positive → burst (matches TD prediction!)

2
Fully Predicted Reward

After learning, dopamine neurons NO LONGER respond to reward delivery! Instead, they respond to the predictive cue (CS).

At reward: δ = R - R = 0 → no response. At CS: δ = V(CS) - 0 = positive → burst

3
Omitted Reward

When expected reward is omitted, dopamine neurons pause (decrease firing) at the time reward was expected.

δ = 0 - R = negative → pause (matches negative TD error!)

Response Shift Over Learning

The most striking finding is how dopamine responses shift during learning. This is exactly what TD learning predicts:

Timeline of Learning

Early

Strong response to reward, no response to cue

Middle

Response emerges to cue, reward response diminishes

Late

Strong response to cue only, no response to predicted reward

Why the Response Shifts

In TD learning, the prediction error occurs at the point of "surprise"—the earliest moment when new information arrives. Initially, reward is surprising. After learning, the cue predicts everything, so the cue timing is when the surprise happens. Value literally "backs up" from reward to cue!
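
This shift can be reproduced with a few lines of tabular TD(0) on a toy trial: a cue whose onset is unpredictable, followed a few time steps later by reward, with each within-trial time step treated as its own state (a simplified serial-compound representation). All parameters below are illustrative.

```python
import numpy as np

T = 5                    # within-trial time steps: cue at t=0, reward after t=T-1
gamma, alpha = 1.0, 0.1  # no discounting within the short trial
V = np.zeros(T)          # one value per within-trial time step

for trial in range(300):
    # Error at cue onset: transition from a zero-value pre-cue state into t=0
    # (the pre-cue state predicts nothing because cue timing is unpredictable).
    cue_delta = gamma * V[0] - 0.0
    deltas = np.zeros(T)
    for t in range(T):
        r = 1.0 if t == T - 1 else 0.0
        v_next = V[t + 1] if t + 1 < T else 0.0
        deltas[t] = r + gamma * v_next - V[t]
        V[t] += alpha * deltas[t]
    if trial in (0, 20, 299):
        print(f"trial {trial:3d}  cue error {cue_delta:.2f}  "
              f"reward-time error {deltas[-1]:.2f}")

# Early trials: a large error at reward time and none at the cue.
# Late trials: the error has migrated to the cue and vanished at the reward.
```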

Section 15.6

TD Error/Dopamine Correspondence

The correspondence between TD error and dopamine is not just qualitative—quantitative analyses show remarkably precise alignment.

Quantitative Matches

Key Correspondences

Reward magnitude scales response

Larger rewards → larger dopamine bursts, matching δ = R - V

Probability affects response

Less probable rewards → larger responses (higher surprise)

Timing is precise

Pauses occur at exact expected reward time when omitted

Secondary conditioning

Response transfers to earlier predictive cues

The TD(0) Formula in the Brain

The dopamine signal appears to encode something very close to:

\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

Let's see how this maps to neural activity:

R_{t+1}: Sensory reward signals (taste, etc.)
V(S_{t+1}): Cortical/striatal value representations
V(S_t): Previous value prediction (baseline expectation)
\delta_t: Phasic dopamine response!

Not Just Metaphor

This isn't just a loose analogy. Formal model fitting shows that TD models with appropriate parameters can predict dopamine responses trial-by-trial, often capturing ~60-80% of the variance in firing rates. The brain really seems to compute something like TD error.

Caveats and Extensions

The basic correspondence is solid, but reality is richer:

  • Dopamine neurons show some heterogeneity—not all encode pure RPE
  • Some neurons respond to novelty, salience, or aversive stimuli
  • The discount factor γ may vary across brain regions
  • Tonic dopamine may carry additional information (average reward?)

These complexities are active areas of research, but don't diminish the core finding—dopamine as RPE is one of neuroscience's most successful computational theories.

Section 15.7

Neural Actor-Critic

The actor-critic architecture (Chapter 13) maps beautifully onto brain anatomy. The brain appears to implement separate systems for the "critic" (value estimation) and "actor" (action selection).

Actor-Critic in the Brain

The Critic

Evaluates states/actions and computes TD error

Neural substrates: Ventral striatum, orbitofrontal cortex, with dopamine as the error signal

The Actor

Selects actions based on learned policy

Neural substrates: Dorsal striatum, motor cortex, updated by dopamine signal

The Basal Ganglia Circuit

The basal ganglia—a group of subcortical nuclei—appears to implement the actor-critic architecture:

Key Structures

Ventral Striatum (including Nucleus Accumbens)

The "Critic"—represents values, receives reward information

Dorsal Striatum (Caudate, Putamen)

The "Actor"—represents action values, habit formation

SNc/VTA Dopamine Neurons

The "Error Signal"—broadcast TD error to both actor and critic

Direct and Indirect Pathways

The striatum has two main output pathways that may implement different aspects of learning:

Direct Pathway (D1)

"Go" pathway—promotes actions

Dopamine (via D1 receptors) strengthens this pathway, reinforcing good actions

Indirect Pathway (D2)

"No-Go" pathway—suppresses actions

Low dopamine (via D2 receptors) strengthens this pathway, avoiding bad actions

Opponent Learning

This dual-pathway structure may implement something like "opponent" learning—positive prediction errors strengthen "Go" for chosen actions, while negative errors strengthen "No-Go." This could enable learning from both rewards and punishments!

Section 15.8

Actor and Critic Learning Rules

How do the actor and critic actually learn? The dopamine TD error signal must be combined with local information about what was active during learning.

The Critic's Learning Rule

The critic learns value predictions using something like TD(0):

V(s) \leftarrow V(s) + \alpha \cdot \delta \cdot e(s)

where e(s) is an eligibility trace marking recently visited states. In neural terms:

  • \delta = dopamine signal (TD error)
  • e(s) = recent synaptic activity (eligibility trace)
  • Synapses that were recently active get strengthened/weakened by dopamine
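
A minimal tabular sketch of such a critic, with an eligibility-trace vector standing in for the "recently active synapse" tags; the state count and decay constants are arbitrary choices for illustration.

```python
import numpy as np

n_states = 10
V = np.zeros(n_states)   # value table (the "critic")
e = np.zeros(n_states)   # eligibility traces: which states were recently visited
alpha, gamma, lam = 0.1, 0.95, 0.9

def critic_step(s, r, s_next, terminal=False):
    """One TD(lambda) critic update. The scalar delta plays the role of the
    broadcast dopamine signal; only eligible (recently visited) states change."""
    global V, e
    v_next = 0.0 if terminal else V[s_next]
    delta = r + gamma * v_next - V[s]   # prediction error ("dopamine")
    e *= gamma * lam                    # decay all traces
    e[s] += 1.0                         # mark the just-visited state as eligible
    V += alpha * delta * e              # update in proportion to eligibility
    return delta
```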

The Actor's Learning Rule

The actor learns action preferences using something like policy gradient:

\theta \leftarrow \theta + \alpha \cdot \delta \cdot \nabla_\theta \log \pi(a|s)

In neural terms, this means:

Positive δ (reward better than expected): Strengthen synapses for the action that was taken → more likely to repeat

Negative δ (reward worse than expected): Weaken synapses for the action that was taken → less likely to repeat
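
A matching sketch for the actor: a softmax policy over tabular action preferences whose parameters are nudged by the same broadcast delta. The state and action counts are placeholders.

```python
import numpy as np

n_states, n_actions = 10, 4
theta = np.zeros((n_states, n_actions))   # action preferences (the "actor")
alpha_actor = 0.05
rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def choose_action(s):
    return rng.choice(n_actions, p=softmax(theta[s]))

def actor_step(s, a, delta):
    """Policy-gradient-style update: the gradient of log softmax is
    one_hot(a) - pi(.|s), scaled by the broadcast TD error."""
    pi = softmax(theta[s])
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * delta * grad_log_pi
```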

Three-Factor Learning Rule

Both actor and critic learning can be seen as three-factor rules: learning requires (1) presynaptic activity, (2) postsynaptic activity, and (3) a modulatory signal (dopamine). This is more selective than simple Hebbian learning—it ensures only reward-relevant associations are strengthened.
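
Reduced to a single synapse, a three-factor rule might look roughly like this; the constants and activity values are arbitrary, and the point is the gating, not the numbers.

```python
def three_factor_update(w, pre, post, dopamine_delta, lr=0.01):
    """The weight changes only when presynaptic activity, postsynaptic
    activity, AND a modulatory (dopamine) signal coincide; the sign of
    the change follows the sign of the prediction error."""
    return w + lr * pre * post * dopamine_delta

w = 0.5
w = three_factor_update(w, pre=1.0, post=1.0, dopamine_delta=+0.8)  # strengthened
w = three_factor_update(w, pre=0.0, post=1.0, dopamine_delta=+0.8)  # unchanged: no presynaptic activity
w = three_factor_update(w, pre=1.0, post=1.0, dopamine_delta=-0.5)  # weakened
```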

Neural Eligibility Traces

A key problem: dopamine arrives later than the synaptic activity it should modify. The brain appears to solve this with biological eligibility traces:

  • Synaptic activity leaves a molecular "tag" that persists for seconds
  • This tag makes the synapse eligible for modification
  • When dopamine arrives, only tagged synapses are changed
  • Calcium dynamics and protein kinase cascades may implement this

This elegantly solves the temporal credit assignment problem—only recently active synapses get credit for rewards received later.

Section 15.9

Hedonistic Neurons

A fascinating question: could individual neurons implement reinforcement learning? The hedonistic neuron hypothesis proposes that neurons might try to maximize their own "reward" (certain patterns of input).

The Hedonistic Neuron Hypothesis

Individual neurons adjust their synaptic weights to maximize the neuromodulatory signals they receive. If dopamine is "rewarding" to neurons, they will learn to make dopamine more likely—aligning their learning with behavioral goals!

How It Could Work

1. Dopamine as Intrinsic Reward

Dopamine might directly modulate synaptic plasticity in ways that make the neuron more likely to produce outputs that led to dopamine release.

2. Local Optimization

Each neuron tries to maximize its own dopamine input—but because dopamine signals behavioral success, this aligns individual neurons with organism-level goals.

3. Emergent Cooperation

Networks of hedonistic neurons might collectively learn complex behaviors—each pursuing local reward but contributing to global success.

Connection to Policy Gradient

Interestingly, if neurons learn to maximize dopamine using something like a stochastic gradient ascent on their outputs, the collective network learning might approximate policy gradient methods! The global RPE signal enables local learning that serves global objectives.
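
One way to make this concrete is a stochastic binary unit that adjusts its weights with a REINFORCE-style rule driven by a global dopamine-like signal. This is a sketch of the idea, not a claim about any particular biological neuron.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class HedonisticUnit:
    """A Bernoulli-logistic unit that nudges its weights so that outputs
    followed by a positive global signal become more probable."""
    def __init__(self, n_inputs, lr=0.05):
        self.w = np.zeros(n_inputs)
        self.lr = lr

    def fire(self, x):
        self.x = np.asarray(x, dtype=float)
        self.p = sigmoid(self.w @ self.x)        # probability of spiking
        self.spike = float(rng.random() < self.p)
        return self.spike

    def learn(self, delta):
        # REINFORCE for a Bernoulli unit: grad log p(spike) = (spike - p) * x
        self.w += self.lr * delta * (self.spike - self.p) * self.x
```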

Section 15.10

Collective Reinforcement Learning

The brain faces a massive credit assignment problem: billions of synapses must be coordinated to produce adaptive behavior, but only a single global reward signal is available. How can this work?

The Structural Credit Assignment Problem

Even if we solve temporal credit assignment (knowing when to update), we still need structural credit assignment: which synapses should change, and by how much?

The Problem

A single dopamine signal is broadcast to billions of synapses. If they all just strengthen when dopamine is high, wouldn't irrelevant synapses also strengthen? How does the brain avoid reinforcing noise?

Solutions

Eligibility Traces Filter Updates

Only synapses that were recently active have eligibility traces, so only they can be modified. This filters updates to relevant synapses.

Variance Reduction Through Baselines

Baseline firing rates and average value estimates help separate signal from noise, similar to using baselines in REINFORCE.

Stochastic Sampling

Neural variability means only some synapses contribute to each action. Over many trials, relevant synapses get reinforced more consistently than irrelevant ones.

Why Noise Helps

Counterintuitively, neural noise may help! If synaptic strengths randomly fluctuate, the effect on behavior varies trial-by-trial. Synapses whose fluctuations correlate with reward will get consistently reinforced; others average out. This is essentially a biological implementation of evolutionary strategies or policy gradient!
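
A toy version of this "noise plus correlation" idea is weight perturbation: jitter the weights, check whether the outcome beat a running baseline, and keep the jitter in proportion to the improvement. The task below (matching an unknown target weight vector) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.zeros(3)
target = np.array([0.7, -0.2, 0.4])   # unknown-to-the-learner "ideal" weights

def reward(weights):
    return -np.sum((weights - target) ** 2)   # higher when closer to the target

baseline = reward(w)
for step in range(2000):
    noise = 0.05 * rng.standard_normal(3)   # random synaptic fluctuation
    r = reward(w + noise)
    w += 0.5 * (r - baseline) * noise        # keep fluctuations that correlated with improvement
    baseline += 0.1 * (r - baseline)         # running average acts as the baseline
print(np.round(w, 2))                        # ends up close to the target
```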

Section 15.11

Model-based Methods in the Brain

We've focused on model-free RL, but the brain also implements model-based planning. Chapter 14 discussed the psychological evidence; here we consider neural substrates.

Neural Substrates of Model-based Learning

Prefrontal Cortex

  • Planning and decision-making
  • Working memory for goals
  • Model-based value computation
  • Outcome simulation

Hippocampus

  • Cognitive maps (spatial and otherwise)
  • Episodic memory
  • Sequence replay for planning
  • Model learning

Replay: Offline Planning

A remarkable discovery: during rest and sleep, the hippocampus "replays" sequences of neural activity corresponding to past experiences, often in reverse order or at faster-than-real-time speed.

Types of Replay

Forward Replay

Simulating future trajectories—planning ahead

Reverse Replay

Playing experiences backward—may help propagate values via TD

Prioritized Replay

High-reward or surprising experiences replay more—like prioritized sweeping!

Dyna in the Brain

Replay looks remarkably like Dyna (Chapter 8)! The brain interleaves real experience with simulated experience, using its model (in hippocampus) to generate training data for the value system (in striatum). Even the prioritization of surprising events matches prioritized sweeping algorithms.
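
A stripped-down Dyna loop makes the analogy concrete: each real transition updates the value table and is also stored in a simple model, and "replay" draws remembered transitions back out for extra updates. The environment interface and constants are placeholders.

```python
import random

Q = {}       # action values (the "striatum")
model = {}   # remembered transitions (the "hippocampus")
alpha, gamma, n_replay = 0.1, 0.95, 10

def q(s, a):
    return Q.get((s, a), 0.0)

def q_update(s, a, r, s_next, actions):
    best_next = max(q(s_next, a2) for a2 in actions)
    Q[(s, a)] = q(s, a) + alpha * (r + gamma * best_next - q(s, a))

def dyna_step(s, a, r, s_next, actions):
    q_update(s, a, r, s_next, actions)   # learn from real experience
    model[(s, a)] = (r, s_next)          # store the experience in the model
    for _ in range(n_replay):            # offline "replay" from memory
        (ms, ma), (mr, ms_next) = random.choice(list(model.items()))
        q_update(ms, ma, mr, ms_next, actions)
```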

Prospective Decision-making

fMRI studies show that when humans make complex decisions:

  • Hippocampus activates when imagining future scenarios
  • Prefrontal cortex evaluates imagined outcomes
  • Different options show sequential activation (search)
  • Activity patterns correlate with model-based choice

This suggests tree-search-like planning in the brain—possibly something like MCTS for important decisions!
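
To show what "tree-search-like planning" means computationally, here is a tiny depth-limited lookahead over a learned model. The model object and its toy transitions are invented placeholders, not a claim about hippocampal coding.

```python
class TinyModel:
    """A stand-in learned model: hand-coded (state, action) -> (reward, next state)."""
    def __init__(self, transitions):
        self.transitions = transitions

    def actions(self, state):
        return [a for (s, a) in self.transitions if s == state]

    def predict(self, state, action):
        return self.transitions[(state, action)]

def plan_value(state, model, depth, gamma=0.95):
    """Depth-limited lookahead: imagine each action, roll the model forward,
    and back up the best discounted outcome."""
    acts = model.actions(state)
    if depth == 0 or not acts:
        return 0.0
    return max(r + gamma * plan_value(s2, model, depth - 1, gamma)
               for a in acts
               for (r, s2) in [model.predict(state, a)])

model = TinyModel({("home", "cafe"): (0.0, "cafe"),
                   ("home", "office"): (0.0, "office"),
                   ("cafe", "order"): (1.0, "done"),
                   ("office", "work"): (0.2, "done")})
print(plan_value("home", model, depth=2))   # 0.95: the cafe branch wins the lookahead
```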

Section 15.12

Addiction

Understanding dopamine as a prediction error signal sheds new light on addiction. Addictive drugs hijack the brain's learning system in specific, predictable ways.

How Drugs Hijack Dopamine

Most addictive drugs—cocaine, amphetamines, opioids, nicotine, alcohol—directly or indirectly increase dopamine signaling. This creates artificial "prediction errors" that tell the brain: "This was much better than expected—learn this!"

The RL Perspective on Addiction

Pharmacological Prediction Error

Drugs create dopamine signals that don't reflect actual value—they're "counterfeit" prediction errors. The brain can't distinguish drug-induced dopamine from natural reward signals.

Aberrant Learning

These false signals drive powerful learning—strengthening drug-seeking actions, associations with drug cues, and motivation toward drugs. The brain literally learns that drugs are the most valuable thing.

Tolerance and Sensitization

With repeated use, the baseline shifts. The brain adapts to expect drug-induced dopamine—so normal rewards feel worse (tolerance), while drug cues become increasingly powerful (sensitization).
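
One family of computational models captures this "counterfeit error" idea by adding a drug-induced term to the TD error that learning can never cancel out, so the value of drug-associated states keeps climbing. The sketch below follows that idea with invented numbers.

```python
def td_error_with_drug(r, v_next, v, drug_boost=0.0, gamma=0.95):
    """TD error with a pharmacological term that cannot be predicted away:
    the error never drops below the drug-induced boost."""
    delta = r + gamma * v_next - v
    return max(delta + drug_boost, drug_boost) if drug_boost > 0 else delta

V_drug_cue = 0.0
for _ in range(50):
    # The drug state keeps generating a positive error no matter how large its
    # value becomes, unlike an ordinary reward of the same size (which would
    # stop producing errors once fully predicted).
    delta = td_error_with_drug(r=1.0, v_next=0.0, v=V_drug_cue, drug_boost=0.5)
    V_drug_cue += 0.1 * delta
print(round(V_drug_cue, 2))   # grows far beyond what an ordinary reward would support
```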

Why Addiction is So Persistent

The RL framework helps explain why addiction is so hard to overcome:

  • Overlearned habits: Drug-seeking becomes deeply ingrained (model-free)
  • Cue-triggered craving: Drug cues have acquired enormous value
  • Impaired prefrontal control: Model-based system weakened by drug effects
  • Distorted value estimates: Non-drug rewards seem worthless by comparison

Computational Psychiatry

This RL perspective on addiction is part of computational psychiatry—using computational models to understand mental disorders. Similar approaches model depression (low average reward estimates), anxiety (overestimation of negative outcomes), and other conditions as disruptions to the brain's RL machinery.

Section 15.13

Summary

This chapter has explored one of the most successful connections between AI and neuroscience: the correspondence between TD learning and dopamine. This isn't just an analogy—it's a deep computational theory that generates testable predictions about neural activity.

Key Findings

1

Dopamine encodes TD error

Phasic dopamine activity matches the mathematical form of TD prediction error

2

Actor-critic in basal ganglia

Ventral striatum (critic) and dorsal striatum (actor) implement dual-system learning

3

Model-based planning via replay

Hippocampal replay implements Dyna-like model-based learning

4

Addiction as aberrant learning

Drugs hijack the dopamine system, creating false prediction errors

Major Insights

Evolution Discovered TD Learning

The striking similarity between engineered RL algorithms and biological reward systems suggests these are fundamental solutions to the learning problem—discovered independently by evolution and computer science.

Bidirectional Inspiration

RL theory helps interpret neural data; neuroscience findings inspire new algorithms. This cross-fertilization has been immensely productive for both fields.

Credit Assignment in Neural Networks

The brain's solution to credit assignment—eligibility traces, modulatory signals, and stochastic sampling—may inspire more biologically-plausible deep learning algorithms.

Open Questions

Despite the success, many questions remain: How does the brain represent states and values? What's the exact discount factor? How do model-based and model-free systems interact? How is exploration controlled? These are active research areas at the intersection of RL, neuroscience, and AI.

Looking Forward

The connection between RL and neuroscience continues to deepen:

  • Better neural recordings test finer predictions of RL models
  • Optogenetics allows causal manipulation of dopamine signals
  • Computational psychiatry applies RL to mental disorders
  • Brain-machine interfaces leverage RL for control
  • Neuroscience insights improve AI algorithms

The dopamine-as-TD-error hypothesis stands as one of the great success stories of computational neuroscience—a beautiful example of theory and experiment coming together to reveal how the brain learns from rewards.