Psychology
RL and Animal Learning
Reinforcement learning has deep roots in psychology—indeed, the very term "reinforcement" comes from animal learning theory. This chapter explores the fascinating connections between RL algorithms and what psychologists have learned about how animals (including humans) learn through experience.
A Beautiful Convergence
Many core ideas in computational RL—temporal-difference learning, prediction from successive estimates, the distinction between model-based and model-free learning—were developed independently by psychologists studying animal behavior and engineers designing adaptive systems. The TD model of classical conditioning, for example, emerged from both Sutton's work on adaptive elements and psychologists' attempts to explain phenomena like blocking. This chapter shows how these parallel threads intertwine.
Prediction and Control
Psychologists have long distinguished between two fundamental types of learning: classical conditioning (learning to predict) and instrumental conditioning (learning to control). This maps remarkably well onto the RL distinction between prediction and control!
Classical (Pavlovian) Conditioning: Learning to predict important events. Pavlov's dogs learned that a bell predicts food, so they salivated in anticipation.
Instrumental (Operant) Conditioning: Learning which actions lead to rewards. A rat learns to press a lever to get food pellets.
The Correspondence
Classical Conditioning
Animal learns to predict rewards/punishments based on stimuli.
RL Analogue:
Prediction problem — learning value functions V(s) that predict future cumulative reward
Instrumental Conditioning
Animal learns which actions produce rewards.
RL Analogue:
Control problem — learning policies π(s) or action values Q(s,a) to maximize cumulative reward
In classical conditioning, the animal is passive—it doesn't control whether the bell rings or food arrives. It just learns the predictive relationship. In instrumental conditioning, the animal's actions influence what happens next.
While psychologists initially studied these as separate phenomena, RL reveals they're deeply connected. Control (instrumental) learning typically requires prediction (classical) learning as a component—you need to predict the consequences of your actions to choose wisely!
Terminology Translation
| Psychology Term | RL Term |
|---|---|
| Reinforcer / Primary reward | Reward signal |
| Conditioned stimulus (CS) | State features |
| Unconditioned stimulus (US) | Reward |
| Associative strength | Value / Weight |
| Secondary/Conditioned reinforcer | State with high value |
| Extinction | Value decay when reward removed |
Classical Conditioning
Classical conditioning is one of the most studied phenomena in psychology. Pavlov's original experiments in the early 1900s showed that dogs learn to salivate when they hear a bell that has repeatedly preceded food. But the story gets much more interesting when we look at the details.
Pavlov's Classic Experiment
Before conditioning:
- Food (US) → Salivation (natural response)
- Bell → No particular response
During conditioning:
- Bell → Food → Salivation (repeated many times)
After conditioning:
- Bell alone → Salivation (conditioned response!)
Beyond Simple Association: The Blocking Effect
A crucial finding challenged simple associative theories. Discovered by Kamin (1969), the blocking effect shows that conditioning is not just about pairing stimuli—it's about information and surprise.
If an animal first learns that stimulus A predicts reward, and then experiences A+B together followed by the same reward, it learns almost nothing about B! The presence of A "blocks" learning about B because B provides no new predictive information—A already fully predicts the reward.
Blocking Experiment Design
Phase 1: Light A → Shock (repeated until A predicts shock)
Phase 2: Light A + Tone B → Shock (compound stimulus)
Test: Tone B alone → Little or no fear response!
The animal doesn't learn to fear B, even though B was paired with shock. Why? Because A already perfectly predicted the shock—B added no new information.
The Rescorla-Wagner Model
In 1972, Rescorla and Wagner proposed an elegant model that explained blocking and many other conditioning phenomena. The key insight: learning is driven by prediction error—the difference between what the animal expected and what actually happened.
ΔV_s = α_s β (R − V_total)

where:
- ΔV_s = change in associative strength of stimulus s
- α_s = learning rate for the CS (salience of stimulus s)
- β = learning rate for the US
- R = reward (US) magnitude
- V_total = total prediction from all stimuli present on the trial (the sum of their current associative strengths)
In blocking: After Phase 1, V(A) ≈ R (A fully predicts the reward). In Phase 2, the prediction error is R - V(A) ≈ 0. With zero prediction error, there's no learning about B! This is exactly the same principle as TD learning: no update when predictions are accurate.
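To make the arithmetic concrete, here is a minimal Python sketch of the Rescorla-Wagner update reproducing blocking. It is an illustration only; the learning rates, reward magnitude, and trial counts are arbitrary assumptions, not values from the original experiments.

```python
# Minimal Rescorla-Wagner sketch reproducing the blocking effect.
# Learning rates, reward magnitude, and trial counts are illustrative assumptions.
def rescorla_wagner(trials, n_stimuli=2, alpha=0.3, beta=1.0):
    """trials: list of (stimuli_present, reward) pairs, where stimuli_present
    is the set of stimulus indices active on that trial."""
    V = [0.0] * n_stimuli
    for present, reward in trials:
        v_total = sum(V[s] for s in present)   # combined prediction of active stimuli
        error = reward - v_total               # prediction error (R - V_total)
        for s in present:
            V[s] += alpha * beta * error       # only active stimuli are updated
    return V

A, B = 0, 1
phase1 = [({A}, 1.0)] * 50        # Phase 1: A alone is paired with the US
phase2 = [({A, B}, 1.0)] * 50     # Phase 2: A+B compound, same US
V = rescorla_wagner(phase1 + phase2)
print(f"V(A) = {V[A]:.2f}, V(B) = {V[B]:.2f}")   # V(B) stays near 0: blocked
```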
The TD Model of Classical Conditioning
The Rescorla-Wagner model has a limitation: it only considers the moment of reward delivery. But real conditioning involves time—the bell comes before the food. Sutton and Barto (1981, 1990) proposed a TD model that extends Rescorla-Wagner to handle temporal relationships.
Learning is still driven by a prediction error, but the error is now temporal: δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t). This is essentially TD(0)! Because the error includes the next moment's predicted value, the model can explain:
Secondary Conditioning
If A predicts food, and then B predicts A, the animal learns to respond to B even though B never directly preceded food. Value "backs up" from A to B.
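A tiny tabular TD(0) simulation makes this backing-up visible. This is only a sketch under an assumed trial structure, step size, discount, and trial counts; it is not a model of any specific experiment.

```python
# Sketch: secondary conditioning falls out of TD(0) bootstrapping.
# Trial structure, alpha, gamma, and trial counts are illustrative assumptions.
gamma, alpha = 0.9, 0.1
V = {"B": 0.0, "A": 0.0, "end": 0.0}

def td_trial(states, rewards):
    """Apply TD(0) along one trial: states[t] -> states[t+1] with rewards[t]."""
    for t in range(len(states) - 1):
        s, s_next, r = states[t], states[t + 1], rewards[t]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])   # TD(0) update

for _ in range(100):                 # Phase 1: A is followed by food (reward 1)
    td_trial(["A", "end"], [1.0])
for _ in range(20):                  # Phase 2: B is followed by A, never by food
    td_trial(["B", "A", "end"], [0.0, 0.0])

# B acquires value purely by bootstrapping from A's prediction, even though B
# was never paired with food (while A slowly extinguishes without the US).
print(f"V(A) = {V['A']:.2f}, V(B) = {V['B']:.2f}")
```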
Timing Effects
Conditioning is typically strongest when the CS precedes the US by a short interval (on the order of half a second in many preparations) and weakens as that interval grows. The TD model explains this dependence through the temporal structure of prediction errors.
Neuroscience Validation: Dopamine and TD Errors
In a remarkable confirmation, neuroscientists discovered that dopamine neurons in the midbrain behave exactly like TD prediction errors! Schultz, Dayan, and Montague (1997) showed:
Dopamine Neuron Responses
Unexpected Reward
Dopamine neurons fire vigorously (positive TD error)
Fully Predicted Reward
No response at reward time—but firing shifts to the predictive cue!
Predicted Reward Omitted
Dopamine neurons pause/depress at expected reward time (negative TD error)
This correspondence between TD learning and dopamine is one of the most celebrated findings in computational neuroscience. It suggests that evolution discovered TD learning long before computer scientists did: our brains appear to implement something very much like TD(0)!
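The three response patterns above can be reproduced by probing a trained TD(0) predictor. The sketch below is a toy illustration; the two-step "cue → wait → reward" trial and all parameters are assumptions, not a model of actual recordings.

```python
# Sketch: TD errors from a trained predictor mirror the three dopamine patterns.
# The toy "cue -> wait -> reward" trial and parameters are illustrative assumptions.
gamma, alpha = 1.0, 0.1
V = {"no_cue": 0.0, "cue": 0.0, "wait": 0.0, "end": 0.0}

def delta(s, r, s_next):                       # TD error: r + gamma*V(s') - V(s)
    return r + gamma * V[s_next] - V[s]

for _ in range(500):                           # training: cue reliably precedes reward
    for s, r, s_next in [("cue", 0.0, "wait"), ("wait", 1.0, "end")]:
        V[s] += alpha * delta(s, r, s_next)

# 1) Unexpected reward (no predictive cue): large positive error at reward time.
print("unexpected reward :", round(delta("no_cue", 1.0, "end"), 2))   # ~ +1
# 2) Fully predicted reward: no error at reward time, but a positive error
#    now appears at cue onset (the response "shifts to the predictive cue").
print("predicted reward  :", round(delta("wait", 1.0, "end"), 2))     # ~ 0
print("at cue onset      :", round(delta("no_cue", 0.0, "cue"), 2))   # ~ +1
# 3) Predicted reward omitted: negative error at the expected reward time.
print("omitted reward    :", round(delta("wait", 0.0, "end"), 2))     # ~ -1
```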
Instrumental Conditioning
While classical conditioning is about prediction, instrumental conditioning is about control—learning which actions to take to obtain rewards. This maps directly to the RL control problem.
Thorndike's Law of Effect
Edward Thorndike's experiments with cats in "puzzle boxes" (1898, 1911) established the foundational principle of instrumental learning:
The Law of Effect (Thorndike, 1911)
"Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur."
In modern terms: actions that lead to reward become more probable in similar situations. This is the essence of RL policy improvement!
Thorndike's Puzzle Box Experiments
The Experiment
A hungry cat was placed in a box with a door that could be opened by manipulating a latch, lever, or loop. Food was visible outside. The cat would try various actions—scratching, biting, pushing—until accidentally operating the release mechanism.
Key observation: Over repeated trials, the time to escape decreased gradually (not suddenly as insight might suggest). The successful action became more probable through reinforcement.
RL Connection: Trial-and-Error Learning
Thorndike's work established trial-and-error learning as a fundamental mechanism. This is precisely what RL algorithms do:
Try Actions
Explore different behaviors in a situation (exploration)
Observe Outcomes
Note which actions lead to rewards (evaluation)
Adjust Probabilities
Make rewarded actions more likely (policy improvement)
Thorndike emphasized that learning was selectional, not instructional: the environment selects from among the animal's varied behaviors rather than instructing specific actions. This is a key distinction from supervised learning, where a "teacher" directly specifies correct outputs.
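One simple way to express the Law of Effect computationally is an action-preference (gradient-bandit) update on a small choice problem. The sketch below is an illustration under assumed payoff probabilities, step size, and trial count; it is not a model of Thorndike's cats.

```python
# Sketch of the Law of Effect as trial-and-error learning on a 3-armed bandit.
# Payoff probabilities, step size, and trial count are illustrative assumptions.
import math
import random

payoff_prob = [0.2, 0.8, 0.5]          # hidden chance each "response" is rewarded
H = [0.0, 0.0, 0.0]                    # action preferences
alpha, baseline = 0.1, 0.0

def softmax(h):
    e = [math.exp(x) for x in h]
    return [x / sum(e) for x in e]

for t in range(1, 2001):
    probs = softmax(H)
    a = random.choices(range(3), weights=probs)[0]            # try an action
    r = 1.0 if random.random() < payoff_prob[a] else 0.0      # observe the outcome
    baseline += (r - baseline) / t                            # running average reward
    for b in range(3):                                        # rewarded actions become
        grad = (1.0 if b == a else 0.0) - probs[b]            # more probable in this
        H[b] += alpha * (r - baseline) * grad                 # situation
print([round(p, 2) for p in softmax(H)])   # most probability mass ends up on action 1
```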
Skinner's Operant Conditioning
B.F. Skinner extended Thorndike's work with more controlled experiments and systematic analysis. His "Skinner box" allowed precise measurement of behavior:
- Shaping: Reinforcing successive approximations to desired behavior
- Schedules of reinforcement: Variable ratio, fixed interval, etc.
- Discrimination: Learning to respond differently to different stimuli
- Chaining: Building complex behaviors from simpler components
Each of these has RL analogues. Shaping corresponds to reward shaping; schedules relate to partial observability and credit assignment; discrimination is state discrimination; chaining relates to hierarchical RL.
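One concrete, well-studied form of reward shaping in RL is potential-based shaping (Ng, Harada & Russell, 1999), which adds a bonus for progress toward a goal without changing which policies are optimal. A minimal sketch, with an assumed corridor task and potential function:

```python
# Sketch of potential-based reward shaping; the corridor task, potential
# function, and discount factor are illustrative assumptions.
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Augment the environment reward with F = gamma*phi(s') - phi(s)."""
    return r + gamma * phi(s_next) - phi(s)

# Corridor of length 10 with the goal at position 9: a potential that grows
# toward the goal rewards "successive approximations" to the target behavior.
phi = lambda s: -abs(9 - s)
print(round(shaped_reward(0.0, 3, 4, phi), 2))   # small bonus for moving closer
```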
Delayed Reinforcement
A central challenge in both animal learning and RL: how does an organism assign credit to earlier actions when rewards come later? This temporal credit assignment problem has been studied extensively by psychologists.
The Problem
Delayed Reward Paradox
Imagine a rat running through a maze. It makes dozens of choices, but only gets food at the end. How does it learn that the third turn was the critical mistake, when the negative feedback comes many seconds and many actions later?
Psychological Solutions
1. Eligibility Traces
Psychologists proposed that recently active stimuli or responses leave a decaying "trace" that makes them eligible for association with later rewards. This is essentially the psychological version of eligibility traces!
The trace decays over time but spikes when a state is visited, making it "eligible" for credit.
2. Secondary (Conditioned) Reinforcement
Stimuli that predict primary rewards become reinforcing themselves. In the maze, the sight of the goal-adjacent corridor becomes a secondary reinforcer that bridges the temporal gap.
This is exactly value bootstrapping! States with high V(s) act as proxy rewards, propagating credit backward through the state space.
TD learning elegantly combines both mechanisms. The TD error uses bootstrapping (secondary reinforcement) at each step, and TD(λ) with eligibility traces extends credit backward to recently visited states. Evolution seems to have discovered these solutions independently!
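The sketch below shows both pieces working together in TD(λ) with accumulating eligibility traces on a short chain of states. The chain, step size, discount, and trace decay are assumptions chosen only for illustration.

```python
# Sketch of TD(lambda): bootstrapped errors plus decaying eligibility traces.
# The four-state chain and all parameters are illustrative assumptions.
gamma, lam, alpha = 0.9, 0.8, 0.1
states = ["s0", "s1", "s2", "s3"]          # s3 is terminal; reward arrives on entering it
V = {s: 0.0 for s in states}

for _ in range(100):                       # repeated runs down the chain
    e = {s: 0.0 for s in states}           # traces reset at the start of each run
    for i in range(len(states) - 1):
        s, s_next = states[i], states[i + 1]
        r = 1.0 if s_next == "s3" else 0.0
        delta = r + gamma * V[s_next] - V[s]   # bootstrapped TD error
        e[s] += 1.0                            # visited state's trace spikes
        for x in states:
            V[x] += alpha * delta * e[x]       # credit flows to all eligible states
            e[x] *= gamma * lam                # traces decay over time
print({s: round(v, 2) for s, v in V.items()})  # earlier states receive discounted credit
```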
Experimental Evidence
Studies with rats in mazes show that:
- Learning degrades rapidly with increasing delay between action and reward
- But secondary reinforcers (cues associated with reward) can bridge long delays
- Rats learn faster with intermediate markers/cues along the path
- The pattern of errors during learning suggests backward propagation of value
All of these observations are consistent with TD learning with eligibility traces.
Cognitive Maps
In the 1930s-40s, Edward Tolman challenged the dominant behaviorist view with evidence that animals form internal representations of their environment—cognitive maps—rather than just learning stimulus-response associations.
Tolman's Latent Learning Experiments
The Famous Experiment (Tolman & Honzik, 1930)
Three groups of rats in a maze:
- Group 1: Always rewarded at goal — learned quickly
- Group 2: Never rewarded — wandered without improvement
- Group 3: No reward for 10 days, then reward introduced
Surprising result: Group 3 immediately performed as well as Group 1 once reward was introduced! They had learned the maze structure without explicit reinforcement—latent learning.
Latent learning: learning that occurs without any obvious reinforcement and remains "latent" (hidden) until a situation arises in which the knowledge becomes useful. This demonstrates that animals build internal models of their environment even without rewards.
RL Interpretation: Model-Based Learning
Tolman's findings map directly onto model-based RL:
Cognitive Map
Internal representation of environment structure
RL Analogue:
World model — learned transition dynamics p(s'|s,a)
Latent Learning
Learning environment structure without reward
RL Analogue:
Model learning — building transition model from exploration
When the reward was introduced to Group 3, they could immediately use their learned model to find the optimal path—like running value iteration on a previously learned MDP!
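A sketch of this interpretation: an agent wanders a toy corridor with no reward, accumulating only a transition model, and then plans with value iteration the moment a reward is defined. The corridor, trial counts, and discount are assumptions for illustration, not a model of the actual maze.

```python
# Sketch of latent learning as model-based RL: a transition model is learned
# during unrewarded wandering, then value iteration exploits it as soon as a
# reward appears. The 1-D "maze" and all parameters are illustrative assumptions.
import random
from collections import defaultdict

N, GOAL = 6, 5                                   # positions 0..5; food later placed at 5
counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': visit count}

def step(s, a):                                  # a = -1 (left) or +1 (right)
    return max(0, min(N - 1, s + a))

# Phase without reward: only the transition structure is learned.
s = 0
for _ in range(5000):
    a = random.choice((-1, +1))
    s_next = step(s, a)
    counts[(s, a)][s_next] += 1
    s = s_next

# Reward introduced: plan immediately with value iteration on the learned model.
gamma = 0.9
R = lambda s2: 1.0 if s2 == GOAL else 0.0
V = [0.0] * N

def q_value(s, a):
    dist, total = counts[(s, a)], sum(counts[(s, a)].values())
    if total == 0:
        return 0.0
    return sum(n / total * (R(s2) + gamma * V[s2]) for s2, n in dist.items())

for _ in range(100):
    for s in range(N):
        V[s] = max(q_value(s, -1), q_value(s, +1))
print([round(v, 1) for v in V])   # values rise toward the goal with no further trial and error
```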
Shortcuts and Novel Routes
Tolman also showed that rats could take novel shortcuts they had never traversed, suggesting they had a map-like representation rather than just memorized action sequences:
Shortcut Experiment
Rats trained to run a circuitous route to food. When the original path was blocked and new paths opened, many rats immediately chose the path pointing most directly toward the goal location—a shortcut they had never taken!
This is only possible with an internal spatial model, not simple stimulus-response learning.
Model-based RL agents can similarly compute optimal behaviors for novel situations by planning with their learned model. This is the computational advantage of having a world model versus pure model-free learning.
Habitual and Goal-directed Behavior
Modern psychology distinguishes between two types of learned behavior, which map remarkably well onto the model-free vs model-based distinction in RL.
Habitual Behavior
- Automatic, stimulus-triggered responses
- Fast and effortless
- Inflexible to changed outcomes
- Acquired through extensive practice
- Controlled by dorsolateral striatum
RL Analogue:
Model-free learning (cached action values)
Goal-directed Behavior
- Deliberate, outcome-oriented actions
- Slow and effortful
- Flexible to changed outcomes
- Used early in learning
- Controlled by dorsomedial striatum & prefrontal cortex
RL Analogue:
Model-based learning (planning with world model)
The Outcome Devaluation Test
A clever experiment distinguishes these two types:
Experimental Design
Training: Rat learns to press lever for food pellets
Devaluation: Food pellets are paired with illness (making them aversive)
Test: Rat placed back with lever—does it still press?
Goal-directed (early training):
Stops pressing—knows pressing → pellets, pellets now bad
Habitual (extensive training):
Keeps pressing—automatic response disconnected from outcome value
Model-free methods cache action values without remembering how those values were computed. When the reward structure changes, they're slow to update. Model-based methods can immediately recompute values from their model—if you know pressing leads to pellets and pellets are now bad, you know pressing is bad without needing to experience it!
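The contrast can be shown in a few lines. In this sketch, the actions, outcomes, and values are all assumed for illustration: a cached Q-value keeps recommending lever pressing after devaluation, while recomputing from a learned outcome model stops immediately.

```python
# Sketch of the devaluation test: cached (model-free) values ignore the change,
# while a model-based recomputation reacts at once. All values are illustrative assumptions.

# Both systems learned during training that pressing the lever yields a pellet.
transition_model = {"press": "pellet", "wait": "nothing"}
outcome_value = {"pellet": 1.0, "nothing": 0.0}
Q_cached = {"press": 1.0, "wait": 0.0}      # model-free values cached from training

# Devaluation: pellets are paired with illness and become aversive.
outcome_value["pellet"] = -1.0

model_free_choice = max(Q_cached, key=Q_cached.get)
model_based_choice = max(transition_model,
                         key=lambda a: outcome_value[transition_model[a]])
print("habitual (model-free)      :", model_free_choice)    # still "press"
print("goal-directed (model-based):", model_based_choice)   # now "wait"
```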
The Transition from Goal-directed to Habitual
With extensive practice, behavior shifts from goal-directed to habitual. This makes computational sense:
Early Learning: Model-based
Use model to plan because you don't have good value estimates yet. Flexible but computationally expensive.
After Practice: Model-free
Cached values are accurate from experience. Fast and automatic. Planning would just give the same answer more slowly.
The Dyna architecture (Chapter 8) combines both systems—using real experience for model-free updates while also using the model for planning. This may resemble how biological brains coordinate habitual and goal-directed systems.
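A minimal Dyna-Q sketch of that combination, on an assumed toy corridor task (the environment, ε, step sizes, and planning budget are illustrative choices, not anything prescribed by the chapter):

```python
# Minimal Dyna-Q sketch: every real step does a direct ("habitual") Q update,
# records the experience in a model, then replays simulated ("goal-directed")
# updates from that model. The toy task and parameters are illustrative assumptions.
import random

N, GOAL = 6, 5
alpha, gamma, epsilon, n_planning = 0.1, 0.95, 0.1, 10
Q = {(s, a): 0.0 for s in range(N) for a in (-1, +1)}
model = {}                                     # (s, a) -> (reward, next state)

def env_step(s, a):
    s_next = max(0, min(N - 1, s + a))
    return (1.0 if s_next == GOAL else 0.0), s_next

def greedy(s):                                 # greedy action, ties broken at random
    best = max(Q[(s, -1)], Q[(s, +1)])
    return random.choice([a for a in (-1, +1) if Q[(s, a)] == best])

s = 0
for _ in range(2000):
    a = random.choice((-1, +1)) if random.random() < epsilon else greedy(s)
    r, s_next = env_step(s, a)
    # (a) direct model-free update from real experience
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, greedy(s_next))] - Q[(s, a)])
    # (b) update the learned world model
    model[(s, a)] = (r, s_next)
    # (c) planning: replay simulated experience drawn from the model
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * Q[(ps_next, greedy(ps_next))] - Q[(ps, pa)])
    s = 0 if s_next == GOAL else s_next        # restart after reaching the food
print(greedy(0), greedy(2), greedy(4))         # learned policy heads right (+1) toward the goal
```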
Neural Substrates
Neuroscience has identified distinct brain regions for these systems:
Habitual System
- Dorsolateral striatum
- Sensorimotor cortex
- Automatic, stimulus-response
Goal-directed System
- Dorsomedial striatum
- Prefrontal cortex
- Deliberative, outcome-based
Damage to the dorsomedial striatum makes animals rely more on habits; damage to the dorsolateral striatum makes them more goal-directed. This double dissociation supports the two-system model that maps onto model-based and model-free RL.
Summary
This chapter has traced the deep connections between reinforcement learning and the psychology of animal learning. These connections are not coincidental—both fields are studying the same fundamental problem: how intelligent systems learn from interaction with their environment.
Key Correspondences
| Psychological Concept | RL Concept |
|---|---|
| Classical conditioning | Prediction (value function learning) |
| Instrumental conditioning | Control (policy learning) |
| Rescorla-Wagner model | TD(0) at reward time only |
| TD model of conditioning | TD learning |
| Blocking effect | No learning when TD error = 0 |
| Secondary reinforcement | Value bootstrapping |
| Eligibility traces (psychological) | Eligibility traces in TD(λ) |
| Cognitive maps | World models |
| Latent learning | Model learning without reward |
| Habitual behavior | Model-free learning |
| Goal-directed behavior | Model-based planning |
Major Insights
Prediction Error Drives Learning
The blocking effect showed that animals learn from surprise, not mere pairing. TD learning formalizes this: updates are proportional to prediction error. Dopamine neurons appear to encode exactly this signal in the brain.
Two Learning Systems
Both psychology and RL recognize the value of having both fast/inflexible (model-free) and slow/flexible (model-based) learning systems. The brain implements both with distinct neural substrates.
Evolution Discovered These Algorithms
The remarkable correspondence between RL algorithms developed by computer scientists and learning mechanisms discovered by psychologists suggests these are fundamental solutions to the learning problem—discovered independently by evolution and engineering.
The relationship between RL and psychology is bidirectional. Psychology inspired early RL (the term "reinforcement" itself comes from psychology). Now computational RL informs psychological theory—providing precise mathematical frameworks for understanding behavior and testable predictions about neural mechanisms.
Looking Forward
This chapter focused on connections to classical animal learning research. Modern work extends these ideas to:
- Understanding human decision-making and addiction
- Computational psychiatry (modeling mental disorders as learning dysfunctions)
- Designing brain-machine interfaces
- Understanding the neural basis of exploration vs exploitation
- Developing more human-like AI systems
The convergence of reinforcement learning and psychology/neuroscience represents one of the most exciting interdisciplinary frontiers in science—one that promises to illuminate both the nature of biological intelligence and the design of artificial systems.