Monte Carlo Methods
Learn from complete episodes without knowing the environment model. MC methods average sample returns to estimate value functions.
Welcome! Here's What This Chapter is About
In Chapter 4, we learned Dynamic Programming (DP) — powerful methods that require a complete model of the environment (all transition probabilities and rewards). But here's the problem: in the real world, we rarely have such a perfect model!
Monte Carlo methods solve this problem by learning directly from experience — actual sequences of states, actions, and rewards. No model needed!
Simple analogy: Imagine learning to navigate a new city.
- • DP approach: Study a perfect map with traffic patterns, road conditions, and timing for every route
- • MC approach: Just start driving! Take different routes, track how long each trip takes, and average your experiences
Prerequisites (Quick Refresher)
- • Model-Free: no dynamics needed
- • Episodic: complete episodes
- • Sample Averages: unbiased estimates
The Big Picture: Learning from Episodes
What Exactly is an "Episode"?
An episode is one complete experience from start to finish. Think of it like one game of chess, one maze run, or one blackjack hand — it has a clear beginning and end.
Example episode (robot navigating a room), with step rewards of -1, 0, and +10:
Total return: -1 + 0 + 10 = +9
🎲 Why "Monte Carlo"?
The name comes from the famous casino in Monaco! Just like gambling involves randomness and averaging outcomes, MC methods use random sampling to estimate values. The key insight: if you flip a coin enough times, you'll discover it's ~50% heads.
🔑 The Key Insight
We don't need to know why certain states lead to good outcomes. We just need to observe what happens and average the results. With enough samples, our averages converge to the true expected values!
🎯 The MC Recipe (3 Simple Steps)
1. Play through a complete episode: follow your policy from start to terminal state.
2. Calculate the return for each visited state: sum up all rewards from that state until the episode ends.
3. Update each state's value to the average of observed returns: more samples → better estimate!
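Here's a minimal Python sketch of that recipe, assuming a hypothetical `run_episode(policy)` helper that plays one full episode and returns a list of (state, reward) pairs; for simplicity it averages every visit:

```python
from collections import defaultdict

def mc_prediction(run_episode, policy, num_episodes, gamma=1.0):
    """Estimate V(s) by averaging observed returns (a sketch of the 3-step recipe)."""
    returns = defaultdict(list)              # state -> list of observed returns
    for _ in range(num_episodes):
        episode = run_episode(policy)        # Step 1: play one complete episode
        G = 0.0
        # Step 2: walk backwards so G accumulates the rewards from each state onward
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)         # (every-visit averaging, for simplicity)
    # Step 3: each state's value is the average of its observed returns
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```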
Monte Carlo Prediction
Estimating state values by averaging sample returns.
What You'll Learn in This Section
- • How to estimate V(s) without knowing the environment's dynamics
- • The difference between First-Visit and Every-Visit MC
- • Why averaging samples gives us the true value (Law of Large Numbers)
Given a policy π, estimate v_π(s) by averaging returns observed after visiting state s. No model of the environment is needed—just sample episodes.
Deep Dive Analogy: Rating Coffee Shops
Imagine you're trying to figure out the "value" of different neighborhoods for getting coffee. You have no idea about shop ratings, prices, or quality beforehand — you just start visiting.
Your "Episode" = One Coffee Trip
You leave home, walk through neighborhood A, stop at a coffee shop, rate your experience (the "return"), and go home. That entire journey from leaving to returning home is one episode.
Trip 1: Neighborhood A
Great latte! Rating: +8
Trip 2: Neighborhood A
Long wait, ok coffee. Rating: +5
Trip 3: Neighborhood A
Amazing pastry! Rating: +9
Trip 4: Neighborhood A
Crowded, decent. Rating: +6
Your estimate of V(Neighborhood A):
(8 + 5 + 9 + 6) / 4 = 7.0
As you make more trips, this average gets closer to the "true" value of neighborhood A for coffee!
First-Visit vs. Every-Visit: What's the Difference?
Sometimes you visit the same state multiple times within a single episode. For example, you might walk through neighborhood A twice during one coffee trip (going and returning). Should you count both visits or just the first one?
First-Visit MC
Only use the return from the first time you visit state s in each episode. Ignore subsequent visits.
Example episode: A → B → A → C → Goal
For state A: Only count the return from the first A, ignore the second A
Every-Visit MC
Use the return from every visit to state s, even if you visit multiple times per episode.
Example episode: A → B → A → C → Goal
For state A: Count returns from both visits to A
Bottom line: Both methods converge to the true value V(s). First-Visit is more common in theory; Every-Visit is sometimes easier to implement. For most practical purposes, the difference is minor.
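To make the difference concrete, here's a small sketch (the per-step rewards are made up for illustration) that computes the returns each method would record from the episode A → B → A → C → Goal:

```python
from collections import defaultdict

def first_visit_returns(episode, gamma=1.0):
    """Return {state: return from its FIRST visit} for one episode."""
    G, out = 0.0, {}
    for state, reward in reversed(episode):  # walk backwards, accumulating the return
        G = reward + gamma * G
        out[state] = G                       # earlier visits overwrite later ones,
                                             # so only the first visit's return survives
    return out

def every_visit_returns(episode, gamma=1.0):
    """Return {state: [return from EVERY visit]} for one episode."""
    G, out = 0.0, defaultdict(list)
    for state, reward in reversed(episode):
        G = reward + gamma * G
        out[state].append(G)
    return out

# Episode A -> B -> A -> C -> Goal as (state, reward-on-leaving) pairs; rewards are illustrative
episode = [("A", -1), ("B", -1), ("A", -1), ("C", 10)]
print(first_visit_returns(episode))          # A counted once:  {'C': 10, 'A': 7, 'B': 8}
print(dict(every_visit_returns(episode)))    # A counted twice: {'C': [10], 'A': [9, 7], 'B': [8]}
```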
Interactive First-Visit MC Demo
First-Visit MC Update
Each state's value is the average of all returns observed after first visiting that state.
Watch how values converge as more episodes are sampled. States visited more often have more accurate estimates.
MC vs DP: A Detailed Comparison
| Aspect | Dynamic Programming | Monte Carlo |
|---|---|---|
| Model Required? | Yes - need p(s',r\|s,a) | No - learn from experience |
| Bootstrapping? | Yes - uses V(s') estimates | No - uses actual returns |
| Updates | All states every sweep | Only visited states |
| When to Update? | Any time (synchronous) | After episode ends |
| Analogy | Planning with a perfect map | Learning by exploring |
Why Does Averaging Work? (The Math Behind MC)
The magic of MC comes from a fundamental principle in probability: the Law of Large Numbers.
The Law of Large Numbers (Simplified)
As you collect more and more samples, their average converges to the true expected value.
(G₁ + G₂ + … + Gₙ) / n → true value (as n gets large)
- • n=10: rough estimate
- • n=100: better estimate
- • n=1000: very accurate!
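You can watch the Law of Large Numbers in action with a tiny standalone simulation (not tied to any environment): sample noisy returns around a known mean and see the average home in on it.

```python
import random

random.seed(0)
TRUE_VALUE = 7.0                                  # the expected return we hope to recover

def noisy_return():
    return TRUE_VALUE + random.gauss(0, 3)        # one sampled return = truth + noise

for n in (10, 100, 1000, 10000):
    estimate = sum(noisy_return() for _ in range(n)) / n
    print(f"n={n:6d}  average={estimate:.3f}  error={abs(estimate - TRUE_VALUE):.3f}")
# The error shrinks roughly like 1/sqrt(n) as more samples are averaged.
```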
For the Curious Mind
Monte Carlo Methods Go Back to the 1940s!
The Monte Carlo method was developed during the Manhattan Project in the 1940s. Scientists like Stanislaw Ulam and John von Neumann used random sampling to simulate neutron diffusion, a problem too complex for analytical solutions, running the first simulations by hand, with mechanical aids, and on the earliest electronic computers.
Why Not Just Use the Latest Return?
You might wonder: why average many returns instead of just using the most recent one? The answer is variance. A single return can be unusually high or low due to randomness. Averaging smooths out this noise. It's like asking many people for directions vs. asking just one person.
MC Powers Modern Game AI!
Monte Carlo Tree Search (MCTS) powered AlphaGo's victory over the world champion. The idea: "From this position, simulate many random games. Pick the move that wins most often." Combined with neural networks, this became unbeatable at Go!
Monte Carlo Estimation of Action Values
When we don't have a model, we need action values for control.
What You'll Learn in This Section
- • Why state values V(s) aren't enough for model-free control
- • What action values Q(s,a) are and why we need them
- • The exploration problem and why it matters
The Big Question: Why Do We Need Action Values?
In Chapter 4 (Dynamic Programming), knowing V(s) was enough. Why? Because we had a model! We could "look ahead" and calculate: "If I take action A in state S, what state S' will I land in, and what's V(S')?"
The Problem Without a Model
Without knowing transition probabilities, we can't compute which action leads to the best next state! Knowing V(s) tells us "how good is this state?" but NOT "which action should I take?"
The Solution: Action Values Q(s,a)
Q(s,a) directly tells us: "How good is it to take action A in state S?"
To choose the best action, we simply pick: argmax_a Q(s,a)
No model needed — just compare Q values!
Without a model, state values alone are not enough for control. We need q(s,a) to know which action to take without computing expected values over unknown transitions.
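In code, this is just a table lookup followed by an argmax. A tiny illustration with a made-up Q-table for a single state:

```python
# Hypothetical Q-table for a single state: action -> estimated value
Q = {"hit": 0.43, "stick": -0.12}

# Model-free action selection: no transition probabilities required,
# just pick the action with the largest Q value.
best_action = max(Q, key=Q.get)
print(best_action)   # -> "hit"
```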
The Exploration Problem (Very Important!)
Here's a critical challenge: if your policy always picks the same action, you'll never learn about alternatives!
🍕 The Pizza Analogy
Imagine you always order pepperoni pizza. You've rated pepperoni as 8/10. But you've never tried the margherita, so Q(restaurant, margherita) = unknown!
What if margherita is actually 10/10? You'll never know unless you explore!
For MC to work, we need to ensure every state-action pair is visited. Two main approaches:
Exploring Starts
Every episode starts from a randomly chosen (state, action) pair. This guarantees every pair gets tried eventually.
Soft Policies
The policy gives nonzero probability to ALL actions: π(a|s) > 0 for all s, a. So every action can be chosen!
Blackjack Example
Blackjack is a perfect MC example! We don't need to know card probabilities — we just play many hands and average results. The state is (your sum, dealer's showing card, usable ace?). Actions are "hit" or "stick."
State Space
- • Player sum: 12-21 (10 values)
- • Dealer showing: 1-10 (10 values)
- • Usable ace: yes/no (2 values)
- • Total: 200 states
MC methods learn the value of each state by averaging returns from many episodes. The state is (player sum, dealer showing, usable ace).
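Below is a self-contained sketch of first-visit MC prediction on a simplified blackjack (infinite deck, dealer hits to 17, player follows a fixed "stick on 20 or 21" policy). The rules and helper names here are illustrative simplifications, not a definitive implementation:

```python
import random
from collections import defaultdict

CARDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]   # ace=1, J/Q/K=10, drawn with replacement

def draw():
    return random.choice(CARDS)

def hand_value(cards):
    """Return (total, usable_ace): one ace counts as 11 when that does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode(policy):
    """Play one hand following `policy`; return (visited states, final reward)."""
    player, dealer = [draw(), draw()], [draw(), draw()]
    dealer_showing = dealer[0]
    states = []
    while True:
        total, usable = hand_value(player)
        if total < 12:                          # below 12 there is no decision: always hit
            player.append(draw())
            continue
        state = (total, dealer_showing, usable)
        states.append(state)
        if total == 21 or policy(state) == "stick":
            break
        player.append(draw())
        if hand_value(player)[0] > 21:
            return states, -1                   # player busts and loses immediately
    player_total = hand_value(player)[0]
    while hand_value(dealer)[0] < 17:           # dealer hits until reaching 17 or more
        dealer.append(draw())
    dealer_total = hand_value(dealer)[0]
    if dealer_total > 21 or player_total > dealer_total:
        return states, +1
    return states, (0 if player_total == dealer_total else -1)

def stick_on_20(state):
    player_sum, _, _ = state
    return "stick" if player_sum >= 20 else "hit"

# First-visit MC prediction: each state's value is the average of the returns observed
returns = defaultdict(list)
for _ in range(100_000):
    states, reward = play_episode(stick_on_20)
    for s in set(states):                       # first-visit: count each state once per episode
        returns[s].append(reward)

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V.get((20, 10, False)))                   # e.g. value of a hard 20 vs. dealer showing 10
```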
Monte Carlo Control
Using MC for both prediction and improvement.
What You'll Learn in This Section
- • How to use MC for control (finding optimal policies), not just prediction
- • The MC version of Generalized Policy Iteration (GPI)
- • Why "Exploring Starts" guarantees convergence
From Prediction to Control: The Same Recipe, Different Ingredients
Remember Generalized Policy Iteration (GPI) from Chapter 4? The beautiful insight: we alternate between evaluating a policy and improving it. MC control uses the exact same pattern!
DP Version (Chapter 4)
Evaluate: Compute V(s) using Bellman equations
Improve: π(s) = argmax over model-computed expectations
MC Version (This Chapter)
Evaluate: Estimate Q(s,a) by averaging returns
Improve: π(s) = argmax_a Q(s,a)
MC Control with Exploring Starts (Step by Step)
1. Generate an episode with exploring starts: pick a random state S₀ and random action A₀ to start, then follow the current policy π until the episode ends.
2. Update Q values (evaluation): for each (state, action) pair visited in the episode, calculate the return G (the sum of rewards from that point onward) and update Q(s,a) to be the average of all such returns.
3. Improve the policy (greedy): for each state visited, update the policy to pick the action with the highest Q value.
4. Repeat! Go back to step 1 and generate another episode. Keep iterating: Q values get better, the policy gets better, which generates better episodes, which gives better Q values...
MC control follows the same GPI pattern as DP: evaluate the policy, then improve it greedily. The difference is we use sampled returns instead of computed expectations. Both converge to the optimal policy!
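A sketch of the full loop, under some assumptions: a hypothetical `step(state, action)` environment function that returns (next_state, reward, done), finite lists of `states` and `actions`, and first-visit averaging:

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(step, states, actions, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (a sketch, not a definitive implementation).

    Assumes a hypothetical environment function
        step(state, action) -> (next_state, reward, done)
    for an episodic task with finite `states` and `actions`.
    """
    Q = defaultdict(float)
    N = defaultdict(int)                            # visit counts for incremental averaging
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Step 1: exploring start, a random (state, action), then follow the current policy
        s, a = random.choice(states), random.choice(actions)
        episode, done = [], False
        while not done:
            s_next, r, done = step(s, a)
            episode.append((s, a, r))
            s = s_next
            if not done:
                a = policy[s]
        # Step 2: evaluate, update Q(s,a) toward the average of first-visit returns
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s_t, a_t, r_t = episode[t]
            G = r_t + gamma * G
            if (s_t, a_t) not in [(x, y) for x, y, _ in episode[:t]]:   # first visit only
                N[(s_t, a_t)] += 1
                Q[(s_t, a_t)] += (G - Q[(s_t, a_t)]) / N[(s_t, a_t)]
                # Step 3: improve, make the policy greedy at this state
                policy[s_t] = max(actions, key=lambda act: Q[(s_t, act)])
        # Step 4: repeat with the improved policy
    return Q, policy
```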
The MC Control Loop (Visual)
Generate (play an episode) → Evaluate (update Q) → Improve (greedy π) → repeat many times → converges to optimal!
The Catch: Exploring Starts is Often Unrealistic
MC control with exploring starts works great in theory. But in practice:
- ✗You can't "teleport" a self-driving car to random positions
- ✗A robot can't start in arbitrary physical configurations
- ✗In games, you usually start from the initial position
Solution: Next section shows how to explore without needing exploring starts!
Monte Carlo Control without Exploring Starts
Using soft policies to ensure exploration.
What You'll Learn in This Section
- • ε-greedy policies: balancing exploration and exploitation
- • Why "soft" policies ensure exploration without special starts
- • The trade-off: finding "best soft policy" vs true optimal
The Elegant Solution: Sometimes Act Randomly!
Since we can't always start from random states, we need another way to explore all actions. The brilliant idea: what if we occasionally take random actions during normal execution?
ε-greedy Policy (The Most Common Solution)
Most of the time (1-ε): Take the best action according to current Q values
Sometimes (ε): Take a completely random action
Example: With ε=0.1, you'll explore 10% of the time and exploit 90% of the time
A policy is ε-soft if π(a|s) ≥ ε/|A| for all states and actions. This guarantees all actions are tried with at least probability ε/|A|. The most common ε-soft policy is ε-greedy.
The ε-Greedy Policy (Detailed Breakdown)
📖 Breaking Down Each Part:
π(a|s) = 1 − ε + ε/|A| if a is the greedy action argmax_a Q(s,a), and ε/|A| otherwise.
- • π(a|s): the probability of taking action a in state s. This is what we're calculating: how likely is each action?
- • ε: the exploration rate, a number between 0 and 1. Common values: 0.1 (10% exploration) or 0.05 (5% exploration).
- • |A|: the number of possible actions. E.g., if actions are {up, down, left, right}, then |A| = 4.
- • argmax_a Q(s,a): the action with the highest Q value (the "greedy" action). This is the action we'd always take if we weren't exploring.
🔢 Worked Example (ε = 0.1, 4 actions):
Suppose Q values for state S are: Q(S, up)=5, Q(S, down)=3, Q(S, left)=2, Q(S, right)=4
Best action: "up" (highest Q = 5)
Probabilities: π(up|S) = 1 − 0.1 + 0.1/4 = 0.925; π(down|S) = π(left|S) = π(right|S) = 0.1/4 = 0.025
Sanity check: 0.925 + 0.025 + 0.025 + 0.025 = 1.0 ✓
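A short sketch that reproduces the worked example (the Q values and ε are the ones assumed above):

```python
import random

def epsilon_greedy_probs(Q_s, epsilon):
    """Return {action: probability} for an ε-greedy policy over one state's Q values."""
    n = len(Q_s)
    best = max(Q_s, key=Q_s.get)
    return {a: (1 - epsilon + epsilon / n) if a == best else epsilon / n for a in Q_s}

def epsilon_greedy_action(Q_s, epsilon):
    """Sample an action: with probability ε pick uniformly at random, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(list(Q_s))
    return max(Q_s, key=Q_s.get)

Q_s = {"up": 5, "down": 3, "left": 2, "right": 4}    # the worked example's Q values
print(epsilon_greedy_probs(Q_s, 0.1))
# {'up': 0.925, 'down': 0.025, 'left': 0.025, 'right': 0.025}
```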
Exploration (ε portion)
Random actions ensure we keep discovering new information. Without this, we might get stuck with a suboptimal policy forever!
Like occasionally trying a new restaurant instead of your usual
Exploitation (1-ε portion)
Taking the best-known action maximizes immediate reward. This is how we benefit from what we've already learned!
Like going to your favorite restaurant that you know is good
On-policy ε-greedy MC converges to the best ε-soft policy, NOT the truly optimal policy. Why? Because we're always exploring with probability ε, so we can't be fully greedy.
To find the true optimal policy, we need off-policy methods (next sections).
For the Curious Mind
The Explore-Exploit Dilemma is Everywhere!
This isn't just an RL problem — it's a fundamental life dilemma! Should you go to your favorite restaurant or try somewhere new? Should a company invest in existing products or R&D? Should you stick with your job or interview elsewhere?
Smarter Exploration Exists!
ε-greedy is simple but "blind": it explores randomly even for well-understood actions. Advanced methods like UCB and Thompson Sampling explore intelligently: more for uncertain actions, less for actions we know well.
Off-policy Prediction via Importance Sampling
Learning about one policy while following another.
What You'll Learn in This Section
- • On-policy vs off-policy learning: what's the difference?
- • Why off-policy is powerful: separate exploration from learning
- • Behavior policy vs target policy concepts
The Breakthrough Idea: Separate Learning from Exploring!
With ε-greedy, we were limited — we could only learn the best ε-soft policy, not the true optimal. Why? Because the same policy had to both explore AND be optimal.
The Off-Policy Solution
What if we use two different policies?
- • Behavior policy (b): What we actually do — exploratory, tries everything
- • Target policy (π): What we want to learn about — can be greedy/optimal!
We follow b (to explore), but we learn about π (which can be optimal). Best of both worlds!
On-Policy
Learn from own actions
Same policy is used for both:
- • Generating experience (behavior)
- • Being evaluated/improved (target)
Example: SARSA, ε-greedy MC Control
Off-Policy
Learn from others' actions
Different policies:
- • Behavior (b): generates data
- • Target (π): being evaluated
Example: Q-learning, Off-policy MC Control
Coverage Requirement
For off-policy learning to work, the behavior policy must have coverage: it must take (with nonzero probability) every action that the target policy might take, i.e. b(a|s) > 0 wherever π(a|s) > 0. Otherwise, we can't learn about those actions.
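Coverage is easy to check mechanically. A small helper, assuming both policies are represented as {state: {action: probability}} dictionaries:

```python
def has_coverage(behavior, target):
    """True if b(a|s) > 0 for every (s, a) where π(a|s) > 0.

    Both policies are assumed to be dicts of the form {state: {action: probability}}.
    """
    for s, action_probs in target.items():
        for a, p in action_probs.items():
            if p > 0 and behavior.get(s, {}).get(a, 0.0) == 0.0:
                return False     # π might take an action that b never takes -> no coverage
    return True
```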
On-Policy vs Off-Policy (Key Distinction)
On-Policy
Learn about the same policy you're using to collect data.
✓ Simpler to implement
✗ Can't find truly optimal policy while exploring
Off-Policy
Learn about a different policy than the one collecting data.
✓ Can learn optimal policy while exploring
✗ More complex (needs importance sampling)
Real-World Analogy: Learning from a Mentor
Imagine you want to learn the optimal strategy for day trading, but you're risk-averse.
Behavior Policy (Your Mentor)
Your mentor makes risky trades, explores different strategies, sometimes loses money. You watch and record everything they do.
Target Policy (What You Want to Learn)
Using your mentor's experiences, you figure out what the optimal strategy would be — without ever taking those risks yourself!
Key insight: You learn from their experiences, but reweight them based on how likely YOU would have made those same choices. That's importance sampling!
Importance Sampling
Correcting for the difference between behavior and target policies.
What You'll Learn in This Section
- • The importance sampling ratio and what it means intuitively
- • Ordinary vs weighted importance sampling
- • The bias-variance trade-off between the two methods
The Core Problem Importance Sampling Solves
We have data from policy b, but we want to estimate values for policy π. The problem: the same trajectory has different probabilities under different policies!
🎯 Example
Suppose under behavior policy b, action sequence [left, left, right] happens 40% of the time. But under target policy π, the same sequence would only happen 5% of the time.
If we just average returns from b, we'd be overweighting trajectories that π rarely takes! We need to correct for this.
The Solution: Reweight by the Ratio
Multiply each return by: (probability under π) / (probability under b)
This "corrects" the sample to reflect what π would have experienced!
The Importance Sampling Ratio (Step by Step)
📖 Understanding Each Symbol:
ρ_{t:T−1} = ∏_{k=t}^{T−1} π(A_k | S_k) / b(A_k | S_k)
- • ρ_{t:T−1}: the importance sampling ratio, a single number that corrects the entire trajectory
- • ∏: multiply all terms together (like Σ (sum), but multiplying instead of adding)
- • π(A_k | S_k): the probability that the target policy π would take action A_k (what we WANT to learn about)
- • b(A_k | S_k): the probability that the behavior policy b took action A_k (what actually happened during data collection)
🔢 Worked Example (3-step trajectory):
Trajectory: S₀ → A₀ → S₁ → A₁ → S₂ → A₂ → Terminal
| Step | π(A\|S) | b(A\|S) | Ratio |
|---|---|---|---|
| k=0 | 0.8 | 0.4 | 2.0 |
| k=1 | 0.6 | 0.3 | 2.0 |
| k=2 | 0.1 | 0.5 | 0.2 |
Total ρ = 2.0 × 2.0 × 0.2 = 0.8
Interpretation: This trajectory is 80% as likely under π as under b. So we slightly downweight its return when estimating V_π.
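The same arithmetic in a few lines of Python, using the probabilities from the table above:

```python
# Per-step probabilities from the worked example
pi_probs = [0.8, 0.6, 0.1]   # π(A_k | S_k)
b_probs  = [0.4, 0.3, 0.5]   # b(A_k | S_k)

rho = 1.0
for p_pi, p_b in zip(pi_probs, b_probs):
    rho *= p_pi / p_b        # multiply the per-step ratios: 2.0 * 2.0 * 0.2
print(rho)                   # ≈ 0.8
```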
Interactive Importance Sampling Demo
Behavior policy b: the policy we use to collect data. Target policy π: the policy we want to evaluate.
Importance Sampling Ratio
The ratio corrects for the difference between the two policies. When π(a) > b(a), we upweight that sample. When π(a) < b(a), we downweight it.
Ordinary IS
Simple average of weighted returns. Unbiased but high variance.
Weighted IS
Normalize by sum of ratios. Biased but lower variance.
Try setting the policies very different (e.g., b=0.2, π=0.8) to see high variance in ordinary IS. Weighted IS is more stable but slightly biased.
Ordinary Importance Sampling
V(s) ≈ Σ (ρ × G) / n. Plain English: the sum of (ratio × return), divided by the number of samples.
✓ Unbiased (correct on average)
✗ High variance (can swing wildly)
Weighted Importance Sampling
V(s) ≈ Σ (ρ × G) / Σ ρ. Plain English: the sum of (ratio × return), divided by the sum of the ratios.
✗ Biased (systematically off, though the bias shrinks as samples accumulate)
✓ Lower variance (more stable)
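A sketch of both estimators over a batch of (ratio, return) pairs; the sample numbers below are made up purely for illustration:

```python
def ordinary_is(samples):
    """Ordinary importance sampling: sum of (ratio * return) / number of samples."""
    return sum(rho * G for rho, G in samples) / len(samples)

def weighted_is(samples):
    """Weighted importance sampling: sum of (ratio * return) / sum of ratios."""
    total_weight = sum(rho for rho, _ in samples)
    return sum(rho * G for rho, G in samples) / total_weight if total_weight > 0 else 0.0

# (importance ratio, observed return) pairs -- illustrative values only
samples = [(0.8, 10.0), (2.5, 4.0), (0.1, 8.0), (4.0, 6.0)]
print(ordinary_is(samples))   # unbiased, but sensitive to a few large ratios
print(weighted_is(samples))   # biased, but more stable
```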
The Bias-Variance Trade-off (Important Concept!)
This is a fundamental trade-off in machine learning that you'll see again and again:
Bias
How far off is our estimate on average? If we repeated the experiment many times, would we get the right answer on average?
Ordinary IS: unbiased. Weighted IS: has bias (systematically off, but the bias vanishes as samples accumulate)
Variance
How much do our estimates fluctuate? If we repeated the experiment, how much would results vary?
Ordinary IS: high variance. Weighted IS: low variance (more stable)
Bottom line: Weighted IS is usually preferred in practice because its bias disappears with more samples, while high variance can make learning unstable.
Off-policy Monte Carlo Control
Learning the optimal policy while following an exploratory policy.
What You'll Learn in This Section
- • How to combine off-policy learning with MC control
- • Why this lets us find the TRUE optimal policy
- • Challenges: variance issues with importance sampling
Putting It All Together: The Best of Both Worlds!
Now we combine everything we've learned:
- ✓ Behavior policy b: Exploratory (e.g., ε-greedy) — ensures we try everything
- ✓ Target policy π: Greedy w.r.t. Q — the optimal policy we want to learn!
- ✓ Importance sampling: Corrects for the policy difference
Result: We can learn the truly optimal policy while still exploring!
Off-policy MC Control Algorithm
1. Follow the behavior policy b: use any soft policy (e.g., ε-greedy) to generate episodes. This ensures exploration.
2. Update Q using importance-weighted returns: for each (s,a) pair, update Q(s,a) using returns weighted by the importance sampling ratio ρ.
3. Improve the target policy π: set π(s) = argmax_a Q(s,a). The target policy is GREEDY; no exploration needed!
Unlike on-policy ε-greedy MC, off-policy MC can learn the truly optimal policy! The target policy can be completely greedy because exploration is handled separately by the behavior policy.
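A sketch of this control loop using the incremental weighted-importance-sampling update; the `generate_episode` and `behavior_prob` interfaces are assumptions for illustration, not a fixed API:

```python
from collections import defaultdict

def off_policy_mc_control(generate_episode, behavior_prob, actions, num_episodes, gamma=1.0):
    """Off-policy MC control with weighted importance sampling (sketch).

    Assumed (hypothetical) interface:
      generate_episode()  -> list of (state, action, reward) triples, produced by
                             following some soft behavior policy b
      behavior_prob(s, a) -> b(a|s), the probability b assigned to the action it took
    """
    Q = defaultdict(float)
    C = defaultdict(float)               # cumulative importance-sampling weights
    target = {}                          # greedy target policy: state -> best action

    for _ in range(num_episodes):
        episode = generate_episode()
        G, W = 0.0, 1.0                  # return and importance weight, built backwards
        for s, a, r in reversed(episode):
            G = r + gamma * G
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])   # weighted-IS incremental update
            target[s] = max(actions, key=lambda act: Q[(s, act)])
            if a != target[s]:
                break                    # target is greedy, so π(a|s) = 0 and the ratio becomes 0
            W *= 1.0 / behavior_prob(s, a)   # π(a|s) = 1 for the greedy action
    return Q, target
```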
Challenges with Off-policy MC
High Variance
Importance ratios can be huge for long episodes! If π and b differ significantly, ρ can explode (or vanish), making estimates unstable.
Slow Learning
MC must wait for episodes to complete before updating. Long episodes mean slow learning, especially with high-variance importance ratios.
Spoiler: TD Methods Fix These Issues!
In Chapter 6, you'll learn Temporal-Difference (TD) learning. TD combines MC's model-free advantage with DP's bootstrapping:
- ✓ Updates after every step, not just at episode end
- ✓ No need to wait for complete episodes
- ✓ Can work with continuing (non-episodic) tasks
- ✓ Importance sampling only for one step → much lower variance
Summary
Key takeaways from Chapter 5.
Quick Reference Glossary
Monte Carlo
Learning from complete episodes by averaging actual returns (no model needed)
First-Visit MC
Only count the first return for each state per episode
Every-Visit MC
Count returns from all visits to each state
ε-Greedy Policy
Pick best action with prob 1-ε, random with prob ε
On-Policy
Learn about the same policy you're following
Off-Policy
Learn about one policy while following another
Behavior Policy (b)
The policy we follow to collect data (exploratory)
Target Policy (π)
The policy we want to learn about (can be optimal)
Importance Sampling
Reweighting returns to correct for policy mismatch
Exploring Starts
Every state-action pair has nonzero starting probability
Formula Cheat Sheet
ε-Greedy Policy: π(a|s) = 1 − ε + ε/|A| if a = argmax_a Q(s,a), else ε/|A|
Importance Sampling Ratio: ρ_{t:T−1} = ∏_{k=t}^{T−1} π(A_k|S_k) / b(A_k|S_k)
Ordinary IS Estimate: V(s) = Σ ρ_t G_t / N(s)  (sum over visits to s, divided by the number of visits)
Weighted IS Estimate: V(s) = Σ ρ_t G_t / Σ ρ_t  (sum over visits to s, divided by the sum of the ratios)
Common Mistakes to Avoid
"MC can be used for continuing tasks"
Correct: MC requires episodes to END. For tasks without natural endings, use TD methods (Chapter 6).
"On-policy ε-greedy MC finds the optimal policy"
Correct: It finds the best ε-soft policy. For the TRUE optimal, you need off-policy methods.
"MC and DP give the same results immediately"
Correct: MC converges to the same values, but needs many samples. DP is exact (with a model), MC learns from experience.
Key Takeaways from This Chapter
🎲 Monte Carlo = Learning from experience — no model of the environment needed!
📊 Core idea: Average the returns you actually observe to estimate values
🔄 MC Control: Same GPI pattern as DP (evaluate → improve → repeat)
⚖️ ε-greedy: Balance exploration (trying new things) and exploitation (using what you know)
🎯 Off-policy: Separate the exploration policy from the policy you're learning about
📐 Importance sampling: Corrects returns when learning off-policy
Monte Carlo methods learn from complete episodes without a model
MC prediction averages sample returns to estimate value functions
First-visit and every-visit MC both converge to the true values
MC control requires exploration: exploring starts or soft policies
On-policy methods learn about the policy being followed
Off-policy methods learn about one policy using data from another
Importance sampling corrects for the difference between policies
Weighted IS has lower variance but is biased; ordinary IS is unbiased
The Main Limitation: Must Wait for Episodes to End
MC methods have one big drawback: we can't update values until the episode finishes.
Looking Ahead: Chapter 6
In the next chapter, we'll learn Temporal-Difference (TD) learning — the brilliant combination of MC's model-free approach with DP's bootstrapping. TD methods update after every step, not just at episode end, making them much more practical for real-world applications!