Monte Carlo Methods
Learn from complete episodes without knowing the environment model. MC methods average sample returns to estimate value functions.
Welcome! Here's What This Chapter is About
In Chapter 4, we learned Dynamic Programming (DP) — powerful methods that require a complete model of the environment (all transition probabilities and rewards). But here's the problem: in the real world, we rarely have such a perfect model!
Monte Carlo methods solve this problem by learning directly from experience — actual sequences of states, actions, and rewards. No model needed!
Simple analogy: Imagine learning to navigate a new city.
- • DP approach: Study a perfect map with traffic patterns, road conditions, and timing for every route
- • MC approach: Just start driving! Take different routes, track how long each trip takes, and average your experiences
Prerequisites (Quick Refresher)
- • Model-Free: no dynamics needed
- • Episodic: complete episodes
- • Sample Averages: unbiased estimates
The Big Picture: Learning from Episodes
What Exactly is an "Episode"?
An episode is one complete experience from start to finish. Think of it like one game of chess, one maze run, or one blackjack hand — it has a clear beginning and end.
Example episode (robot navigating a room), with step rewards of -1, 0, and +10:
Total return: -1 + 0 + 10 = +9
🎲 Why "Monte Carlo"?
The name comes from the famous casino in Monaco! Just like gambling involves randomness and averaging outcomes, MC methods use random sampling to estimate values. The key insight: if you flip a coin enough times, you'll discover it's ~50% heads.
🔑 The Key Insight
We don't need to know why certain states lead to good outcomes. We just need to observe what happens and average the results. With enough samples, our averages converge to the true expected values!
🎯 The MC Recipe (3 Simple Steps)
1. Play through a complete episode: follow your policy from start to terminal state.
2. Calculate the return for each visited state: sum up all rewards from that state until the episode ends.
3. Update each state's value to the average of observed returns: more samples → better estimate!
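Here's a minimal Python sketch of that recipe, assuming a hypothetical `run_episode(policy)` helper that plays one full episode and returns a list of (state, reward) pairs; for simplicity it averages every visit:

```python
from collections import defaultdict

def mc_prediction(run_episode, policy, num_episodes, gamma=1.0):
    """Estimate V(s) by averaging observed returns (a sketch of the 3-step recipe)."""
    returns = defaultdict(list)              # state -> list of observed returns
    for _ in range(num_episodes):
        episode = run_episode(policy)        # Step 1: play one complete episode
        G = 0.0
        # Step 2: walk backwards so G accumulates the rewards from each state onward
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)         # (every-visit averaging, for simplicity)
    # Step 3: each state's value is the average of its observed returns
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```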
Monte Carlo Prediction
Estimating state values by averaging sample returns.
What You'll Learn in This Section
- • How to estimate V(s) without knowing the environment's dynamics
- • The difference between First-Visit and Every-Visit MC
- • Why averaging samples gives us the true value (Law of Large Numbers)
Given a policy π, estimate v_π(s) by averaging returns observed after visiting state s. No model of the environment is needed—just sample episodes.
Deep Dive Analogy: Rating Coffee Shops
Imagine you're trying to figure out the "value" of different neighborhoods for getting coffee. You have no idea about shop ratings, prices, or quality beforehand — you just start visiting.
Your "Episode" = One Coffee Trip
You leave home, walk through neighborhood A, stop at a coffee shop, rate your experience (the "return"), and go home. That entire journey from leaving to returning home is one episode.
Trip 1: Neighborhood A
Great latte! Rating: +8
Trip 2: Neighborhood A
Long wait, ok coffee. Rating: +5
Trip 3: Neighborhood A
Amazing pastry! Rating: +9
Trip 4: Neighborhood A
Crowded, decent. Rating: +6
Your estimate of V(Neighborhood A):
(8 + 5 + 9 + 6) / 4 = 7.0
As you make more trips, this average gets closer to the "true" value of neighborhood A for coffee!
First-Visit vs. Every-Visit: What's the Difference?
Sometimes you visit the same state multiple times within a single episode. For example, you might walk through neighborhood A twice during one coffee trip (going and returning). Should you count both visits or just the first one?
First-Visit MC
Only use the return from the first time you visit state s in each episode. Ignore subsequent visits.
Example episode: A → B → A → C → Goal
For state A: Only count the return from the first A, ignore the second A
Every-Visit MC
Use the return from every visit to state s, even if you visit multiple times per episode.
Example episode: A → B → A → C → Goal
For state A: Count returns from both visits to A
Bottom line: Both methods converge to the true value V(s). First-Visit is more common in theory; Every-Visit is sometimes easier to implement. For most practical purposes, the difference is minor.
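To make the difference concrete, here's a small sketch (the per-step rewards are made up for illustration) that computes the returns each method would record from the episode A → B → A → C → Goal:

```python
from collections import defaultdict

def first_visit_returns(episode, gamma=1.0):
    """Return {state: return from its FIRST visit} for one episode."""
    G, out = 0.0, {}
    for state, reward in reversed(episode):  # walk backwards, accumulating the return
        G = reward + gamma * G
        out[state] = G                       # earlier visits overwrite later ones,
                                             # so only the first visit's return survives
    return out

def every_visit_returns(episode, gamma=1.0):
    """Return {state: [return from EVERY visit]} for one episode."""
    G, out = 0.0, defaultdict(list)
    for state, reward in reversed(episode):
        G = reward + gamma * G
        out[state].append(G)
    return out

# Episode A -> B -> A -> C -> Goal as (state, reward-on-leaving) pairs; rewards are illustrative
episode = [("A", -1), ("B", -1), ("A", -1), ("C", 10)]
print(first_visit_returns(episode))          # A counted once:  {'C': 10, 'A': 7, 'B': 8}
print(dict(every_visit_returns(episode)))    # A counted twice: {'C': [10], 'A': [9, 7], 'B': [8]}
```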
Interactive First-Visit MC Demo
First-Visit MC Update
Each state's value is the average of all returns observed after first visiting that state.
Watch how values converge as more episodes are sampled. States visited more often have more accurate estimates.
MC vs DP: A Detailed Comparison
| Aspect | Dynamic Programming | Monte Carlo |
|---|---|---|
| Model Required? | Yes - need p(s',r\|s,a) | No - learn from experience |
| Bootstrapping? | Yes - uses V(s') estimates | No - uses actual returns |
| Updates | All states every sweep | Only visited states |
| When to Update? | Any time (synchronous) | After episode ends |
| Analogy | Planning with a perfect map | Learning by exploring |
Why Does Averaging Work? (The Math Behind MC)
The magic of MC comes from a fundamental principle in probability: the Law of Large Numbers.
The Law of Large Numbers (Simplified)
As you collect more and more samples, their average converges to the true expected value.
(G₁ + G₂ + … + Gₙ) / n → true value (as n gets large)
- • n=10: rough estimate
- • n=100: better estimate
- • n=1000: very accurate!
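You can watch the Law of Large Numbers in action with a tiny standalone simulation (not tied to any environment): sample noisy returns around a known mean and see the average home in on it.

```python
import random

random.seed(0)
TRUE_VALUE = 7.0                                  # the expected return we hope to recover

def noisy_return():
    return TRUE_VALUE + random.gauss(0, 3)        # one sampled return = truth + noise

for n in (10, 100, 1000, 10000):
    estimate = sum(noisy_return() for _ in range(n)) / n
    print(f"n={n:6d}  average={estimate:.3f}  error={abs(estimate - TRUE_VALUE):.3f}")
# The error shrinks roughly like 1/sqrt(n) as more samples are averaged.
```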
For the Curious Mind
Monte Carlo Methods Go Back to the 1940s!
The Monte Carlo method was developed during the Manhattan Project in the 1940s. Scientists like Stanislaw Ulam and John von Neumann used random sampling to simulate neutron diffusion, a problem too complex for analytical solutions, running the first simulations by hand, with mechanical aids, and on the earliest electronic computers.
Why Not Just Use the Latest Return?
You might wonder: why average many returns instead of just using the most recent one? The answer is variance. A single return can be unusually high or low due to randomness. Averaging smooths out this noise. It's like asking many people for directions vs. asking just one person.
MC Powers Modern Game AI!
Monte Carlo Tree Search (MCTS) powered AlphaGo's victory over the world champion. The idea: "From this position, simulate many random games. Pick the move that wins most often." Combined with neural networks, this became unbeatable at Go!
Monte Carlo Estimation of Action Values
When we don't have a model, we need action values for control.
What You'll Learn in This Section
- • Why state values V(s) aren't enough for model-free control
- • What action values Q(s,a) are and why we need them
- • The exploration problem and why it matters
The Big Question: Why Do We Need Action Values?
In Chapter 4 (Dynamic Programming), knowing V(s) was enough. Why? Because we had a model! We could "look ahead" and calculate: "If I take action A in state S, what state S' will I land in, and what's V(S')?"
The Problem Without a Model
Without knowing transition probabilities, we can't compute which action leads to the best next state! Knowing V(s) tells us "how good is this state?" but NOT "which action should I take?"
The Solution: Action Values Q(s,a)
Q(s,a) directly tells us: "How good is it to take action A in state S?"
To choose the best action, we simply pick: argmax_a Q(s,a)
No model needed — just compare Q values!
Without a model, state values alone are not enough for control. We need q(s,a) to know which action to take without computing expected values over unknown transitions.
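In code, this is just a table lookup followed by an argmax. A tiny illustration with a made-up Q-table for a single state:

```python
# Hypothetical Q-table for a single state: action -> estimated value
Q = {"hit": 0.43, "stick": -0.12}

# Model-free action selection: no transition probabilities required,
# just pick the action with the largest Q value.
best_action = max(Q, key=Q.get)
print(best_action)   # -> "hit"
```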
The Exploration Problem (Very Important!)
Here's a critical challenge: if your policy always picks the same action, you'll never learn about alternatives!
🍕 The Pizza Analogy
Imagine you always order pepperoni pizza. You've rated pepperoni as 8/10. But you've never tried the margherita, so Q(restaurant, margherita) = unknown!
What if margherita is actually 10/10? You'll never know unless you explore!
For MC to work, we need to ensure every state-action pair is visited. Two main approaches:
Exploring Starts
Every episode starts from a randomly chosen (state, action) pair. This guarantees every pair gets tried eventually.
Soft Policies
The policy gives nonzero probability to ALL actions: π(a|s) > 0 for all s, a. So every action can be chosen!
Blackjack Example
Blackjack is a perfect MC example! We don't need to know card probabilities — we just play many hands and average results. The state is (your sum, dealer's showing card, usable ace?). Actions are "hit" or "stick."
State Space
- • Player sum: 12-21 (10 values)
- • Dealer showing: 1-10 (10 values)
- • Usable ace: yes/no (2 values)
- • Total: 200 states
MC methods learn the value of each state by averaging returns from many episodes. The state is (player sum, dealer showing, usable ace).
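Below is a self-contained sketch of first-visit MC prediction on a simplified blackjack (infinite deck, dealer hits to 17, player follows a fixed "stick on 20 or 21" policy). The rules and helper names here are illustrative simplifications, not a definitive implementation:

```python
import random
from collections import defaultdict

CARDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]   # ace=1, J/Q/K=10, drawn with replacement

def draw():
    return random.choice(CARDS)

def hand_value(cards):
    """Return (total, usable_ace): one ace counts as 11 when that does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode(policy):
    """Play one hand following `policy`; return (visited states, final reward)."""
    player, dealer = [draw(), draw()], [draw(), draw()]
    dealer_showing = dealer[0]
    states = []
    while True:
        total, usable = hand_value(player)
        if total < 12:                          # below 12 there is no decision: always hit
            player.append(draw())
            continue
        state = (total, dealer_showing, usable)
        states.append(state)
        if total == 21 or policy(state) == "stick":
            break
        player.append(draw())
        if hand_value(player)[0] > 21:
            return states, -1                   # player busts and loses immediately
    player_total = hand_value(player)[0]
    while hand_value(dealer)[0] < 17:           # dealer hits until reaching 17 or more
        dealer.append(draw())
    dealer_total = hand_value(dealer)[0]
    if dealer_total > 21 or player_total > dealer_total:
        return states, +1
    return states, (0 if player_total == dealer_total else -1)

def stick_on_20(state):
    player_sum, _, _ = state
    return "stick" if player_sum >= 20 else "hit"

# First-visit MC prediction: each state's value is the average of the returns observed
returns = defaultdict(list)
for _ in range(100_000):
    states, reward = play_episode(stick_on_20)
    for s in set(states):                       # first-visit: count each state once per episode
        returns[s].append(reward)

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V.get((20, 10, False)))                   # e.g. value of a hard 20 vs. dealer showing 10
```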
Monte Carlo Control
Using MC for both prediction and improvement.
What You'll Learn in This Section
- • How to use MC for control (finding optimal policies), not just prediction
- • The MC version of Generalized Policy Iteration (GPI)
- • Why "Exploring Starts" guarantees convergence
From Prediction to Control: The Same Recipe, Different Ingredients
Remember Generalized Policy Iteration (GPI) from Chapter 4? The beautiful insight: we alternate between evaluating a policy and improving it. MC control uses the exact same pattern!
DP Version (Chapter 4)
Evaluate: Compute V(s) using Bellman equations
Improve: π(s) = argmax over model-computed expectations
MC Version (This Chapter)
Evaluate: Estimate Q(s,a) by averaging returns
Improve: π(s) = argmax_a Q(s,a)
MC Control with Exploring Starts (Step by Step)
1. Generate an episode with exploring starts: pick a random state S₀ and random action A₀ to start, then follow the current policy π until the episode ends.
2. Update Q values (evaluation): for each (state, action) pair visited in the episode, calculate the return G (the sum of rewards from that point onward) and update Q(s,a) to be the average of all such returns.
3. Improve the policy (greedy): for each state visited, update the policy to pick the action with the highest Q value.
4. Repeat! Go back to step 1 and generate another episode. Keep iterating: Q values get better, the policy gets better, which generates better episodes, which gives better Q values...
MC control follows the same GPI pattern as DP: evaluate the policy, then improve it greedily. The difference is we use sampled returns instead of computed expectations. Both converge to the optimal policy!
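A sketch of the full loop, under some assumptions: a hypothetical `step(state, action)` environment function that returns (next_state, reward, done), finite lists of `states` and `actions`, and first-visit averaging:

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(step, states, actions, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (a sketch, not a definitive implementation).

    Assumes a hypothetical environment function
        step(state, action) -> (next_state, reward, done)
    for an episodic task with finite `states` and `actions`.
    """
    Q = defaultdict(float)
    N = defaultdict(int)                            # visit counts for incremental averaging
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Step 1: exploring start, a random (state, action), then follow the current policy
        s, a = random.choice(states), random.choice(actions)
        episode, done = [], False
        while not done:
            s_next, r, done = step(s, a)
            episode.append((s, a, r))
            s = s_next
            if not done:
                a = policy[s]
        # Step 2: evaluate, update Q(s,a) toward the average of first-visit returns
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s_t, a_t, r_t = episode[t]
            G = r_t + gamma * G
            if (s_t, a_t) not in [(x, y) for x, y, _ in episode[:t]]:   # first visit only
                N[(s_t, a_t)] += 1
                Q[(s_t, a_t)] += (G - Q[(s_t, a_t)]) / N[(s_t, a_t)]
                # Step 3: improve, make the policy greedy at this state
                policy[s_t] = max(actions, key=lambda act: Q[(s_t, act)])
        # Step 4: repeat with the improved policy
    return Q, policy
```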
The MC Control Loop (Visual)
Generate (play an episode) → Evaluate (update Q) → Improve (greedy π) → repeat many times → converges to optimal!
The Catch: Exploring Starts is Often Unrealistic
MC control with exploring starts works great in theory. But in practice:
- ✗You can't "teleport" a self-driving car to random positions
- ✗A robot can't start in arbitrary physical configurations
- ✗In games, you usually start from the initial position
Solution: Next section shows how to explore without needing exploring starts!
Monte Carlo Control without Exploring Starts
Using soft policies to ensure exploration.
What You'll Learn in This Section
- • ε-greedy policies: balancing exploration and exploitation
- • Why "soft" policies ensure exploration without special starts
- • The trade-off: finding "best soft policy" vs true optimal
The Elegant Solution: Sometimes Act Randomly!
Since we can't always start from random states, we need another way to explore all actions. The brilliant idea: what if we occasionally take random actions during normal execution?
ε-greedy Policy (The Most Common Solution)
Most of the time (1-ε): Take the best action according to current Q values
Sometimes (ε): Take a completely random action
Example: With ε=0.1, you'll explore 10% of the time and exploit 90% of the time
A policy is ε-soft if π(a|s) ≥ ε/|A| for all states and actions. This guarantees all actions are tried with at least probability ε/|A|. The most common ε-soft policy is ε-greedy.
The ε-Greedy Policy (Detailed Breakdown)
📖 Breaking Down Each Part:
π(a|s) = 1 − ε + ε/|A| if a is the greedy action argmax_a Q(s,a), and ε/|A| otherwise.
- • π(a|s): the probability of taking action a in state s. This is what we're calculating: how likely is each action?
- • ε: the exploration rate, a number between 0 and 1. Common values: 0.1 (10% exploration) or 0.05 (5% exploration).
- • |A|: the number of possible actions. E.g., if actions are {up, down, left, right}, then |A| = 4.
- • argmax_a Q(s,a): the action with the highest Q value (the "greedy" action). This is the action we'd always take if we weren't exploring.
🔢 Worked Example (ε = 0.1, 4 actions):
Suppose Q values for state S are: Q(S, up)=5, Q(S, down)=3, Q(S, left)=2, Q(S, right)=4
Best action: "up" (highest Q = 5)
Probabilities: π(up|S) = 1 − 0.1 + 0.1/4 = 0.925; π(down|S) = π(left|S) = π(right|S) = 0.1/4 = 0.025
Sanity check: 0.925 + 0.025 + 0.025 + 0.025 = 1.0 ✓
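A short sketch that reproduces the worked example (the Q values and ε are the ones assumed above):

```python
import random

def epsilon_greedy_probs(Q_s, epsilon):
    """Return {action: probability} for an ε-greedy policy over one state's Q values."""
    n = len(Q_s)
    best = max(Q_s, key=Q_s.get)
    return {a: (1 - epsilon + epsilon / n) if a == best else epsilon / n for a in Q_s}

def epsilon_greedy_action(Q_s, epsilon):
    """Sample an action: with probability ε pick uniformly at random, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(list(Q_s))
    return max(Q_s, key=Q_s.get)

Q_s = {"up": 5, "down": 3, "left": 2, "right": 4}    # the worked example's Q values
print(epsilon_greedy_probs(Q_s, 0.1))
# {'up': 0.925, 'down': 0.025, 'left': 0.025, 'right': 0.025}
```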
Exploration (ε portion)
Random actions ensure we keep discovering new information. Without this, we might get stuck with a suboptimal policy forever!
Like occasionally trying a new restaurant instead of your usual
Exploitation (1-ε portion)
Taking the best-known action maximizes immediate reward. This is how we benefit from what we've already learned!
Like going to your favorite restaurant that you know is good
On-policy ε-greedy MC converges to the best ε-soft policy, NOT the truly optimal policy. Why? Because we're always exploring with probability ε, so we can't be fully greedy.
To find the true optimal policy, we need off-policy methods (next sections).
For the Curious Mind
The Explore-Exploit Dilemma is Everywhere!
This isn't just an RL problem — it's a fundamental life dilemma! Should you go to your favorite restaurant or try somewhere new? Should a company invest in existing products or R&D? Should you stick with your job or interview elsewhere?
Smarter Exploration Exists!
ε-greedy is simple but "blind": it explores randomly even for well-understood actions. Advanced methods like UCB and Thompson Sampling explore intelligently: more for uncertain actions, less for actions we know well.
Off-policy Prediction via Importance Sampling
Learning about one policy while following another.
What You'll Learn in This Section
- • On-policy vs off-policy learning: what's the difference?
- • Why off-policy is powerful: separate exploration from learning
- • Behavior policy vs target policy concepts
The Breakthrough Idea: Separate Learning from Exploring!
With ε-greedy, we were limited — we could only learn the best ε-soft policy, not the true optimal. Why? Because the same policy had to both explore AND be optimal.
The Off-Policy Solution
What if we use two different policies?
- • Behavior policy (b): What we actually do — exploratory, tries everything
- • Target policy (π): What we want to learn about — can be greedy/optimal!
We follow b (to explore), but we learn about π (which can be optimal). Best of both worlds!
On-Policy
Learn from own actions
Same policy is used for both:
- • Generating experience (behavior)
- • Being evaluated/improved (target)
Example: SARSA, ε-greedy MC Control
Off-Policy
Learn from others' actions
Different policies:
- • Behavior (b): generates data
- • Target (π): being evaluated
Example: Q-learning, Off-policy MC Control
Coverage Requirement
For off-policy learning to work, the behavior policy must have coverage: it must take (with nonzero probability) every action that the target policy might take, i.e. b(a|s) > 0 wherever π(a|s) > 0. Otherwise, we can't learn about those actions.
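Coverage is easy to check mechanically. A small helper, assuming both policies are represented as {state: {action: probability}} dictionaries:

```python
def has_coverage(behavior, target):
    """True if b(a|s) > 0 for every (s, a) where π(a|s) > 0.

    Both policies are assumed to be dicts of the form {state: {action: probability}}.
    """
    for s, action_probs in target.items():
        for a, p in action_probs.items():
            if p > 0 and behavior.get(s, {}).get(a, 0.0) == 0.0:
                return False     # π might take an action that b never takes -> no coverage
    return True
```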
On-Policy vs Off-Policy (Key Distinction)
On-Policy
Learn about the same policy you're using to collect data.
✓ Simpler to implement
✗ Can't find truly optimal policy while exploring
Off-Policy
Learn about a different policy than the one collecting data.
✓ Can learn optimal policy while exploring
✗ More complex (needs importance sampling)
Real-World Analogy: Learning from a Mentor
Imagine you want to learn the optimal strategy for day trading, but you're risk-averse.
Behavior Policy (Your Mentor)
Your mentor makes risky trades, explores different strategies, sometimes loses money. You watch and record everything they do.
Target Policy (What You Want to Learn)
Using your mentor's experiences, you figure out what the optimal strategy would be — without ever taking those risks yourself!
Key insight: You learn from their experiences, but reweight them based on how likely YOU would have made those same choices. That's importance sampling!
Importance Sampling
Correcting for the difference between behavior and target policies.
What You'll Learn in This Section
- • The importance sampling ratio and what it means intuitively
- • Ordinary vs weighted importance sampling
- • The bias-variance trade-off between the two methods
The Core Problem Importance Sampling Solves
We have data from policy b, but we want to estimate values for policy π. The problem: the same trajectory has different probabilities under different policies!
🎯 Example
Suppose under behavior policy b, action sequence [left, left, right] happens 40% of the time. But under target policy π, the same sequence would only happen 5% of the time.
If we just average returns from b, we'd be overweighting trajectories that π rarely takes! We need to correct for this.
The Solution: Reweight by the Ratio
Multiply each return by: (probability under π) / (probability under b)
This "corrects" the sample to reflect what π would have experienced!
The Importance Sampling Ratio (Step by Step)
📖 Understanding Each Symbol:
ρ_{t:T−1} = ∏_{k=t}^{T−1} π(A_k | S_k) / b(A_k | S_k)
- • ρ_{t:T−1}: the importance sampling ratio, a single number that corrects the entire trajectory
- • ∏: multiply all terms together (like Σ (sum), but multiplying instead of adding)
- • π(A_k | S_k): the probability that the target policy π would take action A_k (what we WANT to learn about)
- • b(A_k | S_k): the probability that the behavior policy b took action A_k (what actually happened during data collection)
🔢 Worked Example (3-step trajectory):
Trajectory: S₀ → A₀ → S₁ → A₁ → S₂ → A₂ → Terminal
| Step | π(A\|S) | b(A\|S) | Ratio |
|---|---|---|---|
| k=0 | 0.8 | 0.4 | 2.0 |
| k=1 | 0.6 | 0.3 | 2.0 |
| k=2 | 0.1 | 0.5 | 0.2 |
Total ρ = 2.0 × 2.0 × 0.2 = 0.8
Interpretation: This trajectory is 80% as likely under π as under b. So we slightly downweight its return when estimating V_π.
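The same arithmetic in a few lines of Python, using the probabilities from the table above:

```python
# Per-step probabilities from the worked example
pi_probs = [0.8, 0.6, 0.1]   # π(A_k | S_k)
b_probs  = [0.4, 0.3, 0.5]   # b(A_k | S_k)

rho = 1.0
for p_pi, p_b in zip(pi_probs, b_probs):
    rho *= p_pi / p_b        # multiply the per-step ratios: 2.0 * 2.0 * 0.2
print(rho)                   # ≈ 0.8
```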
Interactive Importance Sampling Demo
Behavior policy b: the policy we use to collect data. Target policy π: the policy we want to evaluate.
Importance Sampling Ratio
The ratio corrects for the difference between the two policies. When π(a) > b(a), we upweight that sample. When π(a) < b(a), we downweight it.
Ordinary IS
Simple average of weighted returns. Unbiased but high variance.
Weighted IS
Normalize by sum of ratios. Biased but lower variance.
Try setting the policies very different (e.g., b=0.2, π=0.8) to see high variance in ordinary IS. Weighted IS is more stable but slightly biased.
Ordinary Importance Sampling
V(s) ≈ Σ (ρ × G) / n. Plain English: the sum of (ratio × return), divided by the number of samples.
✓ Unbiased (correct on average)
✗ High variance (can swing wildly)
Weighted Importance Sampling
V(s) ≈ Σ (ρ × G) / Σ ρ. Plain English: the sum of (ratio × return), divided by the sum of the ratios.
✗ Biased (systematically off, though the bias shrinks as samples accumulate)
✓ Lower variance (more stable)
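A sketch of both estimators over a batch of (ratio, return) pairs; the sample numbers below are made up purely for illustration:

```python
def ordinary_is(samples):
    """Ordinary importance sampling: sum of (ratio * return) / number of samples."""
    return sum(rho * G for rho, G in samples) / len(samples)

def weighted_is(samples):
    """Weighted importance sampling: sum of (ratio * return) / sum of ratios."""
    total_weight = sum(rho for rho, _ in samples)
    return sum(rho * G for rho, G in samples) / total_weight if total_weight > 0 else 0.0

# (importance ratio, observed return) pairs -- illustrative values only
samples = [(0.8, 10.0), (2.5, 4.0), (0.1, 8.0), (4.0, 6.0)]
print(ordinary_is(samples))   # unbiased, but sensitive to a few large ratios
print(weighted_is(samples))   # biased, but more stable
```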
The Bias-Variance Trade-off (Important Concept!)
This is a fundamental trade-off in machine learning that you'll see again and again:
Bias
How far off is our estimate on average? If we repeated the experiment many times, would we get the right answer on average?
Ordinary IS: unbiased. Weighted IS: has bias (systematically off, but the bias vanishes as samples accumulate)
Variance
How much do our estimates fluctuate? If we repeated the experiment, how much would results vary?
Ordinary IS: high variance. Weighted IS: low variance (more stable)
Bottom line: Weighted IS is usually preferred in practice because its bias disappears with more samples, while high variance can make learning unstable.
Off-policy Monte Carlo Control
Learning the optimal policy while following an exploratory policy.
What You'll Learn in This Section
- • How to combine off-policy learning with MC control
- • Why this lets us find the TRUE optimal policy
- • Challenges: variance issues with importance sampling
Putting It All Together: The Best of Both Worlds!
Now we combine everything we've learned:
- ✓ Behavior policy b: Exploratory (e.g., ε-greedy) — ensures we try everything
- ✓ Target policy π: Greedy w.r.t. Q — the optimal policy we want to learn!
- ✓ Importance sampling: Corrects for the policy difference
Result: We can learn the truly optimal policy while still exploring!
Off-policy MC Control Algorithm
1. Follow the behavior policy b: use any soft policy (e.g., ε-greedy) to generate episodes. This ensures exploration.
2. Update Q using importance-weighted returns: for each (s,a) pair, update Q(s,a) using returns weighted by the importance sampling ratio ρ.
3. Improve the target policy π: set π(s) = argmax_a Q(s,a). The target policy is GREEDY; no exploration needed!
Unlike on-policy ε-greedy MC, off-policy MC can learn the truly optimal policy! The target policy can be completely greedy because exploration is handled separately by the behavior policy.
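A sketch of this control loop using the incremental weighted-importance-sampling update; the `generate_episode` and `behavior_prob` interfaces are assumptions for illustration, not a fixed API:

```python
from collections import defaultdict

def off_policy_mc_control(generate_episode, behavior_prob, actions, num_episodes, gamma=1.0):
    """Off-policy MC control with weighted importance sampling (sketch).

    Assumed (hypothetical) interface:
      generate_episode()  -> list of (state, action, reward) triples, produced by
                             following some soft behavior policy b
      behavior_prob(s, a) -> b(a|s), the probability b assigned to the action it took
    """
    Q = defaultdict(float)
    C = defaultdict(float)               # cumulative importance-sampling weights
    target = {}                          # greedy target policy: state -> best action

    for _ in range(num_episodes):
        episode = generate_episode()
        G, W = 0.0, 1.0                  # return and importance weight, built backwards
        for s, a, r in reversed(episode):
            G = r + gamma * G
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])   # weighted-IS incremental update
            target[s] = max(actions, key=lambda act: Q[(s, act)])
            if a != target[s]:
                break                    # target is greedy, so π(a|s) = 0 and the ratio becomes 0
            W *= 1.0 / behavior_prob(s, a)   # π(a|s) = 1 for the greedy action
    return Q, target
```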
Challenges with Off-policy MC
High Variance
Importance ratios can be huge for long episodes! If π and b differ significantly, ρ can explode (or vanish), making estimates unstable.
Slow Learning
MC must wait for episodes to complete before updating. Long episodes mean slow learning, especially with high-variance importance ratios.
Spoiler: TD Methods Fix These Issues!
In Chapter 6, you'll learn Temporal-Difference (TD) learning. TD combines MC's model-free advantage with DP's bootstrapping:
- ✓ Updates after every step, not just at episode end
- ✓ No need to wait for complete episodes
- ✓ Can work with continuing (non-episodic) tasks
- ✓ Importance sampling only for one step → much lower variance
Summary
Key takeaways from Chapter 5.
Quick Reference Glossary
Monte Carlo
Learning from complete episodes by averaging actual returns (no model needed)
First-Visit MC
Only count the first return for each state per episode
Every-Visit MC
Count returns from all visits to each state
ε-Greedy Policy
Pick best action with prob 1-ε, random with prob ε
On-Policy
Learn about the same policy you're following
Off-Policy
Learn about one policy while following another
Behavior Policy (b)
The policy we follow to collect data (exploratory)
Target Policy (π)
The policy we want to learn about (can be optimal)
Importance Sampling
Reweighting returns to correct for policy mismatch
Exploring Starts
Every state-action pair has nonzero starting probability
Formula Cheat Sheet
ε-Greedy Policy: π(a|s) = 1 − ε + ε/|A| if a = argmax_a Q(s,a), else ε/|A|
Importance Sampling Ratio: ρ_{t:T−1} = ∏_{k=t}^{T−1} π(A_k|S_k) / b(A_k|S_k)
Ordinary IS Estimate: V(s) = Σ ρ_t G_t / N(s)  (sum over visits to s, divided by the number of visits)
Weighted IS Estimate: V(s) = Σ ρ_t G_t / Σ ρ_t  (sum over visits to s, divided by the sum of the ratios)
Common Mistakes to Avoid
"MC can be used for continuing tasks"
Correct: MC requires episodes to END. For tasks without natural endings, use TD methods (Chapter 6).
"On-policy ε-greedy MC finds the optimal policy"
Correct: It finds the best ε-soft policy. For the TRUE optimal, you need off-policy methods.
"MC and DP give the same results immediately"
Correct: MC converges to the same values, but needs many samples. DP is exact (with a model), MC learns from experience.
Key Takeaways from This Chapter
🎲 Monte Carlo = Learning from experience — no model of the environment needed!
📊 Core idea: Average the returns you actually observe to estimate values
🔄 MC Control: Same GPI pattern as DP (evaluate → improve → repeat)
⚖️ ε-greedy: Balance exploration (trying new things) and exploitation (using what you know)
🎯 Off-policy: Separate the exploration policy from the policy you're learning about
📐 Importance sampling: Corrects returns when learning off-policy
Monte Carlo methods learn from complete episodes without a model
MC prediction averages sample returns to estimate value functions
First-visit and every-visit MC both converge to the true values
MC control requires exploration: exploring starts or soft policies
On-policy methods learn about the policy being followed
Off-policy methods learn about one policy using data from another
Importance sampling corrects for the difference between policies
Weighted IS has lower variance but is biased; ordinary IS is unbiased
The Main Limitation: Must Wait for Episodes to End
MC methods have one big drawback: we can't update values until the episode finishes.
Looking Ahead: Chapter 6
In the next chapter, we'll learn Temporal-Difference (TD) learning — the brilliant combination of MC's model-free approach with DP's bootstrapping. TD methods update after every step, not just at episode end, making them much more practical for real-world applications!