Chapter 16

Applications and Case Studies

RL Success Stories

From Tesauro's TD-Gammon in 1992 to DeepMind's AlphaGo in 2016, reinforcement learning has achieved remarkable successes. This chapter showcases landmark applications that demonstrate the power of RL across games, robotics, and real-world systems.

Milestones in RL History

This chapter presents some of the most celebrated applications of reinforcement learning—systems that have achieved superhuman performance in complex domains and demonstrated that RL can solve problems once thought intractable. Each case study illustrates different aspects of RL methodology.

The case studies span four domains:

  • Board Games
  • Hardware
  • Web Systems
  • Robotics

Section 16.1

TD-Gammon

TD-Gammon, created by Gerald Tesauro at IBM in the early 1990s, stands as one of the most influential demonstrations of reinforcement learning. It achieved world-class backgammon play through self-play, using temporal-difference learning with a neural network function approximator—years before "deep learning" was a household term.

Why Backgammon?

Backgammon is an ideal testbed for RL because it combines:

  • Stochasticity: Dice rolls add randomness, preventing pure memorization
  • Large state space: ~10^20 possible positions (too large for tabular methods)
  • Clear evaluation: Win/loss outcomes provide clean training signals
  • Human experts: World champions provide benchmarks for evaluation

Architecture

TD-Gammon used a neural network with a simple architecture by modern standards:

TD-Gammon Neural Network

Input Layer (~198 units)

Raw encoding of board position—number of checkers at each point

Hidden Layer (40-80 units)

Single hidden layer with sigmoid activation—remarkably small!

Output (4 units)

Estimated probabilities of the four game outcomes: a regular win or a gammon, for each player (backgammon wins were rare enough to be ignored)

Learning Algorithm

TD-Gammon used TD(λ) with eligibility traces, updating after every move:

w \leftarrow w + \alpha \left[ V(S_{t+1}) - V(S_t) \right] \sum_{k=1}^{t} \lambda^{t-k} \nabla_w V(S_k)
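
To make the update concrete, here is a minimal sketch of TD(λ) with accumulating eligibility traces, written for a linear value function rather than TD-Gammon's neural network; the function names and step-size values are illustrative, not Tesauro's.

    import numpy as np

    def td_lambda_game(w, encoded_positions, outcome, alpha=0.1, lam=0.7):
        """Apply TD(lambda) updates for one self-play game (linear V(s) = w . x(s)).

        encoded_positions : feature vectors x(S_0), ..., x(S_T) for the positions visited
        outcome           : final result of the game, e.g. 1.0 for a win, 0.0 for a loss
        """
        z = np.zeros_like(w)                          # eligibility trace vector
        for t in range(len(encoded_positions) - 1):
            x_t = encoded_positions[t]
            v_t = w @ x_t
            # The "next value" is the network's own estimate, except at the end of
            # the game, where it is the actual outcome (the only reward signal).
            if t == len(encoded_positions) - 2:
                v_next = outcome
            else:
                v_next = w @ encoded_positions[t + 1]
            delta = v_next - v_t                      # TD error
            z = lam * z + x_t                         # decay old credit, add new gradient
            w = w + alpha * delta * z                 # update all recently visited positions
        return w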

The key innovations:

  • Self-play: The network played against itself, generating training data
  • No human knowledge: Later versions used only raw board representation
  • Continuous learning: Weights updated after every move, not just game end
  • Afterstates: Evaluated positions after player's move but before dice roll

The Power of Self-Play

TD-Gammon discovered something remarkable: through pure self-play with TD learning, it independently developed strategies that matched and even surpassed human expert knowledge. Some of its opening moves were initially criticized by experts but later adopted as superior!

Results and Impact

Performance

TD-Gammon 2.1 achieved play at the level of the world's best human players. It played matches against world champions and competed respectably.

Legacy

TD-Gammon proved that RL + neural networks could master complex games through self-play—the same approach later used by AlphaGo and AlphaZero.

Lessons from TD-Gammon

TD-Gammon showed that: (1) self-play can generate expert-level knowledge; (2) TD learning works with nonlinear function approximation despite theoretical concerns about convergence; (3) simple architectures can achieve remarkable results with the right learning algorithm.

Section 16.2

Samuel's Checkers Player

Arthur Samuel's checkers-playing program, developed at IBM in the 1950s and 1960s, was a pioneering achievement in machine learning—indeed, Samuel coined the term "machine learning" itself! His work foreshadowed many concepts that became central to reinforcement learning decades later.

Historical Context

Samuel began this work in 1952, running on an IBM 701—a computer with only about 10,000 words of memory! His program eventually achieved strong amateur level play and defeated a state champion in 1962. This was decades before modern computing power made game-playing AI seem feasible.

Key Innovations

Linear Evaluation Function

Samuel used a weighted linear combination of features to evaluate positions:

V(s) = w_1 \cdot f_1(s) + w_2 \cdot f_2(s) + \cdots + w_n \cdot f_n(s)

Features included piece counts, advancement, control of center, mobility, etc.
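
In code, such an evaluation is just a weighted sum over feature functions. The toy position encoding and the two features below are invented for illustration; Samuel's program used dozens of hand-designed terms.

    def evaluate(position, weights, feature_fns):
        """Linear evaluation: V(s) = w1*f1(s) + w2*f2(s) + ... + wn*fn(s)."""
        return sum(w * f(position) for w, f in zip(weights, feature_fns))

    # Toy usage: a position encoded as a string, and two invented features.
    features = [lambda pos: pos.count("b") - pos.count("w"),     # piece advantage
                lambda pos: pos.count("B") - pos.count("W")]     # king advantage
    print(evaluate("bbBww", weights=[1.0, 2.5], feature_fns=features))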

Temporal-Difference Ideas

Samuel's learning method was essentially TD learning before the theory existed! He updated evaluations of earlier positions toward the evaluations of later positions reached during play—backing up values through the game tree.

Minimax Search

The program combined learned evaluation with tree search (minimax with alpha-beta pruning), looking several moves ahead. The evaluation function scored leaf nodes, and the search propagated these to choose moves.

Learning from Self-Play

Like TD-Gammon decades later, Samuel used self-play to generate training data:

  • The program played against a copy of itself
  • The copy used fixed weights (a "target network" before its time!)
  • After sufficient games, the copy's weights were updated to match the learner
  • This prevented oscillation and instability in learning (see the sketch below)
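
A minimal sketch of that training scheme, with the game-playing and weight-update routines left as caller-supplied placeholders:

    import copy

    def self_play_training(learner_weights, play_game, update, n_games=10_000, sync_every=500):
        """Learner vs. a frozen copy of itself, with periodic weight synchronization.

        play_game(learner_w, opponent_w) -> record of one game (caller-supplied)
        update(learner_w, record)        -> improved learner weights (caller-supplied)
        """
        opponent_weights = copy.deepcopy(learner_weights)        # the fixed copy
        for game in range(1, n_games + 1):
            record = play_game(learner_weights, opponent_weights)
            learner_weights = update(learner_weights, record)    # only the learner adapts
            if game % sync_every == 0:
                # Promote the learner's current weights to become the new opponent.
                opponent_weights = copy.deepcopy(learner_weights)
        return learner_weights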

Samuel's Signature Table

Samuel also used a form of experience replay—he stored evaluations of previously seen positions in a "signature table" (essentially a hash table). When a position was encountered again, the stored evaluation could be used directly, dramatically speeding up search.

Why It Matters

Samuel's work anticipated many modern RL techniques:

  • Temporal-difference learning (before Sutton formalized it)
  • Self-play (rediscovered by TD-Gammon and AlphaGo)
  • Target networks (used in DQN decades later)
  • Experience replay (his signature table)

A Pioneer's Legacy

Samuel showed that machines could improve through experience—learning patterns that their programmers didn't explicitly encode. His work laid conceptual groundwork for modern RL, even if computational limitations prevented achieving today's results.

Section 16.3

Watson's Daily-Double Wagering

IBM Watson famously defeated human champions at Jeopardy! in 2011. While most attention focused on its natural language processing, Watson also used reinforcement learning for a critical decision: how much to wager on Daily Double questions.

The Wagering Problem

Daily Doubles are hidden clues on which a player can wager any amount up to their current total (or up to the round maximum of $1,000 or $2,000 if they have less). The optimal wager depends on: confidence in answering correctly, current scores, game stage, and opponent tendencies. This is a classic sequential decision problem—perfect for RL!

Why RL for Wagering?

The wagering decision requires reasoning about:

Long-term Consequences

A wager's value depends on future game dynamics—how it affects probability of winning the entire game, not just that question.

Opponent Modeling

Opponents' likely responses to different score scenarios must be considered—this is a multi-agent problem.

Uncertainty

Confidence in the answer, opponent behavior, and future questions are all uncertain—the agent must reason under uncertainty.

Watson's Approach

Watson used simulation-based RL to learn wagering strategy:

  • State: Current scores, game stage, remaining questions, confidence levels
  • Actions: Discrete wager amounts (e.g., increments of $1,000)
  • Reward: +1 for winning the game, 0 otherwise
  • Training: Monte Carlo simulations of complete games

Simulation-Based Optimization

Watson simulated thousands of games to evaluate different wagering strategies. This is similar to Monte Carlo Tree Search—using forward simulation to estimate long-term value of immediate decisions.
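
A stripped-down version of the idea: for each candidate wager, roll out many simulated games from the current state and keep the wager with the highest estimated win rate. The simulator and wager grid below are hypothetical stand-ins for Watson's far more detailed game model.

    def choose_wager(state, simulate_game, candidate_wagers, n_rollouts=1000):
        """Pick the wager whose simulated win rate is highest.

        simulate_game(state, wager) -> True if we win the simulated game, else False.
        The simulator is assumed to sample answer correctness, opponent behavior,
        and the rest of the board -- details Watson modeled far more carefully.
        """
        best_wager, best_win_rate = None, -1.0
        for wager in candidate_wagers:
            wins = sum(simulate_game(state, wager) for _ in range(n_rollouts))
            if wins / n_rollouts > best_win_rate:
                best_wager, best_win_rate = wager, wins / n_rollouts
        return best_wager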

Results

Watson's RL-based wagering contributed significantly to its victories:

In the televised matches against Ken Jennings and Brad Rutter, Watson made strategically sound wagers that maximized its winning probability—including surprisingly low wagers when it had commanding leads (to minimize risk) and appropriately aggressive wagers when behind.

This application demonstrates RL's value even as a component within larger AI systems—handling specific decisions where sequential optimization matters.

Section 16.4

Optimizing Memory Control

Reinforcement learning has found important applications in computer systems optimization, where the goal is to maximize performance through intelligent resource management. One compelling example is DRAM memory scheduling.

The Memory Scheduling Problem

Modern DRAM (Dynamic Random-Access Memory) is organized into banks, rows, and columns. Accessing data requires specific sequences of commands, and some access patterns are much faster than others. The memory controller must decide which pending requests to service—a sequential decision problem!

Why This Matters

Memory performance is critical because:

  • The CPU often waits for memory—it's a major bottleneck
  • Row hits (accessing open rows) are ~3x faster than row conflicts
  • Multi-core systems create complex access patterns
  • Traditional schedulers rely on fixed heuristics and cannot adapt to these patterns

RL Formulation

MDP Setup

  • State: Current DRAM state (open rows, queue contents, timing constraints)
  • Actions: Which pending request to schedule next
  • Reward: Negative latency (minimize time to complete requests)
  • Algorithm: Sarsa with function approximation

Approach: Self-Optimizing Memory Controller

Researchers developed memory controllers that learn online:

Feature-Based State Representation

Hand-crafted features capture relevant state information: queue lengths, row buffer status, bank conflicts, fairness metrics, etc.

Linear Function Approximation

Simple linear models keep hardware implementation practical—RL must run at memory controller speed (nanoseconds per decision)!
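
To make this concrete, here is a schematic Sarsa step with a linear action-value function over scheduler features; the feature vectors and reward are placeholders, and a real controller would implement this arithmetic in hardware rather than Python.

    import numpy as np

    def sarsa_update(w, phi_sa, reward, phi_next_sa, alpha=0.05, gamma=0.95):
        """One Sarsa step for a linear action-value function Q(s, a) = w . phi(s, a).

        phi_sa      : features of the current state and the request just scheduled
        phi_next_sa : features of the next state and the request chosen there
        reward      : e.g. negative latency of the serviced request
        """
        td_error = reward + gamma * (w @ phi_next_sa) - (w @ phi_sa)
        return w + alpha * td_error * phi_sa

    def pick_request(w, pending_feature_vectors):
        """Greedily choose the pending request whose features score highest under Q."""
        return int(np.argmax([w @ phi for phi in pending_feature_vectors]))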

Online Adaptation

The controller continuously learns and adapts to changing workloads— different applications have different access patterns.

Results

RL-based memory controllers demonstrated 15-25% performance improvements over traditional schedulers on multi-core workloads. More importantly, they automatically adapted to new workload patterns without manual tuning—showing how RL can optimize even low-level system components.

Broader Impact

This work inspired RL applications across computer systems: cache management, job scheduling, network routing, compiler optimization, and datacenter management. The key insight is that many systems problems are inherently sequential decision problems—exactly what RL is designed to solve.

Section 16.5

Human-level Video Game Play

In 2013-2015, DeepMind demonstrated that deep reinforcement learning could achieve human-level performance across dozens of Atari 2600 games, learning directly from raw pixels. This Deep Q-Network (DQN) breakthrough sparked the modern era of deep RL.

Why Atari Games?

Atari games provide an ideal benchmark for several reasons:

  • Visual input: Raw pixels force learning perception + control
  • Diverse challenges: 50+ games with different dynamics
  • Human baselines: Professional testers provide comparison
  • Deterministic emulation: Reproducible experiments

DQN Architecture

Deep Q-Network

Input: 84×84×4 grayscale frames

Last 4 frames stacked to capture motion

Convolutional layers (3 layers)

Extract visual features: edges, objects, spatial patterns

Fully connected layers (2 layers)

Combine features into action-value predictions

Output: Q(s,a) for each action

One value per possible joystick action (up to 18)
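
This architecture translates almost directly into a few lines of PyTorch. The sketch below follows the layer sizes reported for the Nature DQN (32, 64, and 64 convolutional filters, then a 512-unit hidden layer); treat it as an approximation rather than the exact original.

    import torch
    import torch.nn as nn

    class DQN(nn.Module):
        """Convolutional Q-network mapping stacked frames to one Q-value per action."""

        def __init__(self, n_actions):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 32x20x20
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x9x9
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x7x7
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                nn.Linear(512, n_actions),                              # Q(s, a) for each action
            )

        def forward(self, x):          # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
            return self.head(self.features(x))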

Key Innovations

DQN introduced two critical techniques that made deep RL stable:

Experience Replay

Store transitions (s, a, r, s') in a buffer and sample random mini-batches for training. This breaks correlations and reuses data efficiently.

Buffer size: ~1 million transitions

Target Network

Use a separate, slowly-updated network to compute TD targets. This prevents instability from chasing a moving target.

Updated every 10,000 steps
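
A replay buffer like the one described above can be as simple as a bounded deque with uniform random sampling; the sketch below is a minimal illustration of the idea, not DeepMind's implementation.

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-capacity store of (s, a, r, s', done) transitions with uniform sampling."""

        def __init__(self, capacity=1_000_000):
            self.buffer = deque(maxlen=capacity)      # oldest transitions are evicted first

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            batch = random.sample(self.buffer, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            return states, actions, rewards, next_states, dones

        def __len__(self):
            return len(self.buffer)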

Learning Algorithm

DQN minimizes the squared TD error:

L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]

where $\theta^-$ are the target network parameters and $D$ is the replay buffer.
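
Assuming the network sketch above, this loss is only a few lines of PyTorch: `online_net` and `target_net` are two copies of the Q-network, and the batch tensors come from the replay buffer (actions as int64, dones as 0/1 floats).

    import torch
    import torch.nn.functional as F

    def dqn_loss(online_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
        """Squared TD error against a frozen target network (minimal sketch)."""
        # Q(s, a; theta) for the actions actually taken.
        q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # max over a' of Q(s', a'; theta^-), computed with the slowly-updated target network.
            next_q = target_net(next_states).max(dim=1).values
            targets = rewards + gamma * (1.0 - dones) * next_q
        return F.mse_loss(q_sa, targets)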

End-to-End Learning

DQN learns "end-to-end"—from raw pixels to actions, the entire pipeline is differentiable. No hand-crafted features, no preprocessing besides frame stacking. The same architecture and hyperparameters worked across all 49 games!

Results

Performance Across Games

Superhuman: Breakout, Video Pinball, Boxing, and ~20 others

Human-level: Pong, Space Invaders, Enduro, and ~10 others

Struggles: Montezuma's Revenge, Pitfall (sparse rewards, exploration)

Impact and Extensions

DQN sparked an explosion of deep RL research:

  • Double DQN: Addresses overestimation bias
  • Dueling DQN: Separate value and advantage streams
  • Prioritized Replay: Sample important transitions more often
  • Rainbow: Combines all improvements for state-of-the-art

A New Era

DQN's 2015 Nature paper marked deep RL's arrival as a serious AI paradigm. It demonstrated that neural networks could learn complex behaviors from raw sensory input—opening the door to applications in robotics, autonomous vehicles, and beyond.

Section 16.6

Mastering the Game of Go

Go was long considered AI's "grand challenge"—a game so complex that traditional game-tree search couldn't crack it. DeepMind's AlphaGo (2016) and AlphaGo Zero (2017) not only solved this challenge but revolutionized our understanding of what self-play RL can achieve.

Why Go is Hard

Enormous complexity:

  • ~10^170 possible positions
  • ~250 legal moves per turn
  • Games last ~200 moves

Evaluation difficulty:

  • No simple material count
  • Position quality is subtle
  • Patterns span the entire board

AlphaGo (2016)

The original AlphaGo combined several techniques:

Policy Network

A deep CNN predicting move probabilities: $p(a|s)$. Initially trained on 30 million human expert moves via supervised learning, then improved through self-play with policy gradient.

Value Network

Another deep CNN predicting position values: $V(s)$. Trained on self-play games to predict the winner from any position.

Monte Carlo Tree Search (MCTS)

Combined policy (to guide exploration) and value (to evaluate leaves) networks with MCTS for actual gameplay. The networks make search tractable.

Historic Victory

In March 2016, AlphaGo defeated Lee Sedol, one of the world's best Go players, 4-1 in a highly publicized match. Move 37 in Game 2 became famous—a creative move that surprised experts and revealed AlphaGo had developed novel strategies.

AlphaGo Zero (2017)

AlphaGo Zero took the approach further with a stunning simplification:

Zero Human Knowledge

  • No human games: Learned purely from self-play
  • Simpler architecture: Single network for both policy and value
  • Simpler features: Just raw board position (stone locations)
  • Stronger performance: Defeated original AlphaGo 100-0!

Learning Algorithm

AlphaGo Zero uses an elegant loop:

  1. Self-Play with MCTS: The current network plays games against itself, using MCTS to select moves.
  2. Generate Training Data: Record (state, MCTS policy, game outcome) for each position.
  3. Train Network: Fit the policy head to the MCTS policy and the value head to the game outcome.
  4. Repeat: Use the improved network for the next round of self-play.
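
Schematically, one round of the loop might look like the following; the environment interface, MCTS routine, and training step are caller-supplied placeholders for components that are far more involved in the real system.

    def alphago_zero_round(network, env, run_mcts, train_step, sample_batches, n_games=1000):
        """One round of the self-play loop (schematic).

        env        : game interface with .initial(), .is_terminal(s), .play(s, move), .outcome(s)
        run_mcts   : (network, state) -> dict mapping moves to visit-count probabilities
        train_step : (network, batch) -> network with updated parameters
        All of these are caller-supplied placeholders for the real components.
        """
        replay = []
        # Steps 1-2: self-play with MCTS, recording (state, search policy, outcome).
        for _ in range(n_games):
            trajectory, state = [], env.initial()
            while not env.is_terminal(state):
                pi = run_mcts(network, state)          # search-improved policy
                move = max(pi, key=pi.get)             # (or sample from pi for diversity)
                trajectory.append((state, pi))
                state = env.play(state, move)
            z = env.outcome(state)                     # final result, e.g. +1 / -1
            replay.extend((s, pi, z) for s, pi in trajectory)
        # Step 3: fit the policy head to the MCTS policies and the value head to the outcomes.
        for batch in sample_batches(replay):
            network = train_step(network, batch)
        # Step 4: the improved network drives the next round of self-play.
        return network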

Learning Curve

AlphaGo Zero's improvement was remarkably fast:

  • 3 hours: Random moves → beginner-level play
  • 19 hours: Surpassed all previous Go programs
  • 21 hours: Matched AlphaGo Lee (the version that defeated Lee Sedol)
  • 40 days: Far surpassed all previous versions

Why Self-Play Works

Self-play creates a curriculum of increasingly challenging opponents. As the agent improves, so do its training partners—always providing appropriately difficult challenges. This bootstrapping process discovers knowledge that took humans thousands of years to develop, plus novel strategies humans never considered.

AlphaZero: Generalization

AlphaZero (2018) applied the same algorithm to chess and shogi, achieving superhuman performance in all three games with identical architecture—demonstrating the power of general RL algorithms.

  • Go: Defeated AlphaGo Zero
  • Chess: Defeated Stockfish
  • Shogi: Defeated Elmo

Creative Play

AlphaZero developed distinctive, "alien" playing styles in all three games. In chess, it often sacrifices material for long-term positional advantages that engines previously undervalued—strategies now studied by human grandmasters.

Section 16.7

Personalized Web Services

Reinforcement learning powers many of the personalization systems we interact with daily—from news feeds to product recommendations. These applications demonstrate RL's ability to learn from implicit user feedback at massive scale.

Personalization as RL

Web personalization is naturally a sequential decision problem: show content → observe user response → update understanding → show more content. The goal is to maximize long-term engagement while learning user preferences from noisy, implicit feedback.

Contextual Bandits

Many web personalization tasks can be framed as contextual bandits—a simplified RL setting where there's no state transition (each interaction is independent):

Contextual Bandit Setup

  • Context: User features, time of day, device, recent activity
  • Actions: Which items/articles/ads to show
  • Reward: Clicks, purchases, time spent, etc.
  • Challenge: Explore new content while exploiting known preferences
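
A minimal epsilon-greedy version of this loop is sketched below; the model's `predict` and `update` methods are hypothetical, and a production system would add logging, off-policy evaluation, and tighter exploration control.

    import random

    def serve_request(context, candidate_items, model, epsilon=0.05):
        """One epsilon-greedy contextual-bandit decision.

        model.predict(context, item) -> estimated reward, e.g. click probability
        (a hypothetical interface standing in for the deployed ranking model).
        """
        if random.random() < epsilon:
            return random.choice(candidate_items)                      # explore
        return max(candidate_items,
                   key=lambda item: model.predict(context, item))      # exploit

    def record_feedback(model, context, item, reward):
        """Feed the observed reward (click, purchase, dwell time) back to the model."""
        model.update(context, item, reward)                            # hypothetical API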

Applications

Product Recommendations

E-commerce sites like Amazon use RL to decide which products to recommend. The system balances showing items likely to be purchased vs. exploring to discover new user preferences.

News Feed Ranking

Social media platforms use RL to rank posts in personalized feeds. The challenge is optimizing for long-term engagement rather than just immediate clicks—avoiding filter bubbles and clickbait.

Ad Targeting

Online advertising uses RL to select which ads to show. The objective might be clicks, conversions, or customer lifetime value—requiring different RL formulations.

Lifetime Value Optimization

A key insight is that immediate rewards (clicks) may not align with long-term value (customer retention). Full RL formulations consider:

\text{LTV} = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}[\text{Revenue}_t \mid \text{Actions}]

This requires modeling user state transitions—how today's recommendations affect tomorrow's engagement. A full RL formulation, rather than a bandit one, is needed to capture these dynamics.

Exploration at Scale

At web scale, exploration is crucial but costly—showing suboptimal content loses money! Techniques like Thompson Sampling and Upper Confidence Bounds efficiently balance exploration and exploitation when serving millions of users.
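
For click-style (0/1) rewards, Thompson Sampling reduces to keeping a Beta posterior per item and showing the item with the best sampled rate. A minimal, non-contextual sketch:

    import random

    class ThompsonSampler:
        """Beta-Bernoulli Thompson Sampling over a fixed item set (non-contextual sketch)."""

        def __init__(self, items):
            # One Beta(successes + 1, failures + 1) posterior per item.
            self.successes = {item: 0 for item in items}
            self.failures = {item: 0 for item in items}

        def choose(self):
            # Sample a plausible click rate for each item and show the best sample.
            samples = {item: random.betavariate(self.successes[item] + 1,
                                                self.failures[item] + 1)
                       for item in self.successes}
            return max(samples, key=samples.get)

        def observe(self, item, clicked):
            if clicked:
                self.successes[item] += 1
            else:
                self.failures[item] += 1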

Challenges

  • Delayed feedback: Conversions may happen days later
  • Partial observability: Limited view of user intent
  • Non-stationarity: User preferences change over time
  • Fairness: Avoid amplifying biases or creating filter bubbles
  • Scale: Billions of users, millions of items

Real-World Impact

RL-based personalization has enormous commercial impact. Companies like Netflix, Spotify, YouTube, and Amazon rely on these techniques to improve user experience and drive engagement—making personalization one of RL's most practically significant applications.

Section 16.8

Thermal Soaring

Reinforcement learning enables autonomous gliders to exploit thermal columns—rising currents of warm air—to stay aloft without engines. This application demonstrates RL's ability to learn subtle physical dynamics and outperform hand-crafted controllers.

What is Thermal Soaring?

Thermal columns form when the sun heats the ground unevenly—warm air rises, creating updrafts. Birds and human glider pilots exploit these to gain altitude for free. Finding and staying within thermals requires sophisticated sensing and control—exactly what RL can learn.

The Challenge

Thermal soaring is difficult because:

Thermals are Invisible

The glider can't directly see thermals—it must infer their location from local vertical air speed (sink/climb rate). This is partial observability.

Thermals are Noisy

Real thermals are turbulent, irregular, and varying—not ideal cylinders. The controller must be robust to noisy, uncertain measurements.

Continuous Control

The glider has continuous control surfaces—bank angle, pitch, etc. This requires learning smooth policies, not discrete action selection.

RL Approach

MDP Formulation

  • State: Glider position, velocity, climb rate, recent history
  • Actions: Bank angle (turn left/right or straight)
  • Reward: Altitude gained (stay high as long as possible)
  • Algorithm: Sarsa with tile coding or a neural network
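
As a toy version of this formulation, the sketch below runs tabular Sarsa over a small discrete set of bank-angle changes; the environment stub, state discretization, and hyperparameters are assumptions, and the published work used richer state with tile coding or neural networks.

    import random
    from collections import defaultdict

    ACTIONS = [-10, 0, 10]      # change in bank angle, in degrees (illustrative discretization)

    def sarsa_soaring(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        """Tabular Sarsa on a discretized soaring task.

        Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done),
        with states as small hashable tuples (e.g. climb-rate trend, bank-angle bucket)
        and reward proportional to altitude gained.
        """
        q = defaultdict(float)

        def policy(state):
            if random.random() < epsilon:
                return random.choice(ACTIONS)
            return max(ACTIONS, key=lambda a: q[(state, a)])

        for _ in range(n_episodes):
            state = env.reset()
            action = policy(state)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                next_action = policy(next_state)
                target = reward if done else reward + gamma * q[(next_state, next_action)]
                q[(state, action)] += alpha * (target - q[(state, action)])
                state, action = next_state, next_action
        return q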

Learning the Strategy

RL agents trained in simulation discovered effective soaring strategies:

  • Center the thermal: When climb rate is strong, tighten the turn
  • Expand the search: When climb rate drops, widen the circle
  • Thermal switching: Leave weak thermals to search for stronger ones
  • Altitude-dependent behavior: More exploration when high, more exploitation when low

Matching Bird Behavior

The strategies learned by RL agents closely match observed behavior of soaring birds like hawks and vultures! This suggests that both biological and artificial learners converge on similar solutions to the same optimization problem.

Real-World Deployment

Researchers have tested RL soaring controllers on actual gliders:

Simulation Results

RL controllers doubled average flight time compared to simple circling strategies, and matched or exceeded expert human pilots in simulation.

Flight Tests

Policies trained in simulation transferred successfully to real gliders, demonstrating robust learning that generalizes across the sim-to-real gap.

Broader Implications

Thermal soaring exemplifies several important RL themes:

  • Learning from experience in complex, continuous environments
  • Handling partial observability through appropriate state representation
  • Sim-to-real transfer with robust policies
  • Matching (or exceeding) evolved biological solutions

Energy-Efficient Aviation

Autonomous thermal soaring has practical applications for surveillance, atmospheric research, and energy-efficient long-range flight. Gliders that can exploit thermals can stay aloft indefinitely, dramatically extending mission duration without fuel.

Chapter Summary

This chapter has surveyed landmark applications of reinforcement learning, from the pioneering work of Samuel and Tesauro to modern breakthroughs like AlphaGo and DQN. These case studies demonstrate that RL can:

  • Achieve superhuman performance in complex games through self-play
  • Learn directly from raw sensory input (pixels) without hand-crafted features
  • Optimize real-world systems from hardware to web services
  • Control physical systems like autonomous gliders
  • Discover novel strategies that surpass human knowledge

Key Lessons

Self-play is Powerful

From TD-Gammon to AlphaZero, self-play enables agents to bootstrap from random play to superhuman performance without human examples.

Deep RL Scales

DQN and AlphaGo showed that combining deep learning with RL enables handling high-dimensional inputs and complex value functions.

RL Works Beyond Games

Memory controllers, web services, and robotic control show that RL principles apply wherever sequential decisions optimize long-term outcomes.