Frontiers
The Future of Reinforcement Learning
In this final chapter, we explore topics beyond the book's scope but crucial for RL's future: general value functions, temporal abstraction, partial observability, reward design challenges, and RL's role in artificial intelligence.
Looking Ahead
This chapter touches on topics that take us beyond what is reliably known, some of which take us beyond the MDP framework itself. These are the frontiers where active research is pushing the boundaries of what reinforcement learning can achieve.
- Abstraction
- Partial Observability
- Real-World Safety
General Value Functions and Auxiliary Tasks
Throughout this book, our notion of value function has become quite general. With off-policy learning we allowed value functions conditional on arbitrary target policies. With termination functions, we generalized discounting. The final step is to generalize beyond rewards to permit predictions about arbitrary signals.
General Value Functions (GVFs)
Rather than predicting the sum of future rewards, we might predict the sum of future values of any signal—a sound, a color sensation, or even another prediction. Whatever signal is accumulated, we call it the cumulant.
General Value Function Definition
$v_{\pi,\gamma,C}(s) \doteq \mathbb{E}\Big[\sum_{k=t}^{\infty}\Big(\prod_{i=t+1}^{k}\gamma(S_i)\Big)\,C_{k+1} \;\Big|\; S_t = s,\ A_{t:\infty} \sim \pi\Big]$
$\pi$ — the policy being followed
$\gamma$ — state-dependent termination function
$C$ — the cumulant signal (replaces reward)
The cumulant is any signal that can be accumulated in a value-function-like prediction. It might be a sensory signal, an internal prediction, or any other scalar quantity the agent can observe.
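To make this concrete, here is a minimal sketch of a tabular TD(0)-style update for a single GVF, in which the cumulant and the state-dependent termination value take the places of the reward and the discount. The function and variable names are illustrative, not from the text.

```python
import numpy as np

def gvf_td0_update(V, s, s_next, cumulant, termination, alpha=0.1):
    """One tabular TD(0)-style update toward a GVF prediction.

    V           : 1-D array of predicted cumulant returns, indexed by state
    cumulant    : C_{t+1}, the accumulated signal (replaces the reward)
    termination : gamma(S_{t+1}), the state-dependent termination value
    """
    target = cumulant + termination * V[s_next]  # bootstrapped GVF target
    V[s] += alpha * (target - V[s])              # move the prediction toward it
    return V

# usage sketch: V = np.zeros(num_states); gvf_td0_update(V, s, s_next, C, gamma_next)
```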
Why Predict Many Signals?
Auxiliary tasks are predictions/controls of signals beyond the main reward. They can help in several ways:
Shared Representations
Auxiliary tasks may require some of the same representations as the main task. If good features are found on easier auxiliary tasks, they may significantly speed learning on the main task. Learning to predict short-term sensor changes might help discover the concept of objects.
Multi-Headed Networks
Neural networks can have multiple "heads"—one for the main value function, others for auxiliary tasks. All heads propagate errors into a shared body, which learns representations useful for all tasks.
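As a sketch of this idea (assuming PyTorch, which the chapter does not specify), a network with a shared body, a main value head, and several auxiliary heads might look like the following; all names are illustrative.

```python
import torch.nn as nn

class MultiHeadValueNet(nn.Module):
    """A shared body with a main value head and several auxiliary heads."""

    def __init__(self, obs_dim, hidden_dim, num_aux_tasks):
        super().__init__()
        self.body = nn.Sequential(                    # shared representation
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.main_head = nn.Linear(hidden_dim, 1)     # main value function
        self.aux_heads = nn.ModuleList(               # one head per auxiliary task
            [nn.Linear(hidden_dim, 1) for _ in range(num_aux_tasks)]
        )

    def forward(self, obs):
        features = self.body(obs)
        return self.main_head(features), [head(features) for head in self.aux_heads]
```

Summing the prediction losses of all heads and backpropagating drives every task's errors into the shared body, which is how the auxiliary tasks shape the representation used by the main value function.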
Pavlovian Control
Like classical conditioning, predictions can trigger built-in reflexes. A robot that predicts running out of battery could reflexively return to the charger. The prediction is learned; the response is built in.
The ability to predict and control diverse signals can constitute a powerful kind of environmental model. Many GVFs together can capture how the world works in ways that support efficient planning and reward maximization.
Temporal Abstraction via Options
The MDP formalism can be applied to tasks at many different time scales—from muscle twitches to career decisions. But can we formalize all levels in a single MDP? Options provide a way to do exactly this.
The Options Framework
An option is a generalized notion of action that extends over multiple time steps. It consists of:
Policy $\pi_\omega$: how to select actions while the option executes
Termination function $\gamma_\omega$: state-dependent; the option terminates in state $s$ with probability $1 - \gamma_\omega(s)$
Formal Definition
An option $\omega = \langle \pi_\omega, \gamma_\omega \rangle$ executes as follows:
1. At time $t$, select action $A_t \sim \pi_\omega(\cdot \mid S_t)$
2. At time $t{+}1$, terminate with probability $1 - \gamma_\omega(S_{t+1})$
3. If not terminated, continue from step 1 with the new state $S_{t+1}$
Primitive actions are special cases of options. Each action a corresponds to an option whose policy always picks a and whose termination function is zero (always terminates after one step). Options effectively extend the action space.
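The execution steps above translate directly into a loop. The sketch below assumes a hypothetical environment with an env.step(action) method and an option object exposing policy(state) and gamma(state); these interfaces are assumptions for illustration.

```python
import random

def run_option(env, state, option, rng=None, discount=0.99):
    """Execute one option to termination, following the steps above (a sketch).

    Assumed interfaces: env.step(action) -> (next_state, reward, done),
    option.policy(state) -> action, option.gamma(state) -> continuation
    probability in [0, 1] (the termination function).
    """
    rng = rng or random.Random()
    option_return, running_discount = 0.0, 1.0
    while True:
        action = option.policy(state)             # select A_t from pi_omega
        state, reward, done = env.step(action)    # hypothetical environment API
        option_return += running_discount * reward
        running_discount *= discount
        # terminate with probability 1 - gamma_omega(S_{t+1})
        if done or rng.random() >= option.gamma(state):
            return state, option_return
```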
Bellman Equations with Options
With options, we can define option-value functions and hierarchical policies. The Bellman optimality equation generalizes to:
$v_*(s) = \max_{\omega \in \Omega(s)} \Big[ r(s,\omega) + \sum_{s'} p(s' \mid s,\omega)\, v_*(s') \Big]$
where $\Omega(s)$ is the set of options available in state $s$, and $r(s,\omega)$ and $p(s' \mid s,\omega)$ are the option models.
Option Models
Reward Model
Expected cumulative discounted reward during option execution:
$r(s,\omega) \doteq \mathbb{E}\big[ R_1 + \gamma R_2 + \cdots + \gamma^{\tau-1} R_\tau \mid S_0 = s,\ \omega \text{ runs until it terminates at time } \tau \big]$
Transition Model
Discounted probability of terminating in each state:
$p(s' \mid s,\omega) \doteq \sum_{k=1}^{\infty} \gamma^{k}\, \Pr\{\omega \text{ terminates in } s' \text{ after } k \text{ steps} \mid S_0 = s\}$
Planning with options can be much faster because: (1) fewer options need to be considered than primitive actions; (2) each option can jump over many time steps. The tradeoff is that the best hierarchical policy may be suboptimal compared to primitive-action policies.
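Because the transition model already folds the discounting into $p(s' \mid s,\omega)$, planning with options amounts to value iteration over the option models. A minimal sketch follows; the nested-array data layout is my assumption, not the text's.

```python
import numpy as np

def option_value_iteration(reward_model, transition_model, num_states, num_sweeps=100):
    """Value iteration over option models (a sketch with an assumed data layout).

    reward_model[s][w]     : r(s, w), expected discounted reward while w runs
    transition_model[s][w] : vector of p(s' | s, w), the discounted
                             probabilities of terminating in each state s'
    """
    V = np.zeros(num_states)
    for _ in range(num_sweeps):
        for s in range(num_states):
            V[s] = max(
                reward_model[s][w] + np.dot(transition_model[s][w], V)
                for w in range(len(reward_model[s]))
            )
    return V
```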
Observations and State
Throughout this book, we've written value functions as functions of the environment's state. But in reality, agents only receive partial observations—some aspects of the world are hidden. This section explores how to handle partial observability.
From States to Observations
In partial observability, the environment emits observations rather than states. The agent sees the stream
$A_0, O_1, A_1, O_2, A_2, O_3, \ldots$
where $O_t$ is an observation that depends on, but does not fully reveal, the underlying state.
History and State
The history $H_t \doteq A_0, O_1, A_1, O_2, \ldots, A_{t-1}, O_t$ is everything the agent has seen. A state is a compact summary of the history: $S_t = f(H_t)$.
The Markov Property
A state function $f$ is Markov if it retains all the information necessary to predict any future sequence. Formally:
$f(h) = f(h') \;\Longrightarrow\; \Pr\{\tau \mid h\} = \Pr\{\tau \mid h'\}$
for all histories $h$, $h'$ and any test $\tau$ (a future action-observation sequence).
State-Update Functions
We want a compact, incrementally updatable state. The state-update function $u$ computes the next state from the current state and the new data:
$S_{t+1} = u(S_t, A_t, O_{t+1})$
Agent Architecture with State Update
The state-update function is central to any agent handling partial observability:
1. Observation $O_{t+1}$ arrives from the environment
2. State update: $S_{t+1} = u(S_t, A_t, O_{t+1})$
3. Policy selects an action: $A_{t+1} \sim \pi(\cdot \mid S_{t+1})$
4. Model and planner use $S_{t+1}$ for planning (see the sketch below)
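Here is a minimal sketch of that loop with an explicit state-update function $u$; the environment interface (env.step returning the next observation) and the initial state and action are assumptions for illustration.

```python
def agent_loop(env, u, policy, initial_state, initial_action, num_steps):
    """The agent-environment loop around an explicit state-update function u.

    Assumed interfaces: env.step(action) -> next observation,
    u(state, action, observation) -> next agent state, policy(state) -> action.
    """
    state, action = initial_state, initial_action
    for _ in range(num_steps):
        observation = env.step(action)          # 1. observation O_{t+1} arrives
        state = u(state, action, observation)   # 2. state update
        action = policy(state)                  # 3. policy selects A_{t+1}
        # 4. a learned model and planner could also be queried with `state` here
    return state
```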
Approaches to Partial Observability
POMDPs (Belief States)
Assume a latent state exists. The "belief state" is a distribution over possible latent states, $b_t(s) \doteq \Pr\{S_t = s \mid H_t\}$. It is updated via Bayes' rule, but the update scales poorly with the number of latent states.
PSRs (Predictive State Representations)
Define state as predictions about future observations rather than beliefs about latent states. The semantics are grounded in observable data, making learning potentially easier.
k-th Order History
Simple but often effective: use the last $k$ observations and actions as the state, $S_t = (A_{t-k}, O_{t-k+1}, \ldots, A_{t-1}, O_t)$. Not Markov in general, but it often works well in practice.
If a state function is incrementally updatable, it's Markov if and only if all one-step predictions can be accurately made. Accurate one-step predictions can be iterated to predict any longer sequence—though errors compound!
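For example, the k-th order history above is itself one very simple, incrementally updatable state-update function $u$: append the newest action-observation pair and drop the oldest. A sketch (the factory-function packaging is my own, not the text's):

```python
from collections import deque

def make_kth_order_state_update(k):
    """A k-th order history as one concrete state-update function u."""
    def u(state, action, observation):
        window = deque(state, maxlen=k)        # copy the old window, bounded at k
        window.append((action, observation))   # add the newest experience
        return tuple(window)                   # fixed-size, hashable agent state
    return u

# usage sketch: u = make_kth_order_state_update(4); state = u(state, action, observation)
```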
Designing Reward Signals
Designing a reward signal is a critical part of any RL application. The reward signal frames the designer's goal and assesses progress toward it. Unlike supervised learning, we don't need to know the correct actions—but the success of RL strongly depends on reward design.
The Challenge
RL agents can discover unexpected ways to obtain reward—some desirable, others dangerous. When we specify goals indirectly via rewards, we don't know how closely the agent will fulfill our true desires until learning is complete. This is the classic "reward hacking" problem.
The Sparse Reward Problem
Even with a simple goal, delivering reward frequently enough for learning can be challenging. State-action pairs that deserve reward may be rare, causing the agent to wander aimlessly (the "plateau problem").
Value Function Initialization
Rather than adding supplemental rewards (which can mislead), initialize the value function with a good guess:
$\hat{v}(s, \mathbf{w}_0) \approx v_0(s)$ for all $s$, where $v_0$ encodes prior knowledge about state values.
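With linear function approximation, one way to realize this is to fit the initial weights to the prior guess by least squares. A minimal sketch, assuming per-state feature vectors and prior values are available as arrays:

```python
import numpy as np

def initialize_value_weights(features, prior_values, ridge=1e-6):
    """Fit initial weights so that v_hat(s, w_0) is close to v_0(s) (a sketch).

    features[s]     : feature vector x(s) for state s
    prior_values[s] : the designer's prior guess v_0(s)
    """
    X = np.asarray(features, dtype=float)
    v0 = np.asarray(prior_values, dtype=float)
    # regularized least-squares fit of w_0 to the prior guesses
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ v0)
```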
Reward Shaping
The psychological technique: gradually modify the task as learning proceeds. Start with easy problems, then increase difficulty. Each stage makes the next feasible because the agent now encounters reward more frequently.
Learning from Experts
Imitation Learning
Learn directly from expert demonstrations via supervised learning. Simple but limited to matching the expert's performance.
Inverse RL
Infer the expert's reward signal from their behavior, then optimize it. Can potentially exceed the expert, but requires strong assumptions.
Counterintuitively, an agent's goal shouldn't always match the designer's goal! Due to constraints (limited computation, limited time), optimizing a different proxy goal can sometimes get closer to the true goal than pursuing it directly. Evolution gave us taste preferences rather than direct nutritional assessment for this reason.
Intrinsic Motivation
Reward signals can depend on internal factors—memories, motivational states, or even how much learning progress is being made. This enables intrinsically-motivated RL: learning for curiosity, not just external reward.
Remaining Issues
Suppose we grant everything in this book and this chapter. What remains? Here we highlight six further issues that future research must address.
Online Deep Learning
Current deep learning methods struggle with incremental, online settings. "Catastrophic interference" causes new learning to overwrite old knowledge. Techniques like replay buffers work around this rather than solving it.
Representation Learning
How can we use experience to learn inductive biases such that future learning generalizes better? This "meta-learning" problem dates to the 1950s-60s and remains unsolved. Learning the state-update function is part of this challenge.
Planning with Learned Models
Planning works great when models are given (chess, Go), but learning models from data and using them for planning is rare. Model learning must be selective—focus on key consequences of important options, not everything.
Automatic Task Design
Currently humans design the tasks agents learn. We want agents to choose their own tasks—subtasks, auxiliary predictions, building blocks for future problems. Tasks should be explicitly represented and searchable.
Computational Curiosity
When external reward is scarce, agents could maximize learning progress as an intrinsic reward—implementing curiosity. This enables something like "play": learning skills that might help with future unknown tasks.
Safety
Methods to make it acceptably safe to embed RL agents in physical environments. This is one of the most pressing areas for future research, discussed further in the next section.
If GVF design is automated, tasks become explicitly represented questions in the machine. Tasks could be built hierarchically like features in a neural network. The tasks are the questions; the network contents are the answers.
RL and the Future of AI
Since the first edition of this book in the 1990s, AI has transformed from promise to applications changing millions of lives. Deep reinforcement learning has produced some of the most remarkable developments. But true AI—complete, interactive agents with general adaptability—remains a distant goal.
RL's Broader Impact
Understanding the Mind
RL connections to psychology and neuroscience shed light on how the mind emerges from the brain.
Human Decision Support
RL policies can advise decision makers in education, healthcare, transportation, energy, and public policy.
Long-term Consequences
RL's key feature—considering long-term consequences—is crucial for high-stakes decisions affecting our planet.
Prometheus and Pandora
Herbert Simon reminded us of the eternal conflict between the promise and perils of new knowledge—the myths of Prometheus (who stole fire for humanity) and Pandora (whose box released perils on the world). We are designers of our future, not mere spectators.
Safety Considerations
RL agents can find unexpected ways to obtain reward. Goethe's "Sorcerer's Apprentice" and the "Monkey's Paw" that Norbert Wiener warned about both caution that optimization gives you what you asked for, not what you intended.
- Careful reward design is essential for real-world deployment
- Risk management techniques from control engineering can help
- Ensuring that the agent's goal is attuned to our own remains a challenge
Simulation vs Real World
Simulation Benefits
- Safe exploration without real damage
- Unlimited data at low cost
- Faster-than-real-time learning
- Perfect reproducibility
Real World Necessity
- Simulations often lack fidelity
- Human behavior is hard to simulate
- Full potential requires real embedding
- RL is designed for online learning
How do you ensure an agent gets enough experience to learn a good policy while not harming its environment? This is similar to challenges control engineers have faced for decades—careful modeling, validation, testing, and theoretical guarantees about stability and convergence.
Looking Forward
As RL moves into the real world, developers must follow best practices from related technologies while extending them. The benefits of AI can outweigh its disruption, but we have an obligation to ensure Prometheus keeps the upper hand.
A Call to Action
"As designers of our future and not mere spectators, the decisions we make can tilt the scale in Prometheus' favor." — Herbert Simon
RL can help improve the quality, fairness, and sustainability of life on our planet. With careful development, the promise of reinforcement learning can be realized while managing its perils.
Chapter Summary
This final chapter has explored the frontiers of reinforcement learning—topics that push beyond the MDP framework and into active research areas:
- GVFs generalize value functions to predict arbitrary signals
- Options enable planning at multiple time scales
- State-update functions handle partial observability
- Reward design is critical and challenging
- Open problems include online deep learning, representation learning, and safety
- RL's future will transform AI but requires careful development
The Journey Continues
This book has presented the foundations of a reinforcement learning approach to artificial intelligence. The focus has been on model-free and model-based methods working together, combined with function approximation, applied in off-policy training situations.
Much remains to be discovered. The problems highlighted here—online learning, representation learning, planning with learned models, automatic task design, curiosity-driven learning, and safety—are where the next breakthroughs will come. Reinforcement learning will be a critical component of agents with general adaptability, creativity, and the ability to learn quickly from experience.