Chapter 17

Frontiers

The Future of Reinforcement Learning

In this final chapter, we explore topics beyond the book's scope but crucial for RL's future: general value functions, temporal abstraction, partial observability, reward design challenges, and RL's role in artificial intelligence.

Looking Ahead

This chapter touches on topics that take us beyond what is reliably known, and some of them take us beyond the MDP framework itself. These are the frontiers where active research is pushing the boundaries of what reinforcement learning can achieve.

Abstraction

Partial Observability

Real-World Safety

Section 17.1

General Value Functions and Auxiliary Tasks

Throughout this book, our notion of value function has become quite general. With off-policy learning we allowed value functions conditional on arbitrary target policies. With termination functions, we generalized discounting. The final step is to generalize beyond rewards to permit predictions about arbitrary signals.

General Value Functions (GVFs)

Rather than predicting the sum of future rewards, we might predict the sum of future values of any signal—a sound, a color sensation, or even another prediction. Whatever signal is accumulated, we call it the cumulant.

General Value Function Definition

v_{\pi,\gamma,C}(s) \doteq \mathbb{E}\left[\sum_{k=t}^{\infty} \left(\prod_{i=t+1}^{k} \gamma(S_i)\right) C_{k+1} \,\bigg|\, S_t = s, A_{t:\infty} \sim \pi\right]

π — the policy being followed

γ — the state-dependent termination function

C_t — the cumulant signal (replaces reward)

Cumulant

The cumulant C_t ∈ ℝ is any signal that can be accumulated in a value-function-like prediction. It might be a sensory signal, an internal prediction, or any other scalar quantity the agent can observe.
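To make the definition concrete, here is a minimal sketch (not from the book) of a linear TD(0) update for a GVF, in which the designer supplies the cumulant and the termination value for the next state; all names and feature vectors are illustrative.

```python
import numpy as np

def gvf_td0_update(w, x, x_next, cumulant, gamma_next, alpha=0.1):
    """One linear TD(0) update for a general value function.

    w          -- weight vector of the linear predictor
    x, x_next  -- feature vectors for S_t and S_{t+1}
    cumulant   -- C_{t+1}, the signal being accumulated (replaces the reward)
    gamma_next -- gamma(S_{t+1}), the state-dependent termination value
    """
    delta = cumulant + gamma_next * np.dot(w, x_next) - np.dot(w, x)
    return w + alpha * delta * x

# Example: predict discounted future battery drain instead of reward.
w = np.zeros(8)
x, x_next = np.random.rand(8), np.random.rand(8)   # stand-in feature vectors
w = gvf_td0_update(w, x, x_next, cumulant=0.3, gamma_next=0.9)
```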

Why Predict Many Signals?

Auxiliary tasks are tasks of predicting or controlling signals other than the main reward. They can help the main task in several ways:

Shared Representations

Auxiliary tasks may require some of the same representations as the main task. If good features are found on easier auxiliary tasks, they may significantly speed learning on the main task. Learning to predict short-term sensor changes might help discover the concept of objects.

Multi-Headed Networks

Neural networks can have multiple "heads"—one for the main value function, others for auxiliary tasks. All heads propagate errors into a shared body, which learns representations useful for all tasks.
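A minimal PyTorch-style sketch of the multi-headed idea, assuming a small shared body and illustrative layer sizes; the class and its dimensions are not from the book.

```python
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    """One shared body, one head per prediction (main value plus auxiliary GVFs)."""

    def __init__(self, obs_dim, n_aux):
        super().__init__()
        # Shared body: learns the features used by every head.
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.value_head = nn.Linear(64, 1)                  # main value function
        self.aux_heads = nn.ModuleList(
            [nn.Linear(64, 1) for _ in range(n_aux)])       # auxiliary predictions

    def forward(self, obs):
        z = self.body(obs)
        return self.value_head(z), [head(z) for head in self.aux_heads]

# Losses from every head backpropagate into the shared body, so auxiliary
# errors shape the representation that the main task also uses.
net = MultiHeadNet(obs_dim=16, n_aux=3)
value, aux_predictions = net(torch.randn(1, 16))
```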

Pavlovian Control

Like classical conditioning, predictions can trigger built-in reflexes. A robot that predicts running out of battery could reflexively return to the charger. The prediction is learned; the response is built in.
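A toy sketch of this idea under stated assumptions: a learned predictor (hypothetical name `steps_to_empty`) supplies the prediction, and the reflex below is hard-wired.

```python
def battery_reflex(state, steps_to_empty, threshold=50):
    """Pavlovian control: the prediction is learned (e.g., as a GVF),
    while the response below is built in."""
    if steps_to_empty(state) < threshold:   # learned prediction crosses a threshold...
        return "return_to_charger"          # ...triggering a fixed, reflexive response
    return None                             # otherwise the normal policy chooses
```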

Environmental Models from GVFs

The ability to predict and control diverse signals can constitute a powerful kind of environmental model. Many GVFs together can capture how the world works in ways that support efficient planning and reward maximization.

Section 17.2

Temporal Abstraction via Options

The MDP formalism can be applied to tasks at many different time scales—from muscle twitches to career decisions. But can we formalize all levels in a single MDP? Options provide a way to do exactly this.

The Options Framework

An option is a generalized notion of action that extends over multiple time steps. It consists of:

Policy π_ω: how to select actions while the option executes

Termination function γ_ω: a state-dependent continuation probability; the option terminates in state s with probability 1 − γ_ω(s)

Formal Definition

An option ω = ⟨π_ω, γ_ω⟩ executes as follows:

1. At time t, select action A_t from π_ω(·|S_t)

2. At time t+1, terminate with probability 1 − γ_ω(S_{t+1})

3. If not terminated, continue from step 1 with the new state
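A sketch of this execution loop under some assumptions: a gym-style `env.step`, a fixed discount for the accumulated reward, and illustrative function names; it is not the book's pseudocode.

```python
import random

def execute_option(env, state, option_policy, option_termination, gamma=0.99):
    """Run one option to completion; return its discounted reward, the state it
    ends in, and its duration.

    option_policy(state)      -- the option's internal action choice
    option_termination(state) -- gamma_omega(state) in [0, 1]; the option stops
                                 with probability 1 - gamma_omega(state)
    Assumes a gym-style env.step(action) -> (next_state, reward, done, info).
    """
    total, discount, steps = 0.0, 1.0, 0
    while True:
        action = option_policy(state)                   # step 1: act with pi_omega
        state, reward, done, _ = env.step(action)
        total += discount * reward
        discount *= gamma
        steps += 1
        if done or random.random() < 1 - option_termination(state):
            return total, state, steps                  # step 2: option terminated
        # step 3: not terminated, continue from the new state
```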

Actions as Special Options

Primitive actions are special cases of options. Each action a corresponds to an option whose policy always picks a and whose termination function is identically zero, so the option always terminates after one step. Options therefore effectively extend the action space.

Bellman Equations with Options

With options, we can define option-value functions and hierarchical policies. The Bellman equation generalizes to:

v_\pi(s) = \sum_{\omega \in \Omega(s)} \pi(\omega|s) \left[ r(s, \omega) + \sum_{s'} p(s'|s, \omega)\, v_\pi(s') \right]

where Ω(s) is the set of options available in state s, and r(s, ω) and p(s'|s, ω) are the option models.

Option Models

Reward Model

Expected cumulative discounted reward during option execution:

r(s, \omega) = \mathbb{E}\left[R_1 + \gamma R_2 + \cdots + \gamma^{\tau-1} R_\tau\right]

where τ is the time step at which the option terminates.

Transition Model

Discounted probability of terminating in each state:

p(s'|s, \omega) = \sum_{k=1}^{\infty} \gamma^k \Pr\{S_k = s',\, \tau = k\}

Benefits of Options

Planning with options can be much faster because: (1) fewer options need to be considered than primitive actions; (2) each option can jump over many time steps. The tradeoff is that the best hierarchical policy may be suboptimal compared to primitive-action policies.
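As an illustration of planning over options, here is a hedged sketch of value iteration using tabular option models r(s, ω) and p(s'|s, ω) (assumed to be given as arrays); discounting is already folded into the models, as in the equations above.

```python
import numpy as np

def option_value_iteration(r, p, n_iters=100):
    """Value iteration over option models (discounting is already inside the models).

    r -- array [n_states, n_options]: expected return r(s, omega) during the option
    p -- array [n_states, n_options, n_states]: discounted probabilities p(s'|s, omega)
    """
    n_states, n_options, _ = p.shape
    v = np.zeros(n_states)
    for _ in range(n_iters):
        # One backup over options instead of primitive actions.
        q = r + np.einsum("sot,t->so", p, v)   # q(s, omega)
        v = q.max(axis=1)
    return v, q.argmax(axis=1)                 # values and an option-greedy policy
```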

Section 17.3

Observations and State

Throughout this book, we've written value functions as functions of the environment's state. But in reality, agents only receive partial observations—some aspects of the world are hidden. This section explores how to handle partial observability.

From States to Observations

In partial observability, the environment emits observations rather than states. The agent sees:

A_0, O_1, A_1, O_2, A_2, O_3, …

where O_t ∈ 𝒪 is an observation that depends on, but doesn't fully reveal, the underlying state.

History and State

The history H_t = A_0, O_1, …, A_{t-1}, O_t is everything the agent has seen. A state is a compact summary of history: S_t = f(H_t).

The Markov Property

A state function f is Markov if it retains all information necessary to predict any future sequence. Formally:

f(h) = f(h') \Rightarrow p(\tau|h) = p(\tau|h')

for all histories h, h' and any test τ (future action-observation sequence).

State-Update Functions

We want a compact, incrementally updatable state. The state-update function computes the next state from the current state and new data:

S_{t+1} = u(S_t, A_t, O_{t+1})

Agent Architecture with State Update

The state-update function is central to any agent handling partial observability:

1. Observation O_t arrives from the environment

2. State-update: S_t = u(S_{t-1}, A_{t-1}, O_t)

3. Policy selects action: A_t ~ π(·|S_t)

4. Model and planner use S_t for planning (a minimal loop is sketched below)
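A minimal sketch of this loop, assuming an environment whose `reset` and `step` return only observations, and using illustrative names for the state-update function and policy.

```python
def run_agent(env, u, policy, s0, n_steps=1000):
    """Agent loop under partial observability: act on an agent state, not the
    environment state.  u is the state-update function S_t = u(S_{t-1}, A_{t-1}, O_t).
    Assumes (hypothetically) that env.reset() and env.step(a) each return only
    the next observation."""
    state, action = s0, None              # no action has been taken yet
    obs = env.reset()
    for _ in range(n_steps):
        state = u(state, action, obs)     # steps 1-2: fold new data into the state
        action = policy(state)            # step 3: policy conditions on agent state
        obs = env.step(action)            # step 4: a model/planner could also use state
    return state
```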

Approaches to Partial Observability

POMDPs (Belief States)

Assume a latent state X_t exists. The "belief state" is a distribution over the possible latent states: s_t[i] = Pr{X_t = i | H_t}. It is updated via Bayes' rule, but scales poorly.

PSRs (Predictive State Representations)

Define state as predictions about future observations rather than beliefs about latent states. The semantics are grounded in observable data, making learning potentially easier.

k-th Order History

Simple but effective: use the last k observations and actions as the state, S_t = O_t, A_{t-1}, O_{t-1}, …, A_{t-k}. Not Markov, but often works well in practice.
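For example, the k-th order history approach can itself be written as a state-update function u; the helper below is an illustrative sketch, not a standard API.

```python
from collections import deque

def make_kth_order_update(k):
    """Return a state-update function u whose agent state is the last k
    (action, observation) pairs -- the k-th order history approach."""
    def u(state, action, obs):
        window = deque(state, maxlen=k)   # previous pairs, oldest first
        window.append((action, obs))      # the oldest pair falls off automatically
        return tuple(window)              # small, hashable agent state
    return u

u = make_kth_order_update(k=4)
s = ()                                    # empty history to start
s = u(s, None, "o1")
s = u(s, "a1", "o2")                      # grows up to k pairs, then slides
```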

One-Step Predictions Suffice

If a state function is incrementally updatable, it's Markov if and only if all one-step predictions can be accurately made. Accurate one-step predictions can be iterated to predict any longer sequence—though errors compound!

Section 17.4

Designing Reward Signals

Designing a reward signal is a critical part of any RL application. The reward signal frames the designer's goal and assesses progress toward it. Unlike supervised learning, we don't need to know the correct actions—but the success of RL strongly depends on reward design.

The Challenge

RL agents can discover unexpected ways to obtain reward—some desirable, others dangerous. When we specify goals indirectly via rewards, we don't know how closely the agent will fulfill our true desires until learning is complete. This is the classic "reward hacking" problem.

The Sparse Reward Problem

Even with a simple goal, delivering reward frequently enough for learning can be challenging. State-action pairs that deserve reward may be rare, causing the agent to wander aimlessly (the "plateau problem").

Value Function Initialization

Rather than adding supplemental rewards (which can mislead), initialize the value function with a good guess:

\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) + v_0(s)

where v_0(s) encodes prior knowledge about state values.
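A hedged sketch of this idea with linear function approximation and semi-gradient TD(0): only the weights w are learned, while the prior guess v0 stays fixed; the function names are illustrative, and features(s) is assumed to return a NumPy vector.

```python
import numpy as np

def v_hat(s, w, features, v0):
    """Linear value estimate plus a fixed prior guess v0(s).
    Only w is learned; v0 encodes the designer's prior knowledge."""
    return np.dot(w, features(s)) + v0(s)

def td0_update(w, s, s_next, reward, features, v0, gamma=0.99, alpha=0.1):
    """Semi-gradient TD(0) on the weights; the gradient of v_hat w.r.t. w is x(s)."""
    delta = reward + gamma * v_hat(s_next, w, features, v0) - v_hat(s, w, features, v0)
    return w + alpha * delta * features(s)
```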

Reward Shaping

A technique borrowed from psychology: gradually modify the task as learning proceeds, starting with easy problems and then increasing the difficulty. Each stage makes the next feasible because the agent now encounters reward more frequently.

Learning from Experts

Imitation Learning

Learn directly from expert demonstrations via supervised learning. Simple but limited to matching the expert's performance.

Inverse RL

Infer the expert's reward signal from their behavior, then optimize it. Can potentially exceed the expert, but requires strong assumptions.

Agent's Goal vs Designer's Goal

Counterintuitively, an agent's goal shouldn't always match the designer's goal! Due to constraints (limited computation, limited time), optimizing a different proxy goal can sometimes get closer to the true goal than pursuing it directly. Evolution gave us taste preferences rather than direct nutritional assessment for this reason.

Intrinsic Motivation

Reward signals can depend on internal factors—memories, motivational states, or even how much learning progress is being made. This enables intrinsically-motivated RL: learning for curiosity, not just external reward.

Section 17.5

Remaining Issues

Suppose we grant everything in this book and this chapter. What remains? Here we highlight six further issues that future research must address.

1. Online Deep Learning

Current deep learning methods struggle with incremental, online settings. "Catastrophic interference" causes new learning to overwrite old knowledge. Techniques like replay buffers work around this rather than solving it.

2. Representation Learning

How can we use experience to learn inductive biases such that future learning generalizes better? This "meta-learning" problem dates to the 1950s-60s and remains unsolved. Learning the state-update function is part of this challenge.

3. Planning with Learned Models

Planning works great when models are given (chess, Go), but learning models from data and using them for planning is rare. Model learning must be selective—focus on key consequences of important options, not everything.

4. Automatic Task Design

Currently humans design the tasks agents learn. We want agents to choose their own tasks—subtasks, auxiliary predictions, building blocks for future problems. Tasks should be explicitly represented and searchable.

5. Computational Curiosity

When external reward is scarce, agents could maximize learning progress as an intrinsic reward—implementing curiosity. This enables something like "play": learning skills that might help with future unknown tasks.

6. Safety

Methods to make it acceptably safe to embed RL agents in physical environments. This is one of the most pressing areas for future research, discussed further in the next section.

The Questions Are the Answers

If GVF design is automated, tasks become explicitly represented questions in the machine. Tasks could be built hierarchically like features in a neural network. The tasks are the questions; the network contents are the answers.

Section 17.6

RL and the Future of AI

Since the first edition of this book in the 1990s, AI has transformed from promise to applications changing millions of lives. Deep reinforcement learning has produced some of the most remarkable developments. But true AI—complete, interactive agents with general adaptability—remains a distant goal.

RL's Broader Impact

Understanding the Mind

RL connections to psychology and neuroscience shed light on how the mind emerges from the brain.

Human Decision Support

RL policies can advise decision makers in education, healthcare, transportation, energy, and public policy.

Long-term Consequences

RL's key feature—considering long-term consequences—is crucial for high-stakes decisions affecting our planet.

Prometheus and Pandora

Herbert Simon reminded us of the eternal conflict between the promise and perils of new knowledge—the myths of Prometheus (who stole fire for humanity) and Pandora (whose box released perils on the world). We are designers of our future, not mere spectators.

Safety Considerations

RL agents can find unexpected ways to obtain reward—Goethe's "Sorcerer's Apprentice" and Wiener's "Monkey's Paw" warn of optimization that gives you what you asked for, not what you intended.

  • Careful reward design is essential for real-world deployment
  • Risk management techniques from control engineering can help
  • Ensuring the agent's goal is attuned to our own remains a challenge

Simulation vs Real World

Simulation Benefits

  • Safe exploration without real damage
  • Unlimited data at low cost
  • Faster-than-real-time learning
  • Perfect reproducibility

Real World Necessity

  • Simulations often lack fidelity
  • Human behavior is hard to simulate
  • Full potential requires real embedding
  • RL is designed for online learning

Learning While Acting Safely

How do you ensure an agent gets enough experience to learn a good policy while not harming its environment? This is similar to challenges control engineers have faced for decades—careful modeling, validation, testing, and theoretical guarantees about stability and convergence.

Looking Forward

As RL moves into the real world, developers must follow best practices from related technologies while extending them. The benefits of AI can outweigh its disruption, but we have an obligation to ensure Prometheus keeps the upper hand.

A Call to Action

"As designers of our future and not mere spectators, the decisions we make can tilt the scale in Prometheus' favor." — Herbert Simon

RL can help improve the quality, fairness, and sustainability of life on our planet. With careful development, the promise of reinforcement learning can be realized while managing its perils.

Chapter Summary

This final chapter has explored the frontiers of reinforcement learning—topics that push beyond the MDP framework and into active research areas:

  • GVFs generalize value functions to predict arbitrary signals
  • Options enable planning at multiple time scales
  • State-update functions handle partial observability
  • Reward design is critical and challenging
  • Open problems include online deep learning, representation learning, and safety
  • RL's future will transform AI but requires careful development

The Journey Continues

This book has presented the foundations of a reinforcement learning approach to artificial intelligence. The focus has been on model-free and model-based methods working together, combined with function approximation, applied in off-policy training situations.

Much remains to be discovered. The problems highlighted here—online learning, representation learning, planning with learned models, automatic task design, curiosity-driven learning, and safety—are where the next breakthroughs will come. Reinforcement learning will be a critical component of agents with general adaptability, creativity, and the ability to learn quickly from experience.