Planning and Learning with Tabular Methods

info

In this chapter, you'll learn about:

  • Environment Model: What a model is in the context of RL.
  • Types of Models: The differences between sample models and distribution models.
  • Planning: How planning can be integrated with learning and acting.
  • The Dyna architecture: A unified framework for planning and learning.
  • Model Inaccuracy: Dealing with inaccurate or changing models.

In previous chapters, we explored Monte Carlo and Temporal-Difference (TD) methods for reinforcement learning (RL), which allow agents to learn from direct interaction with the environment. These methods update value estimates based on actual experiences, without requiring a model of the environment's dynamics.

However, humans and animals often plan ahead by imagining the consequences of their actions before they take them. They use internal models of the world to simulate possible futures and make informed decisions without directly experiencing every outcome. Similarly, in RL, agents can leverage models of the environment to improve learning efficiency and performance. This chapter introduces planning in RL and demonstrates how combining planning with learning leads to more effective tabular methods.

What is a Model?

A model in RL represents the agent's knowledge about the environment's dynamics. Specifically, it provides information about the possible next states and rewards, given a current state and action. Models enable agents to predict outcomes without direct interaction, allowing for planning and decision-making based on imagined scenarios.

Types of Models

There are two main types of models:

  1. Sample Models:

    • Definition: Given a state $s$ and action $a$, a sample model generates a possible next state $s'$ and reward $r$ according to the environment's dynamics.
    • Usage: Simulate experience by sampling outcomes as if interacting with the environment.
    • Advantages: Computationally efficient and requires less memory since it does not store full probability distributions.
    • Example: Simulating die rolls by drawing a random integer between 1 and 6 for each roll.
  2. Distribution Models:

    • Definition: Provides the full probability distribution over possible next states and rewards for each state-action pair.
    • Usage: Calculate expectations, variances, and other statistical properties by summing over all possible outcomes.
    • Advantages: More informative, allowing exact computation of expected values and risks.
    • Example: Specifying the probabilities of all possible sequences when rolling multiple dice.

When to Use Each Model

  • Sample Models are preferred when computational resources are limited or when the environment is too complex to model all probabilities explicitly.
  • Distribution Models are useful when precise calculations of expectations are necessary, such as in risk-sensitive applications.
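The difference between the two model types can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: the `DISTRIBUTION_MODEL` table, the single state-action pair `("s0", "go")`, and its probabilities. The distribution model is queried for an exact expectation, while the sample model returns one outcome per call and can only estimate that expectation by averaging.

```python
import random

# Hypothetical distribution model for a single state-action pair:
# (state, action) -> list of (probability, next_state, reward).
DISTRIBUTION_MODEL = {
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
}

def expected_reward(state, action):
    """Distribution model use: exact expectation by summing over all outcomes."""
    return sum(p * r for p, _, r in DISTRIBUTION_MODEL[(state, action)])

def sample_model(state, action, rng):
    """Sample model use: draw one (next_state, reward) outcome."""
    outcomes = DISTRIBUTION_MODEL[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

rng = random.Random(0)
exact = expected_reward("s0", "go")
draws = [sample_model("s0", "go", rng) for _ in range(5000)]
estimate = sum(r for _, r in draws) / len(draws)
print(f"exact expected reward: {exact:.3f}, sampled estimate: {estimate:.3f}")
```

For brevity, this sketch builds the sample model on top of the stored distribution. In practice a sample model could be any simulator that produces outcomes without ever representing the probabilities explicitly.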

Planning with Models

Planning refers to the process of using a model to improve the agent's policy or value function without additional real-world experience. By simulating experiences using the model, agents can update their value estimates and make better decisions.

Random Sample One-Step Tabular Q-Planning

An example of planning is Random Sample One-Step Tabular Q-Planning, which involves:

  1. Sample a State-Action Pair: Randomly select a state-action pair $(S, A)$ from the set of all possible pairs.
  2. Simulate Experience: Use the model to generate the next state $S'$ and reward $R$ for the selected state-action pair.
  3. Value Update: Perform a Q-learning update using the simulated experience: $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right]$
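The three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the model is a deterministic lookup table mapping `(state, action)` to `(reward, next_state)`, a `next_state` of `None` marks a terminal transition, and the two-state chain plus the default values of `alpha` and `gamma` are made up for the example.

```python
import random
from collections import defaultdict

def q_planning(model, actions, n_updates, alpha=0.1, gamma=0.95, seed=0):
    """Random-sample one-step tabular Q-planning on a deterministic model.

    model: dict (state, action) -> (reward, next_state); None means terminal.
    """
    rng = random.Random(seed)
    q = defaultdict(float)
    pairs = list(model)
    for _ in range(n_updates):
        s, a = rng.choice(pairs)        # 1. sample a state-action pair
        r, s_next = model[(s, a)]       # 2. simulate experience with the model
        best_next = 0.0 if s_next is None else max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])  # 3. Q-learning update
    return q

# Hypothetical two-state chain: moving right from state 1 ends the episode with reward 1.
chain_model = {
    (0, "left"): (0.0, 0),
    (0, "right"): (0.0, 1),
    (1, "left"): (0.0, 0),
    (1, "right"): (1.0, None),
}
q_values = q_planning(chain_model, ["left", "right"], n_updates=5000)
print(q_values[(0, "right")], q_values[(1, "right")])
```

No real interaction takes place here: every update uses a transition produced by the model, which is what distinguishes planning from direct learning.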

The Dyna Architecture

The Dyna architecture integrates planning, learning, and acting into a unified framework. It allows agents to learn from both real experiences and simulated experiences generated by a model.

Components of Dyna

  1. Direct Reinforcement Learning:

    • The agent interacts with the environment, collects real experiences, and updates its value estimates (e.g., using Q-learning).
  2. Model Learning:

    • The agent learns a model of the environment's dynamics based on observed transitions.
    • For deterministic environments, this can be as simple as storing the observed next state and reward for each state-action pair.
  3. Planning (Using Simulated Experience):

    • The agent generates simulated experiences using the learned model.
    • Performs planning updates using these simulated experiences to improve the value function.
  4. Search Control:

    • Determines which state-action pairs to use during planning.
    • Can involve strategies like random sampling or prioritizing certain transitions.

The Dyna-Q Algorithm

$$
\begin{align*}
&\textbf{Initialize } Q(s, a) \text{ and } \text{Model}(s, a) \text{ for all } s \in \mathcal{S},\ a \in \mathcal{A}(s) \\
&\textbf{Loop forever:} \\
&\qquad S \leftarrow \text{current (nonterminal) state} \\
&\qquad A \leftarrow \epsilon\text{-greedy}(S, Q) \\
&\qquad \text{Take action } A \text{; observe resultant reward, } R \text{, and state, } S' \\
&\qquad Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right] \\
&\qquad \text{Model}(S, A) \leftarrow R, S' \text{ (assuming deterministic environment)} \\
&\qquad \textbf{Repeat } n \text{ times:} \\
&\qquad\qquad S \leftarrow \text{random previously observed state} \\
&\qquad\qquad A \leftarrow \text{random action previously taken in } S \\
&\qquad\qquad R, S' \leftarrow \text{Model}(S, A) \\
&\qquad\qquad Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right]
\end{align*}
$$
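A compact Python sketch of this loop may help make the steps concrete. Several details are assumptions made for illustration: the `step(state, action) -> (reward, next_state, done)` environment interface, the five-cell corridor used to exercise it, the hyperparameter defaults, and the episodic framing with a `done` flag. As a simplification, planning draws uniformly from previously observed state-action pairs rather than first drawing a state and then an action.

```python
import random
from collections import defaultdict

def dyna_q(step, start, actions, episodes, n_planning,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q for a deterministic, episodic environment (sketch)."""
    rng = random.Random(seed)
    q = defaultdict(float)
    model = {}  # (state, action) -> (reward, next_state, done)

    def greedy_value(s):
        return max(q[(s, a)] for a in actions)

    def update(s, a, r, s_next, done):
        # One-step Q-learning update; terminal states have value 0.
        target = r + (0.0 if done else gamma * greedy_value(s_next))
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = greedy_value(s)
                a = rng.choice([b for b in actions if q[(s, b)] == best])
            r, s_next, done = step(s, a)
            update(s, a, r, s_next, done)        # direct RL from real experience
            model[(s, a)] = (r, s_next, done)    # model learning
            for _ in range(n_planning):          # planning from simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps_next, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps_next, pdone)
            s = s_next
    return q

def corridor_step(state, action):
    """Hypothetical corridor with cells 0..4; reaching cell 4 gives reward 1."""
    next_state = min(max(state + action, 0), 4)
    done = next_state == 4
    return (1.0 if done else 0.0), next_state, done

q = dyna_q(corridor_step, start=0, actions=[-1, 1], episodes=50, n_planning=20)
print({s: round(q[(s, 1)], 3) for s in range(4)})
```

Setting `n_planning=0` reduces this sketch to plain one-step Q-learning, which makes it easy to compare learning with and without planning on the same environment.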

Dealing with Inaccurate Models

In practice, models may become inaccurate for several reasons, including:

  1. Incomplete Knowledge: Early on, the agent has not explored every state-action pair, so the model lacks information about unvisited transitions.

    • Effects:
      • Planning can only be performed for known transitions.
      • Initially limited, but improves as the agent explores more of the environment.
  2. Changing Environments: The environment’s dynamics may change over time, rendering the model outdated.

    • Effects:
      • Can produce suboptimal or incorrect policies if the model fails to reflect the environment’s current dynamics.
      • May hurt performance if the agent relies heavily on an inaccurate model.

Handling Inaccurate Models

To address the issue of model inaccuracies due to changing environments, the Dyna-Q+ algorithm introduces an exploration bonus:

  • Exploration Bonus:

    • Adds a small bonus to the reward used in planning for state-action pairs not visited recently.
    • The bonus increases with the time since the last real visit: $\text{Bonus} = \kappa \sqrt{\tau}$
      • $\kappa$ is a small constant.
      • $\tau$ is the number of time steps since the last visit to the state-action pair.
  • Effect:

    • Encourages the agent to re-explore parts of the environment it has not visited in a while.
    • Helps the agent discover changes in the environment and update its model accordingly.
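As a small illustration of the bonus, the reward used in a planning update can be computed as follows. The function name `planning_reward`, its argument names, and the default `kappa=1e-3` are assumptions made for this sketch. The agent is assumed to record the time step of the last real visit to each state-action pair.

```python
import math

def planning_reward(model_reward, current_step, last_visit_step, kappa=1e-3):
    """Reward used in a Dyna-Q+ planning update: r + kappa * sqrt(tau)."""
    tau = current_step - last_visit_step  # time steps since the last real visit
    return model_reward + kappa * math.sqrt(tau)

# A pair last tried 10,000 steps ago earns a bonus of 0.001 * sqrt(10000) = 0.1.
stale = planning_reward(0.0, current_step=10_000, last_visit_step=0)
# A pair visited on the current step earns no bonus.
fresh = planning_reward(1.0, current_step=5, last_visit_step=5)
print(stale, fresh)
```

Because the bonus grows with the square root of the elapsed time, long-untried pairs become steadily more attractive during planning but never dominate abruptly.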

When the model is incorrect, planning often yields a suboptimal policy. However, if the model is optimistic, predicting greater rewards or better transitions than are actually possible, the resulting policy leads the agent to try to exploit those nonexistent opportunities, so it quickly discovers and corrects the errors.

The general problem is another manifestation of exploration vs. exploitation:

  • Exploration: Trying actions to improve the model.
  • Exploitation: Behaving optimally with respect to the current (possibly flawed) model.

Balancing the need to explore changes in the environment against the desire to exploit known dynamics is challenging. No solution is perfect for all cases, but simple heuristics can be effective, such as the Dyna-Q+ reward bonus for state-action pairs that have not been visited recently.

Expected vs. Sample Updates

Some planning updates rely on expected transitions (summed over all outcomes), while others use sampled transitions.

Expected Updates

  • Definition: Considers all possible next states and their probabilities to compute an exact expectation.
  • Advantages: Free of sampling error, so each update yields a more accurate estimate.
  • Disadvantages: Potentially expensive, as it enumerates each possible outcome.

Sample Updates

  • Definition: Uses a single outcome drawn from the distribution.
  • Advantages: Often computationally cheaper, faster to update multiple times.
  • Disadvantages: Introduces sampling variance; may require many samples for high accuracy.

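An expected update costs roughly as much as $b$ sample updates, where $b$ is the branching factor (the number of possible next states). For large branching factors and a limited computation budget, spending that budget on many sample updates frequently reduces error faster than a single expected update.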
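The two kinds of update can be compared on a single state-action pair with a short sketch. All numbers here are hypothetical: the successor values in `Q_NEXT`, the two-outcome distribution in `OUTCOMES`, and the discount `GAMMA`. The sample updates use a step size of $1/k$, an assumption that turns them into a running average of sampled targets.

```python
import random

GAMMA = 0.9
ACTIONS = ["a0", "a1"]
# Hypothetical successor values Q(s', a') and outcome distribution for one pair (s, a).
Q_NEXT = {("x", "a0"): 1.0, ("x", "a1"): 2.0, ("y", "a0"): 0.0, ("y", "a1"): 4.0}
OUTCOMES = [(0.5, 0.0, "x"), (0.5, 1.0, "y")]  # (probability, reward, next_state)

def backup(reward, next_state):
    """One-step Q-learning target for a single outcome."""
    return reward + GAMMA * max(Q_NEXT[(next_state, a)] for a in ACTIONS)

def expected_update():
    """Expected update: weight every possible outcome by its probability."""
    return sum(p * backup(r, s2) for p, r, s2 in OUTCOMES)

def sample_updates(n, rng):
    """Apply n sample updates with step size 1/k (a running average of targets)."""
    q_sa = 0.0
    weights = [p for p, _, _ in OUTCOMES]
    for k in range(1, n + 1):
        _, r, s2 = rng.choices(OUTCOMES, weights=weights, k=1)[0]
        q_sa += (backup(r, s2) - q_sa) / k
    return q_sa

exact_value = expected_update()
sampled_value = sample_updates(5000, random.Random(0))
print(f"expected update: {exact_value:.3f}, after 5000 sample updates: {sampled_value:.3f}")
```

With only two outcomes the expected update is clearly the better deal. The trade-off shifts toward sample updates as the number of possible outcomes grows, because each expected update must enumerate all of them.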

Recap

1. What is a model in reinforcement learning?

2. What distinguishes a sample model from a distribution model in reinforcement learning?

3. What is the main advantage of the Dyna architecture?

4. How does the Dyna-Q+ algorithm address changing environments?

5. What is a key challenge when using inaccurate models in planning?

warning

To be included in a future iteration of this chapter:

  • Planning Techniques:
    • Prioritized Sweeping
    • Trajectory Sampling
  • Decision-Time Planning:
    • Heuristic Search
    • Rollout Algorithms
    • Monte Carlo Tree Search (MCTS)

What's Next

In the next chapter, we move beyond tabular methods to explore function approximation. By representing value functions with parameterized functions, we can generalize across states and actions, making reinforcement learning scalable to complex environments.