Dynamic Programming
In this chapter, you'll learn about:
- Policy Evaluation: Estimating the value function for a given policy.
- Policy Improvement: Finding better policies using value functions.
- Policy Iteration: Alternating between policy evaluation and improvement to find optimal policies.
- Generalized Policy Iteration (GPI): A framework that combines evaluation and improvement steps.
Dynamic Programming (DP) is a class of algorithms used to solve finite Markov Decision Processes (MDPs) when we have access to a model of the environment's dynamics. DP methods make use of the Bellman equations defined in the previous chapter to iteratively compute value functions and derive optimal policies. In this chapter, we'll explore how DP techniques address the tasks of policy evaluation and policy control.
Policy Evaluation (Prediction)
Policy Evaluation is the task of determining how good a given policy $\pi$ is by estimating the state-value function $v_\pi(s)$. In other words, we want to compute the expected return for each state when the agent follows the policy $\pi$.
Iterative Policy Evaluation
The Bellman equation provides a recursive relationship for the value of a state under a given policy $\pi$:
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma \, v_\pi(s') \big]$$
Similarly, for the action-value function:
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \Big]$$
The state-value equation forms the basis for iterative policy evaluation, where we start with an arbitrary initial estimate of the value function and iteratively update it using the Bellman equation until it converges to the true value function $v_\pi$.
Each iteration applies the Bellman update to all states. The update rule for each state $s$ is:
$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma \, v_k(s') \big]$$
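To make this concrete, here is a minimal sketch of iterative policy evaluation in Python. The model format is an assumption for illustration: `P[s][a]` is a list of `(prob, next_state, reward)` triples and `policy[s]` maps each action to its selection probability; none of these names come from a particular library.

```python
import numpy as np

def policy_evaluation(P, policy, n_states, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman expectation update until the largest
    change across all states falls below the threshold `theta`."""
    V = np.zeros(n_states)                     # arbitrary initial estimate
    while True:
        delta = 0.0
        for s in range(n_states):
            # Expected return from s under the policy, using the current
            # estimates V(s') of the successor states.
            v_new = sum(
                policy[s][a] * prob * (reward + gamma * V[s_next])
                for a in policy[s]
                for prob, s_next, reward in P[s][a]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                       # in-place (sweep) update
        if delta < theta:
            return V
```

Updating `V[s]` in place means that states visited later in the same sweep already see the new estimates; this converges to $v_\pi$ just like a two-array version and often does so in fewer sweeps.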
We can visualize the propagation of values in a simple 4x4 grid world in the simulation below:
- All state values are initialized to zero.
- The agent moves in four possible directions (up, down, left, right), and if it bumps into a wall, it stays in place.
- The agent follows an optimal policy (defined in the previous chapter).
- The noise parameter is the probability of the agent taking a suboptimal action.
- In this example, we don't verify convergence. We'll run it for a predefined number of steps.
The Bellman update is applied to each state, propagating information from the terminal states to nearby states. The values start to reflect the expected number of steps to the terminal state. With each step, the values become more accurate as information propagates further across the grid. Eventually, the value function converges to $v_\pi$.
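For reference, a rough offline version of this grid-world experiment might look like the sketch below. The terminal corner, the reward of -1 per step, the "head for the corner" greedy policy, and the way the noise is split over the remaining actions are all assumptions made for illustration; the interactive simulation may differ in its details.

```python
import numpy as np

N = 4                                # 4x4 grid world
TERMINAL = (0, 0)                    # assumed terminal state (top-left corner)
GAMMA = 1.0                          # undiscounted episodic task
NOISE = 0.2                          # probability mass given to non-greedy actions
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic grid dynamics: bumping into a wall leaves the agent in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = r, c
    return (nr, nc)

def greedy_action(state):
    """Assumed 'optimal' policy for this layout: head for the terminal corner."""
    r, c = state
    return "up" if r > 0 else "left"

V = np.zeros((N, N))                 # all state values initialized to zero
for sweep in range(10):              # fixed number of sweeps, no convergence check
    for r in range(N):
        for c in range(N):
            if (r, c) == TERMINAL:
                continue             # terminal value stays at zero
            value = 0.0
            for a in ACTIONS:
                # Noisy policy: greedy action with prob 1 - NOISE,
                # the three other actions share the remaining mass.
                p = (1 - NOISE) if a == greedy_action((r, c)) else NOISE / 3
                value += p * (-1 + GAMMA * V[step((r, c), a)])  # reward of -1 per step
            V[r, c] = value
    print(f"after sweep {sweep + 1}:\n{np.round(V, 1)}")
```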
Policy Improvement
Once we have an estimate of the value function for a given policy, we can use it to improve the policy. The idea is to make the policy greedy with respect to the current value function $v_\pi$. This means that in each state, the agent will choose the action that maximizes the expected return, given the current estimates of state values.
Formally, the greedy policy $\pi'$ is defined as:
$$\pi'(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma \, v_\pi(s') \big]$$
The policy improvement theorem guarantees that the greedy policy $\pi'$ will be at least as good as the original policy $\pi$, and strictly better if $\pi$ was not already optimal. This means that we can iteratively improve the policy by alternating between policy evaluation and greedification.
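A minimal sketch of the greedification step, assuming the same illustrative model format as before (`P[s][a]` holding `(prob, next_state, reward)` triples):

```python
import numpy as np

def greedy_policy(P, V, n_states, n_actions, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V:
    in each state, pick the action with the best one-step lookahead."""
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = np.zeros(n_actions)
        for a in range(n_actions):
            # Expected return of taking action a in s, then following V's estimates.
            q[a] = sum(prob * (reward + gamma * V[s_next])
                       for prob, s_next, reward in P[s][a])
        policy[s] = int(np.argmax(q))
    return policy
```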
Policy Iteration
Policy iteration is an algorithm that combines policy evaluation and policy improvement to find the optimal policy. It alternates between evaluating the current policy and improving it, and continues until the policy no longer changes, which indicates that the optimal policy has been found.
The steps of policy iteration are:
- Policy Evaluation: Evaluate the current policy $\pi$ to obtain $v_\pi$.
- Policy Improvement: Improve the policy by acting greedily with respect to the current value function, yielding a new policy $\pi'$.
- Repeat steps 1 and 2 until the policy converges to the optimal policy $\pi_*$, as sketched in the code after this list.
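Putting the two steps together, the sketch below reuses the hypothetical `policy_evaluation` and `greedy_policy` helpers drafted earlier in this chapter and stops as soon as greedification leaves the policy unchanged:

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(n_states, dtype=int)       # arbitrary initial (deterministic) policy
    while True:
        # 1. Policy evaluation: express the deterministic policy as action
        #    probabilities and compute its value function.
        probs = {s: {int(policy[s]): 1.0} for s in range(n_states)}
        V = policy_evaluation(P, probs, n_states, gamma)
        # 2. Policy improvement: act greedily with respect to V.
        new_policy = greedy_policy(P, V, n_states, n_actions, gamma)
        if np.array_equal(new_policy, policy):
            return new_policy, V                 # policy is stable, hence optimal
        policy = new_policy
```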
Generalized Policy Iteration (GPI)
In practice, we don’t always have to run policy evaluation to completion before improving the policy. We can interleave evaluation and improvement steps more flexibly. This leads to the concept of Generalized Policy Iteration (GPI), where policy evaluation and improvement are performed simultaneously and incrementally.
Value Iteration
Value iteration is a special case of GPI in which policy evaluation and improvement are combined into a single step. Instead of fully evaluating the policy before improving it, we apply a single Bellman update to each state that takes the maximum over actions, which amounts to evaluating and improving at the same time. Once the values converge, the policy that acts greedily with respect to them is optimal.
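A sketch of value iteration under the same assumed model format; the only change from policy evaluation is that the update takes a maximum over actions (the Bellman optimality update) instead of an expectation under a fixed policy:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-6):
    """Sweep the Bellman optimality update until the values stop changing,
    then extract the greedy policy from the converged value function."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Max over actions instead of an expectation under a fixed policy.
            best = max(sum(prob * (reward + gamma * V[s_next])
                           for prob, s_next, reward in P[s][a])
                       for a in range(n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract the greedy policy from the converged values.
    policy = np.array([
        int(np.argmax([sum(prob * (reward + gamma * V[s_next])
                           for prob, s_next, reward in P[s][a])
                       for a in range(n_actions)]))
        for s in range(n_states)
    ])
    return policy, V
```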
Asynchronous Dynamic Programming
In synchronous DP methods, we sweep through all states in a systematic order. In contrast, asynchronous DP methods update the values of states in any order, which can lead to faster convergence in certain situations. For example, we can prioritize updating states that are most relevant or have recently changed values.
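One simple way to realize this idea is to always update the state whose value currently looks most inconsistent with its successors (the largest Bellman residual). The sketch below is a simplified stand-in for full prioritized sweeping, again using the illustrative model format from earlier:

```python
import numpy as np

def bellman_backup(P, V, s, n_actions, gamma):
    """Greedy (value-iteration style) backup for a single state."""
    return max(sum(prob * (reward + gamma * V[s_next])
                   for prob, s_next, reward in P[s][a])
               for a in range(n_actions))

def async_value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-6):
    """Asynchronous DP: rather than sweeping states in a fixed order, repeatedly
    update the single state whose current value is most inconsistent with its
    successors (the largest Bellman residual)."""
    V = np.zeros(n_states)
    while True:
        residuals = [abs(bellman_backup(P, V, s, n_actions, gamma) - V[s])
                     for s in range(n_states)]
        s = int(np.argmax(residuals))            # the most 'urgent' state
        if residuals[s] < theta:                 # every state is nearly consistent
            return V
        V[s] = bellman_backup(P, V, s, n_actions, gamma)
```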
Recap
In this chapter, we explored how dynamic programming can be used to solve MDPs:
- Policy Evaluation: Estimating the value function for a given policy.
- Policy Improvement: Improving the policy by acting greedily with respect to the value function.
- Policy Iteration: Alternating between evaluation and improvement to find the optimal policy.
- Generalized Policy Iteration: A flexible framework that interleaves evaluation and improvement steps.
- Value Iteration: A special case of GPI that combines evaluation and improvement in a single step.
- Asynchronous DP: Techniques for updating state values in an arbitrary order to speed up convergence.