Skip to main content

Dynamic Programming

info

In this chapter, you'll learn about:

  • Policy Evaluation: Estimating the value function for a given policy.
  • Policy Improvement: Finding better policies using value functions.
  • Policy Iteration: Alternating between policy evaluation and improvement to find optimal policies.
  • Generalized Policy Iteration (GPI): A framework that combines evaluation and improvement steps.

Dynamic Programming (DP) is a class of algorithms used to solve finite Markov Decision Processes (MDPs) when we have access to a model of the environment's dynamics. DP methods make use of the Bellman equations defined in the previous chapter to iteratively compute value functions and derive optimal policies. In this chapter, we'll explore how DP techniques address the tasks of policy evaluation and policy control.

Policy Evaluation (Prediction)

Policy Evaluation is the task of determining how good a given policy is by estimating the state-value function vπ(s).v_\pi(s). In other words, we want to compute the expected return for each state when the agent follows the policy π\pi.

Iterative Policy Evaluation

The Bellman equation provides a recursive relationship for the value of a state under a given policy π\pi:

vπ(s)=aπ(as)s,rp(s,rs,a)[r+γvπ(s)]v_\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_\pi(s') \right]

Similarly, for action-value function:

qπ(s,a)=s,rp(s,rs,a)[r+γmaxaqπ(s,a)]q_\pi(s,a) = \sum_{s',r} p(s',r|s,a) [r + \gamma \max_{a'}q_\pi(s',a')]

This equation forms the basis for iterative policy evaluation, where we start with an arbitrary initial estimate of the value function and iteratively update it using the Bellman equation until it converges to the true value function vπv_\pi.

Each iteration applies the Bellman update to all states. The update rule for each state ss is:

vk+1(s)aπ(as)s,rp(s,rs,a)[r+γvk(s)]v_{k+1}(s) \leftarrow \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_k(s') \right]
Iterative Policy Evaluation Algorithm
Input π, the policy to be evaluatedAlgorithm parameter θ>0, a small threshold determiningaccuracy of estimationInitialize V(s), for all sS+, arbitrarily except that V(terminal)=0Loop:Δ0Loop for each sS:vV(s)V(s)aπ(as)s,rp(s,rs,a)[r+γV(s)]Δmax(Δ,vV(s))untilΔ<θ\begin{align*} &\textbf{Input } \pi, \text{ the policy to be evaluated} \\ &\textbf{Algorithm parameter } \theta > 0, \text{ a small threshold determining} \\ &\quad \text{accuracy of estimation}\\ &\textbf{Initialize } V(s), \text{ for all } s \in \mathcal{S}^+, \text{ arbitrarily except that }\\ &\quad V(\text{terminal}) = 0 \\[1em] &\textbf{Loop:} \\ &\qquad \Delta \leftarrow 0 \\ &\qquad \textbf{Loop for each } s \in \mathcal{S}:\\ &\qquad\qquad v \leftarrow V(s) \\ &\qquad\qquad V(s) \leftarrow \textstyle \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) [r + \gamma V(s')] \\ &\qquad\qquad \Delta \leftarrow \max (\Delta, |v - V(s)|) \\[1em] &\textbf{until} \quad \Delta < \theta \end{align*}
Grid World

We can visualize the propagation of values in a simple 4x4 grid world in the simulation below:

  • All state values are initialized to zero.
  • The agent moves in four possible directions (up, down, left, right), and if it bumps into a wall, it stays in place.
  • The agent follows an optimal policy (defined in the previous chapter).
  • The noise parameter is the probability of the agent taking a suboptimal action.
  • In this example, we don't verify convergence. We'll run it for a predefined number of steps.
Loading...

The Bellman update is applied to each state, propagating information from the terminal states to nearby states. The values start to reflect the expected number of steps to the terminal state. With each step, the values become more accurate as information propagates further across the grid. Eventually, the value function converges

Policy Improvement

Once we have an estimate of the value function for a given policy, we can use it to improve the policy. The idea is to make the policy greedy with respect to the current value function vπ(s)v_\pi(s). This means that in each state, the agent will choose the action that maximizes the expected return, given the current estimates of state values.

Formally, the greedy policy π(s)\pi'(s) is defined as:

π(s)argmaxaqπ(s,a)=argmaxas,rp(s,rs,a)[r+γvπ(s)]\begin{align*} \pi'(s) &\doteq \arg\max_{a} q_\pi(s,a)\\ &= \arg\max_{a} \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_\pi(s') \right] \end{align*}
Policy Improvement Theorem

The policy improvement theorem guarantees that the greedy policy π\pi' will be at least as good as the original policy π\pi, and strictly better if π\pi was not already optimal. This means that we can iteratively improve the policy by alternating between policy evaluation and greedification.

vπ(s)vπ(s)v_{\pi'}(s) \ge v_\pi(s)

Policy Iteration

Policy iteration is an algorithm that combines policy evaluation and policy improvement to find the optimal policy. It alternates between evaluating the current policy and improving it, and continues until the policy no longer changes, which indicates that the optimal policy has been found.

The steps of policy iteration are:

  1. Policy Evaluation: Evaluate the current policy πk\pi_k to obtain vπk(s)v_{\pi_k}(s).
  2. Policy Improvement: Improve the policy by acting greedily with respect to the current value function, yielding a new policy πk+1\pi_{k+1}.
  3. Repeat steps 1 and 2 until the policy converges to the optimal policy π\pi^*.
Policy Iteration Algorithm
1.InitializationV(s)R and π(s)A(s) arbitrarily sS2.Policy EvaluationLoop:Δ0Loop for each sS:vV(s)V(s)s,rp(s,rs,π(s))[r+γV(s)]Δmax(Δ,vV(s))until Δ<θ3.Policy Improvementpolicy-statetrueLoop for each sS:old-actionπ(s)π(s)arg maxas,rp(s,rs,a)[r+γV(s)]If old-actionπ(s), then policy-stablefalseIf policy-stable, then stop and return Vvandππ;Else go to 2\begin{align*} 1. \quad&\textbf{Initialization}\\ &V(s) \in \mathbb{R} \text{ and } \pi(s) \in \mathcal{A}(s) \text{ arbitrarily } \forall s \in \mathcal{S} \\[1em] 2. \quad&\textbf{Policy Evaluation}\\ &\textbf{Loop:} \\ &\qquad \Delta \leftarrow 0\\ &\qquad \textbf{Loop for each } s \in \mathcal{S}: \\ &\qquad\qquad v \leftarrow V(s) \\ &\qquad\qquad V(s) \leftarrow \textstyle \sum_{s',r} p(s',r|s,\pi(s)) [r+\gamma V(s')] \\ &\qquad\qquad \Delta \leftarrow \max(\Delta, |v-V(s)|)\\ &\textbf{until } \Delta < \theta \\[1em] 3. \quad&\textbf{Policy Improvement}\\ &\textit{policy-state} \leftarrow \textit{true}\\ &\textbf{Loop for each } s \in \mathcal{S}: \\ &\qquad \textit{old-action} \leftarrow \pi(s) \\ &\qquad \pi(s) \leftarrow \textstyle \argmax_a \sum_{s',r} p(s',r|s,a) [r + \gamma V(s')] \\ &\qquad \textbf{If } \textit{old-action} \neq \pi(s), \textbf{ then } \textit{policy-stable} \leftarrow \textit{false}\\ &\textbf{If } \textit{policy-stable}, \textbf{ then } \text{stop and return } V \approx v_* \text{and} \pi \approx \pi_*;\\ &\textbf{Else go to 2} \end{align*}

Generalized Policy Iteration (GPI)

In practice, we don’t always have to run policy evaluation to completion before improving the policy. We can interleave evaluation and improvement steps more flexibly. This leads to the concept of Generalized Policy Iteration (GPI), where policy evaluation and improvement are performed simultaneously and incrementally.

Value Iteration

Value iteration is a special case of GPI where we combine policy evaluation and improvement into a single step. Instead of fully evaluating the policy at each step, we perform a single Bellman update for each state and immediately use it to improve the policy.

Asynchronous Dynamic Programming

In synchronous DP methods, we sweep through all states in a systematic order. In contrast, asynchronous DP methods update the values of states in any order, which can lead to faster convergence in certain situations. For example, we can prioritize updating states that are most relevant or have recently changed values.

Recap

1. What is the main goal of Policy Evaluation?

2. What is the result of Policy Improvement?

3. What does Policy Iteration alternate between?

4. What distinguishes Generalized Policy Iteration (GPI) from Policy Iteration?

5. How does Value Iteration differ from Policy Iteration?

What's Next

Up until now, the methods we looked into require a complete model of the environment's dynamics to compute optimal policies and value functions. However, what if the environment is too complex to model or the dynamics are unknown? In the next chapter, we'll explore Monte Carlo Methods, which overcome this limitation by learning directly from sampled episodes of experience.