
Markov Decision Processes


In this chapter, you'll learn about:

  • Markov Decision Processes (MDPs): A mathematical framework for modeling sequential decision-making.
  • Policies and Value Functions: How agents decide on actions and evaluate states.
  • Bellman Equations: Recursive relationships between value functions in MDPs.
  • Optimality: Concepts of optimal policies and value functions.

In the previous chapter, we explored the k-armed bandit problem, which introduced us to decision-making under uncertainty. However, the bandit problem is limited because it doesn't consider situations where different actions are optimal in different states, nor does it account for the long-term consequences of actions. MDPs provide a framework to model such sequential decision-making problems where actions influence not only immediate rewards but also future states and rewards.

Introduction to MDPs

In an MDP, the agent's interaction cycle involves observing the current state, taking an action, and receiving a reward based on the new state. This reward assesses how favorable the new state is. For example, in a video game, the score might serve as the reward, while for a running robot, the distance covered could determine the reward. The agent's objective is to maximize its cumulative reward by learning the optimal action for each situation through trial and error.

In an MDP, the interaction between the agent and the environment is modeled at discrete time steps, $t = 0, 1, 2, 3, \dots$, and defined by the following components:

  1. Set of States ($\mathcal{S}$): The different situations the agent can encounter.
  2. Set of Actions ($\mathcal{A}$): The possible actions the agent can take.
  3. Transition Function ($p(s'|s,a)$): This defines the probability of transitioning to a new state $s'$ given the current state $s$ and action $a$. Also called the environment dynamics.

     $$p(s'|s,a) \doteq \text{Pr}(S_t=s'|S_{t-1}=s,A_{t-1}=a)$$

  4. Reward Function ($r(s,a,s')$): This assigns the expected reward for the transition from state $s$ to state $s'$ via action $a$.

     $$r(s,a,s') \doteq \mathbb{E}[R_t|S_{t-1}=s,A_{t-1}=a,S_t=s']$$

  5. Start State ($s_0$): The initial state, or distribution over start states, where the agent begins.
  6. Horizon ($H$): The length of time over which the agent will operate and make decisions.

The process of repeated agent-environment interactions generates a trajectory of experience:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots, S_{H-1}, A_{H-1}, R_H$$
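To make this loop concrete, here is a minimal sketch in Python. The `env` object (with `reset` and `step` methods) and the `policy` function are illustrative assumptions, not a fixed API from the MDP definition itself.

```python
# Minimal sketch of the agent-environment loop that generates the trajectory
# S0, A0, R1, S1, A1, R2, ... over a horizon H.

def rollout(env, policy, horizon):
    trajectory = []              # will hold (S_t, A_t, R_{t+1}) tuples
    state = env.reset()          # sample the start state S_0
    for t in range(horizon):
        action = policy(state)                  # choose A_t based on S_t
        next_state, reward = env.step(action)   # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state                      # only the current state is carried forward
    return trajectory
```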

An MDP is a formalism that describes an environment in which all states and actions are Markov, meaning the future is independent of the past given the present state. It encapsulates the dynamics of the environment and the agent's interaction with it over discrete time steps.

Markov Property

The future state and reward depend only on the current state and action, not on past states or actions. This property implies that the current state captures all relevant information from the history.

Example: Recycling Robot

Consider a robot that collects empty soda cans in an office environment.

  • States: Battery charge level: High or Low.
  • Actions:
    • Search: Look for cans.
    • Wait: Stay stationary and wait for someone to bring a can.
    • Recharge: Go to the charging station (only allowed in Low state).
  • Rewards:
    • Search: Reward $r_{\text{search}}$, e.g., +10.
    • Wait: Reward $r_{\text{wait}}$, e.g., +1.
    • Recharge: Reward 0.
    • Rescue Needed: If the robot runs out of power, negative reward $r_{\text{rescue}}$, e.g., -20.

Transition Dynamics (see the code sketch after this list):

  • From High State:
    • Search:
      • Remains in High with probability $\alpha$.
      • Transitions to Low with probability $1 - \alpha$.
    • Wait: Remains in High.
  • From Low State:
    • Search:
      • Battery depletes (needs rescue) with probability $1 - \beta$.
        • Transition to High (after rescue).
        • Reward: $r_{\text{rescue}}$ (negative).
      • Remains in Low with probability $\beta$.
        • Reward: $r_{\text{search}}$.
    • Recharge: Transitions to High.
      • Reward: 0.
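These components are enough to write the robot's MDP down as plain data. The sketch below is one possible encoding in Python, mapping each (state, action) pair to a list of (probability, next state, reward) outcomes. The numeric values of $\alpha$, $\beta$, and the rewards are the illustrative numbers from above, and the Wait action in the Low state (which keeps the battery Low) is an added assumption.

```python
# Recycling-robot dynamics as a table:
# (state, action) -> [(probability, next_state, reward), ...]
# alpha, beta, and the reward values are illustrative assumptions.
ALPHA, BETA = 0.9, 0.8
R_SEARCH, R_WAIT, R_RESCUE = 10.0, 1.0, -20.0

P = {
    ("high", "search"):  [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"):    [(1.0, "high", R_WAIT)],
    ("low", "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
    ("low", "wait"):     [(1.0, "low", R_WAIT)],    # assumed: waiting keeps the battery low
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

# Sanity check: outcome probabilities sum to 1 for every (state, action) pair.
for (s, a), outcomes in P.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9
```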

Returns and Episodes

Goal in RL

The goal in reinforcement learning is to maximize the expected cumulative reward, also known as the return.

Episodic Tasks

  • Definition: Tasks that naturally break into episodes, which are sequences that end in a terminal state.
  • Examples:
    • Games: Each game is an episode that starts and ends independently of others.
    • Navigation Tasks: Reaching a destination concludes an episode.

In episodic tasks, the interaction breaks into episodes ending at time $T$. The return is defined as:

$$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T = \sum_{k=0}^{T - t - 1} R_{t + k + 1}$$

Continuing Tasks

  • Definition: Tasks where the agent-environment interaction continues indefinitely without a terminal state.
  • Examples:
    • Job Scheduling: Jobs arrive continuously, and the agent must make decisions at each time step.
    • Stock Trading: The market operates continuously, and trading decisions are made over an indefinite period.

In continuing tasks, where the interaction goes on indefinitely, we introduce a discount factor $\gamma$ ($0 \leq \gamma < 1$) to ensure the cumulative reward is finite:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

  • Effect of $\gamma$:
    • Short-sighted ($\gamma \approx 0$): The agent focuses on immediate rewards.
    • Far-sighted ($\gamma \approx 1$): Future rewards are weighted nearly as much as immediate rewards.

Recursive Property of Return

The return can be defined recursively:

$$G_t = R_{t+1} + \gamma G_{t+1}$$

This property is fundamental in deriving many reinforcement learning algorithms.
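The recursion is easy to verify numerically. The short sketch below computes the discounted return of an arbitrary reward sequence both directly from the sum and by sweeping backwards with $G_t = R_{t+1} + \gamma G_{t+1}$; the two agree.

```python
# Discounted return computed two ways; the reward sequence is arbitrary.
rewards = [1.0, 0.0, -2.0, 5.0]   # R_1, R_2, R_3, R_4
gamma = 0.9

# Direct definition: G_0 = sum_k gamma^k * R_{k+1}
g0_direct = sum(gamma ** k * r for k, r in enumerate(rewards))

# Recursive definition, swept backwards from the end: G_t = R_{t+1} + gamma * G_{t+1}
returns = [0.0] * (len(rewards) + 1)   # G_T = 0 beyond the last reward
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]

assert abs(g0_direct - returns[0]) < 1e-12
print(returns[0])   # 1 + 0.9*0 + 0.81*(-2) + 0.729*5 = 3.025
```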

Grid World

Throughout the course, we'll use a canonical example called Grid World to build intuition about MDPs and RL algorithms. In grid world, an agent navigates a grid to reach a goal while avoiding obstacles and pitfalls, with rewards assigned to reaching specific squares.

In our Grid World, the agent can move in four directions. If it tries to move out of the grid or into an obstacle, it stays in the same cell. If it lands on the minus-one square, it receives a reward of -1 and the episode ends; if it lands on the plus-one square, it receives a reward of +1 and the episode ends.

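A minimal sketch of these rules as a step function is shown below. The 4x3 layout, the wall at cell (2, 2), and the terminal squares at (4, 3) (the +1) and (4, 2) (the -1) are assumptions matching the grid used in the value calculations later in this chapter; cells are addressed as (column, row) with (1, 1) at the bottom-left.

```python
# Deterministic Grid World dynamics as described above (4 columns x 3 rows).
# The wall at (2, 2) and the terminal squares at (4, 3) and (4, 2) are assumptions.
WALLS = {(2, 2)}
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Return (next_state, reward, done) for one deterministic move."""
    col, row = state
    d_col, d_row = MOVES[action]
    candidate = (col + d_col, row + d_row)
    # Moving off the grid or into a wall leaves the agent in the same cell.
    if not (1 <= candidate[0] <= 4 and 1 <= candidate[1] <= 3) or candidate in WALLS:
        candidate = state
    reward = TERMINALS.get(candidate, 0.0)
    done = candidate in TERMINALS
    return candidate, reward, done

# Example: moving right from (3, 3) reaches the +1 square and ends the episode.
assert step((3, 3), "right") == ((4, 3), 1.0, True)
```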

Policies and Value Functions

A policy defines the agent's behavior at each state. It is a mapping from states to actions.

  • Deterministic Policy ($\pi$): Maps each state to a specific action.

    $$\pi(s) = a$$

  • Stochastic Policy ($\pi(a|s)$): Assigns probabilities to each action in a state.

    $$\pi(a|s) \doteq \text{Pr}(A_t = a | S_t = s)$$

Grid World

A deterministic policy for our Grid World assigns a single action (a movement direction) to each non-terminal cell.

Value functions estimate how good it is for an agent to be in a given state (or state-action pair), considering future rewards.

  • State-Value Function ($v_\pi(s)$): Expected return starting from state $s$ and following policy $\pi$.

    $$v_\pi(s) \doteq \mathbb{E}_\pi [G_t | S_t = s] = \mathbb{E}_\pi \bigg[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \big| S_t = s \bigg], \forall s \in \mathcal{S}$$

  • Action-Value Function ($q_\pi(s, a)$): Expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$.

    $$q_\pi(s, a) \doteq \mathbb{E}_\pi [G_t | S_t = s, A_t = a] = \mathbb{E}_\pi \bigg[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \big| S_t = s, A_t = a \bigg]$$

Value functions allow agents to evaluate the desirability of states and actions without waiting for long-term outcomes.
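Because both value functions are expectations of the return, one conceptually simple way to estimate them is to average sampled returns. The sketch below is illustrative only: the tiny `policy` table and the `rollout_return` argument (any function that runs one episode from a state under a policy and returns its discounted return $G$) are hypothetical placeholders.

```python
import random

# A stochastic policy represented as action probabilities per state.
# The state and action names here are placeholders for illustration.
policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(policy, state):
    """Draw A_t ~ pi(.|S_t) from the stochastic policy."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

def estimate_v(state, policy, rollout_return, n_episodes=1000):
    """Monte Carlo estimate of v_pi(state): average the return over many episodes.
    `rollout_return(state, policy)` is a hypothetical helper that runs one episode
    starting from `state` while following `policy` and returns its discounted return."""
    total = sum(rollout_return(state, policy) for _ in range(n_episodes))
    return total / n_episodes
```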

Grid World

Let's imagine three different scenarios in our Grid World:

  1. In a deterministic environment with no discounting ($\gamma = 1$) and a horizon $H$ of 100 steps, we can calculate the value function as follows:

    • $V^*(4,3)$: The agent immediately collects the +1 reward.
    • $V^*(3,3)$: The agent moves right to collect the +1 reward in one step.
    • $V^*(2,3)$: The agent moves right twice to collect the reward, still yielding a value of +1 due to no discounting.
    • $V^*(1,1)$: It takes five steps to reach the reward, but the value is still +1.
    • $V^*(4,2)$: The agent is stuck in the bomb and receives a reward of -1.

    [Value grid (rows from top to bottom): 1.00 1.00 1.00 0.00 | 1.00 0.00 1.00 0.00 | 1.00 1.00 1.00 1.00]


  2. Introducing a discount factor $\gamma$ of 0.9, where actions are still deterministic:

    • $V^*(4,3)$: The value remains 1 because the reward is immediate.
    • $V^*(3,3)$: The reward is delayed by one step, so the value is $0.9^1 \times 1 = 0.9$.
    • $V^*(2,3)$: The reward is delayed by two steps, so the value is $0.9^2 \times 1 = 0.81$.
    • $V^*(1,1)$: The reward is delayed by five steps, so the value is $0.9^5 \times 1 \approx 0.59$.
    • $V^*(4,2)$: The value remains -1 due to the bomb.

    [Value grid (rows from top to bottom): 0.73 0.81 0.90 0.00 | 0.66 0.00 0.81 0.00 | 0.59 0.66 0.73 0.66]


  3. Adding Stochasticity, where the action success probability drops to 80%:

    • $V^*(4,3)$: The value remains 1 since grabbing the prize is certain.
    • $V^*(3,3)$: Moving right has an 80% success rate. If successful, the value is $0.9 \times 1$. With a 20% chance, the agent may stay in place (trying to move up), move down, or move left, affecting the value recursively based on the values of neighboring squares.

This recursive dependency on neighboring values is captured by the Bellman Equation, which expresses the value of a state in terms of expected rewards and the values of successor states.

Bellman Equations

The Bellman equation expresses the relationship between the value of a state and the values of its successor states.

$$v_\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_\pi(s') \right]$$

  • The value of state $s$ is the expectation, over actions $a$ chosen by $\pi$ and transitions to $(s', r)$, of the immediate reward $r$ plus the discounted value of the successor state $s'$.

Similarly, for the action-value function:

$$q_\pi(s, a) = \sum_{s', r} p(s', r | s, a) \left[ r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s', a') \right]$$
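Since $v_\pi$ appears on both sides, the Bellman equation is a set of simultaneous linear equations, one per state. A common way to solve it, previewed here and developed fully in the next chapter, is to apply the right-hand side repeatedly as an update until the values stop changing. The sketch below assumes the (state, action) -> [(probability, next state, reward), ...] dynamics table and the {action: probability} policy format used in the earlier sketches.

```python
# Iterative evaluation of v_pi, using the Bellman equation as an update rule.
# `dynamics` maps (state, action) -> [(prob, next_state, reward), ...] and
# `policy` maps state -> {action: probability}, as in the earlier sketches.

def evaluate_policy(states, dynamics, policy, gamma=0.9, tol=1e-8):
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                prob_a * sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])
                for a, prob_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# Example: evaluate an "always search" policy on the recycling-robot table P.
# v = evaluate_policy(["high", "low"], P,
#                     {"high": {"search": 1.0}, "low": {"search": 1.0}})
```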

Optimality in MDPs

An optimal policy $\pi^*$ yields the highest value for all states compared to all other policies.

  • Definition:

    $$v_{\pi^*}(s) \geq v_\pi(s), \quad \forall s \in \mathcal{S}, \forall \pi$$

    There may be multiple optimal policies that share the same optimal value function.

  • Optimal State-Value Function ($v_*(s)$):

    $$v_*(s) = \max_\pi v_\pi(s)$$

  • Optimal Action-Value Function ($q_*(s, a)$):

    $$q_*(s, a) = \max_\pi q_\pi(s, a)$$

Bellman Optimality Equations

These equations define the optimal value functions recursively.

  • For $v_*(s)$:

    $$v_*(s) = \max_{a} \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_*(s') \right]$$

  • For $q_*(s, a)$:

    $$q_*(s, a) = \sum_{s', r} p(s', r | s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]$$

Once we have $v_*(s)$ or $q_*(s, a)$, we can derive an optimal policy (sketched in code after the two cases below):

  • From $v_*(s)$:

    $$\pi^*(s) = \arg\max_{a} \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_*(s') \right]$$

  • From $q_*(s, a)$:

    $$\pi^*(s) = \arg\max_{a} q_*(s, a)$$
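The two argmax expressions translate directly into code. A minimal sketch, again assuming the (state, action) -> [(probability, next state, reward), ...] dynamics format used earlier and an `actions(s)` helper listing the actions available in state $s$:

```python
# Deriving a greedy (optimal) policy once v* or q* is available.

def greedy_from_v(states, actions, dynamics, v_star, gamma=0.9):
    """One-step lookahead: pi*(s) = argmax_a sum_{s', r} p(s', r | s, a) [r + gamma v*(s')]."""
    policy = {}
    for s in states:
        q = {a: sum(p * (r + gamma * v_star[s2]) for p, s2, r in dynamics[(s, a)])
             for a in actions(s)}
        policy[s] = max(q, key=q.get)
    return policy

def greedy_from_q(states, actions, q_star):
    """Read the optimal action directly off q*: pi*(s) = argmax_a q*(s, a)."""
    return {s: max(actions(s), key=lambda a: q_star[(s, a)]) for s in states}
```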

Recap

1. What is a Markov Decision Process (MDP)?

2. Which property defines the Markov assumption?

3. What does the transition function $p(s'|s,a)$ represent in an MDP?

4. How is the value of a state $v_\pi(s)$ defined under policy $\pi$?

5. What is the Bellman equation used for?

What's Next

In the next chapter, we introduce Dynamic Programming (DP), a set of methods for solving Markov Decision Processes (MDPs) when we have access to the environment's dynamics. You'll learn how to compute value functions, improve policies, and derive optimal policies using concepts like policy evaluation, policy improvement, and Generalized Policy Iteration (GPI).