Markov Decision Processes
In this chapter, you'll learn about:
 Markov Decision Processes (MDPs): A mathematical framework for modeling sequential decisionmaking.
 Policies and Value Functions: How agents decide on actions and evaluate states.
 Bellman Equations: Recursive relationships between value functions in MDPs.
 Optimality: Concepts of optimal policies and value functions.
In the previous chapter, we explored the karmed bandit problem, which introduced us to decisionmaking under uncertainty. However, the bandit problem is limited because it doesn't consider situations where different actions are optimal in different states, nor does it account for the longterm consequences of actions. MDPs provide a framework to model such sequential decisionmaking problems where actions influence not only immediate rewards but also future states and rewards.
Introduction to MDPs
In an MDP, the agent's interaction cycle involves observing the current state, taking an action, and receiving a reward based on the new state. This reward assesses how favorable the new state is. For example, in a video game, the score might serve as the reward, while for a running robot, the distance covered could determine the reward. The agent's objective is to maximize its cumulative reward by learning the optimal action for each situation through trial and error.
In an MDP, the interaction between the agent and the environment is modeled at discrete time steps, $t=0,1,2,3, \cdots$, and defined by the components:
 Set of States ($\mathcal{S}$): The different situations the agent can encounter.
 Set of Actions ($\mathcal{A}$): The possible actions the agent can take.
 Transition Function ($p(s's,a)$): This defines the probability of transitioning to a new state $s'$ given the current state $s$ and action $a$. Also called the environment dynamics.
 Reward Function ($r(s,a,s')$): This assigns a reward based on the transition from state $s$ to state $s'$ via action $a$.
 Start State ($s_0$): The initial state or distribution of states where the agent begins.
 Horizon ($H$): The length of time over which the agent will operate and make decisions.
The process of repeated agentenvironment interactions generates a trajectory of experience:
$S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots, S_{H1}, A_{H1}, R_H$An MDP is a formalism that describes an environment in which all states and actions are Markov, meaning the future is independent of the past given the present state. It encapsulates the dynamics of the environment and the agent's interaction with it over discrete time steps.
The future state and reward depend only on the current state and action, not on past states or actions. This property implies that the current state captures all relevant information from the history.
Consider a robot that collects empty soda cans in an office environment.
 States: Battery charge level: High or Low.
 Actions:
 Search: Look for cans.
 Wait: Stay stationary and wait for someone to bring a can.
 Recharge: Go to the charging station (only allowed in Low state).
 Rewards:
 Search: Reward $r_{\text{search}}$, e.g., +10.
 Wait: Reward $r_{\text{wait}}$, e.g., +1.
 Recharge: Reward 0.
 Rescue Needed: If the robot runs out of power, negative reward $r_{\text{rescue}}$, e.g., 20.
Transition Dynamics:
 From High State:
 Search:
 Remains in High with probability $\alpha$.
 Transitions to Low with probability $1  \alpha$.
 Wait: Remains in High.
 Search:
 From Low State:
 Search:
 Battery depletes (needs rescue) with probability $1  \beta$.
 Transition to High (after rescue).
 Reward: $r_{\text{rescue}}$ (negative).
 Remains in Low with probability $\beta$.
 Reward: $r_{\text{search}}$.
 Battery depletes (needs rescue) with probability $1  \beta$.
 Recharge: Transitions to High.
 Reward: 0.
 Search:
Returns and Episodes
The goal in reinforcement learning is to maximize the expected cumulative reward, also known as the return.
Episodic Tasks
 Definition: Tasks that naturally break into episodes, which are sequences that end in a terminal state.
 Examples:
 Games: Each game is an episode that starts and ends independently of others.
 Navigation Tasks: Reaching a destination concludes an episode.
In episodic tasks, the interaction breaks into episodes ending at time $T$. The return is defined as:
$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T = \sum_{k=0}^{T  t  1} R_{t + k + 1}$Continuing Tasks
 Definition: Tasks where the agentenvironment interaction continues indefinitely without a terminal state.
 Examples:
 Job Scheduling: Jobs arrive continuously, and the agent must make decisions at each time step.
 Stock Trading: The market operates continuously, and trading decisions are made over an indefinite period.
In continuing tasks, where the interaction goes on indefinitely, we introduce a discount factor $\gamma$ ($0 \leq \gamma < 1$) to ensure the cumulative reward is finite:
$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ Effect of $\gamma$:
 Shortsighted ($\gamma \approx 0$): Agent focuses on immediate rewards.
 Farsighted ($\gamma \approx 1$): Future rewards are weighted the same as immediate rewards.
The return can be defined recursively:
$G_t = R_{t+1} + \gamma G_{t+1}$This property is fundamental in deriving many reinforcement learning algorithms.
Throughout the course, we'll use a canonical example called Grid World to build intuition about MDPs and RL algorithms. In grid world, an agent navigates a grid to reach a goal while avoiding obstacles and pitfalls, with rewards assigned to reaching specific squares.
In our grid world, the agent in this world can move in four directions, if it tries to move out of the grid or into obstacles, it stays in the same cell. If it lands on a minusone square, it receives a reward of 1 and the episode ends; if it lands on a plusone square, it receives a reward of +1 and the episode ends.
Policies and Value Functions
A policy defines the agent's behavior at each state. It is a mapping from states to actions.

Deterministic Policy ($\pi$): Maps each state to a specific action.
$\pi(s) = a$ 
Stochastic Policy ($\pi(as)$): Assigns probabilities to each action in a state.
$\pi(as) \doteq \text{Pr}(A_t = a  S_t = s)$
A deterministic policy for our Grid World is:
Value functions estimate how good it is for an agent to be in a given state (or stateaction pair), considering future rewards.

StateValue Function ($v_\pi(s)$): Expected return starting from state $s$ and following policy $\pi$.
$v_\pi(s) \doteq \mathbb{E}_\pi [G_t  S_t = s] = \mathbb{E}_\pi \bigg[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \big S_t = s \bigg], \forall s \in \mathcal{S}$ 
ActionValue Function ($q_\pi(s, a)$): Expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$.
$q_\pi(s, a) \doteq \mathbb{E}_\pi [G_t  S_t = s, A_t = a] = \mathbb{E}_\pi \bigg[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \big S_t = s, A_t = a \bigg]$
Value functions allow agents to evaluate the desirability of states and actions without waiting for longterm outcomes.
Let's imagine three different scenarios in our Grid World:

A deterministic environment with no discounting ($\gamma$ = 1) and a horizon $H$ of 100 steps, we can calculate the value function as follows:
 $V^*(4,3)$: The agent immediately collects the +1 reward.
 $V^*(3,3)$: The agent moves right to collect the +1 reward in one step.
 $V^*(2,3)$: The agent moves right twice to collect the reward, still yielding a value of +1 due to no discounting.
 $V^*(1,1)$: It takes five steps to reach the reward, but still has a value of +1.
 $V^*(4,2)$: The agent is stuck in the bomb and receives a reward of 1.
1.001.001.001.001.000.001.001.001.001.001.001.00

Introducing a discount factor gamma of 0.9, where actions are still deterministic:
 $V^*(4,3)$: The value remains 1 because the reward is immediate.
 $V^*(3,3)$: The reward is delayed by one step, so the value is $0.9^1 \times 1 = 0.9$.
 $V^*(2,3)$: The value is delayed by two steps, so the value is $0.9^2 \times 1 = 0.81$.
 $V^*(1,1)$: The value is $0.9^5 \times 1 = 0.59$.
 $V^*(4,2)$: The value remains 1 due to the bomb.
0.730.810.901.000.660.000.811.000.590.660.730.66

Adding Stochasticity, where the action success probability drops to 80%:
 $V^*(4,3)$: The value remains 1 since grabbing the prize is certain.
 $V^*(3,3)$: Moving right has an 80% success rate. If successful, the value is $0.9 \times 1$. With a 20% chance, the agent may stay in place (trying to move up), move down or move left, affecting the value recursively based on neighboring squares.
This recursive dependency on neighboring values is captured by the Bellman Equation, which expresses the value of a state in terms of expected rewards and the values of successor states.
Bellman Equations
The Bellman equation expresses the relationship between the value of a state and the values of its successor states.
$v_\pi(s) = \sum_{a} \pi(as) \sum_{s', r} p(s', r  s, a) \left[ r + \gamma v_\pi(s') \right]$ The value of state $s$ is the expected return from taking action $a$, receiving reward $r$, and transitioning to state $s'$, plus the discounted value of $s'$.
Similarly, for the actionvalue function:
$q_\pi(s, a) = \sum_{s', r} p(s', r  s, a) \left[ r + \gamma \sum_{a'} \pi(a's') q_\pi(s', a') \right]$Optimality in MDPs
An optimal policy $\pi^*$ yields the highest value for all states compared to all other policies.

Definition:
$v_{\pi^*}(s) \geq v_\pi(s), \quad \forall s \in \mathcal{S}, \forall \pi$ 
There may be multiple optimal policies that share the same optimal value function.

Optimal StateValue Function ($v_*(s)$):
$v_*(s) = \max_\pi v_\pi(s)$ 
Optimal ActionValue Function ($q_*(s, a)$):
$q_*(s, a) = \max_\pi q_\pi(s, a)$
Bellman Optimality Equations
These equations define the optimal value functions recursively.

For $v_*(s)$:
$v_*(s) = \max_{a} \sum_{s', r} p(s', r  s, a) \left[ r + \gamma v_*(s') \right]$ 
For $q_*(s, a)$:
$q_*(s, a) = \sum_{s', r} p(s', r  s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]$
Once we have $v_*(s)$ or $q_*(s, a)$, we can derive an optimal policy:

From $v_*(s)$:
$\pi^*(s) = \arg\max_{a} \sum_{s', r} P(s', r  s, a) \left[ r + \gamma v_*(s') \right]$ 
From $q_*(s, a)$:
$\pi^*(s) = \arg\max_{a} q_*(s, a)$
Recap
In this chapter, we expanded RL by introducing the Markov Decision Process (MDP) framework. We've covered:
 MDPs capture sequential decisionmaking where actions influence future states and rewards.
 Components of MDPs:
 States ($S$): The situations the agent can be in.
 Actions ($A$): The choices available to the agent.
 Transition Dynamics ($P$): Probabilities of moving between states and receiving rewards, encapsulating the environment's dynamics.
 The Goal: Maximize expected cumulative reward (return), considering both immediate and future rewards.
 Episodic vs. Continuing Tasks:
 Episodic Tasks: Break into episodes with terminal states.
 Continuing Tasks: Ongoing interaction, often modeled with discounting.
 Policies: Define the agent's behavior by mapping states to actions.
 Value Functions: Quantify the expected return from states or stateaction pairs under a policy.
 Bellman Equations: Fundamental recursive relationships that express value functions in terms of expected immediate rewards and future values.
 Optimality: Optimal policies maximize the expected return from every state, and optimal value functions provide the maximum expected return achievable.