Markov Decision Processes
In this chapter, you'll learn about:
- Markov Decision Processes (MDPs): A mathematical framework for modeling sequential decision-making.
- Policies and Value Functions: How agents decide on actions and evaluate states.
- Bellman Equations: Recursive relationships between value functions in MDPs.
- Optimality: Concepts of optimal policies and value functions.
In the previous chapter, we explored the k-armed bandit problem, which introduced us to decision-making under uncertainty. However, the bandit problem is limited because it doesn't consider situations where different actions are optimal in different states, nor does it account for the long-term consequences of actions. MDPs provide a framework to model such sequential decision-making problems where actions influence not only immediate rewards but also future states and rewards.
Introduction to MDPs
In an MDP, the agent's interaction cycle involves observing the current state, taking an action, and receiving a reward based on the new state. This reward assesses how favorable the new state is. For example, in a video game, the score might serve as the reward, while for a running robot, the distance covered could determine the reward. The agent's objective is to maximize its cumulative reward by learning the optimal action for each situation through trial and error.
In an MDP, the interaction between the agent and the environment is modeled at discrete time steps, $t = 0, 1, 2, \ldots$, and defined by the following components:
- Set of States ($\mathcal{S}$): The different situations the agent can encounter.
- Set of Actions ($\mathcal{A}$): The possible actions the agent can take.
- Transition Function ($P(s' \mid s, a)$): This defines the probability of transitioning to a new state $s'$ given the current state $s$ and action $a$. Also called the environment dynamics.
- Reward Function ($R(s, a, s')$): This assigns a reward based on the transition from state $s$ to state $s'$ via action $a$.
- Start State ($s_0$): The initial state, or distribution over states, where the agent begins.
- Horizon ($H$): The length of time over which the agent will operate and make decisions.
The process of repeated agent-environment interactions generates a trajectory of experience:
$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$$
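To make this loop concrete, here is a minimal sketch of the agent-environment interaction in Python. The `env` and `agent` objects and their method names (`reset`, `step`, `act`) are illustrative assumptions, not a fixed API.

```python
# A minimal sketch of the agent-environment loop that produces such a
# trajectory. The `env` and `agent` objects and their method names
# (reset, step, act) are illustrative assumptions.
def rollout(env, agent, horizon):
    trajectory = []
    state = env.reset()                           # observe the start state S_0
    for t in range(horizon):
        action = agent.act(state)                 # A_t, chosen by the agent's policy
        next_state, reward, done = env.step(action)   # environment dynamics
        trajectory.append((state, action, reward))    # record (S_t, A_t, R_{t+1})
        state = next_state
        if done:                                  # terminal state reached (episodic task)
            break
    return trajectory
```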
An MDP is a formalism that describes an environment in which the state satisfies the Markov property: the future is independent of the past given the present state. It encapsulates the dynamics of the environment and the agent's interaction with it over discrete time steps.
The next state and reward depend only on the current state and action, not on past states or actions. This property implies that the current state captures all relevant information from the history:
$$\Pr(S_{t+1} = s', R_{t+1} = r \mid S_t, A_t) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_0, A_0, R_1, \ldots, S_t, A_t)$$
Consider a robot that collects empty soda cans in an office environment.
- States: Battery charge level: High or Low.
- Actions:
- Search: Look for cans.
- Wait: Stay stationary and wait for someone to bring a can.
- Recharge: Go to the charging station (only allowed in Low state).
- Rewards:
- Search: Reward $r_{\text{search}}$, e.g., +10.
- Wait: Reward $r_{\text{wait}}$, e.g., +1.
- Recharge: Reward 0.
- Rescue Needed: If the robot runs out of power, it receives a negative reward $r_{\text{rescue}}$, e.g., -20.
Transition Dynamics:
- From High State:
  - Search:
    - Remains in High with probability $\alpha$.
    - Transitions to Low with probability $1 - \alpha$.
    - Reward: $r_{\text{search}}$.
  - Wait: Remains in High. Reward: $r_{\text{wait}}$.
- From Low State:
  - Search:
    - Remains in Low with probability $\beta$. Reward: $r_{\text{search}}$.
    - Battery depletes (needs rescue) with probability $1 - \beta$, and the robot transitions to High (after rescue). Reward: $r_{\text{rescue}}$ (negative).
  - Wait: Remains in Low. Reward: $r_{\text{wait}}$.
  - Recharge: Transitions to High. Reward: 0.
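To see how these components fit together, here is a sketch of the recycling robot written out as explicit MDP tables. The probabilities $\alpha$, $\beta$ and the numeric reward values below are illustrative assumptions, not values fixed by the example.

```python
# A sketch of the recycling robot as explicit MDP tables. The probabilities
# alpha, beta and the numeric rewards are illustrative assumptions.
ALPHA, BETA = 0.9, 0.6          # P(stay High | High, search), P(stay Low | Low, search)
R_SEARCH, R_WAIT, R_RESCUE = 10.0, 1.0, -20.0

# P[state][action] -> list of (probability, next_state, reward)
P = {
    "high": {
        "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        "wait":   [(1.0, "high", R_WAIT)],
    },
    "low": {
        "search":   [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
        "wait":     [(1.0, "low", R_WAIT)],
        "recharge": [(1.0, "high", 0.0)],
    },
}

# Sanity check: outgoing probabilities sum to 1 for every (state, action) pair.
for s, actions in P.items():
    for a, outcomes in actions.items():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```

Explicit tables like this are only feasible for tiny state spaces, but they make the Bellman calculations later in the chapter easy to write down.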
Returns and Episodes
The goal in reinforcement learning is to maximize the expected cumulative reward, also known as the return.
Episodic Tasks
- Definition: Tasks that naturally break into episodes, which are sequences that end in a terminal state.
- Examples:
- Games: Each game is an episode that starts and ends independently of others.
- Navigation Tasks: Reaching a destination concludes an episode.
In episodic tasks, the interaction breaks into episodes ending at a final time step $T$. The return is defined as:
$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T$$
Continuing Tasks
- Definition: Tasks where the agent-environment interaction continues indefinitely without a terminal state.
- Examples:
- Job Scheduling: Jobs arrive continuously, and the agent must make decisions at each time step.
- Stock Trading: The market operates continuously, and trading decisions are made over an indefinite period.
In continuing tasks, where the interaction goes on indefinitely, we introduce a discount factor $\gamma \in [0, 1)$ to ensure the cumulative reward is finite:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
- Effect of $\gamma$:
  - Short-sighted ($\gamma = 0$): Agent focuses only on immediate rewards.
  - Far-sighted ($\gamma \to 1$): Future rewards are weighted nearly the same as immediate rewards.
The return can also be defined recursively:
$$G_t = R_{t+1} + \gamma G_{t+1}$$
This property is fundamental in deriving many reinforcement learning algorithms.
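As a quick check of this recursion, here is a small sketch (the reward sequence and $\gamma$ are arbitrary choices) that computes the return both as an explicit discounted sum and via $G_t = R_{t+1} + \gamma G_{t+1}$:

```python
# Compute the return for a finite reward sequence, first as an explicit
# discounted sum and then via the recursion G_t = R_{t+1} + gamma * G_{t+1}.
# The reward values and gamma below are arbitrary illustrative choices.
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 10.0]          # R_1, R_2, R_3, R_4

# Direct definition: G_0 = sum_k gamma^k * R_{k+1}
g0_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive definition, computed backwards from the end of the episode.
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g                    # G_t = R_{t+1} + gamma * G_{t+1}

print(g0_direct, g)                      # both are (approximately) 8.29
```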
Throughout the course, we'll use a canonical example called Grid World to build intuition about MDPs and RL algorithms. In grid world, an agent navigates a grid to reach a goal while avoiding obstacles and pitfalls, with rewards assigned to reaching specific squares.
In our grid world, the agent can move in four directions (up, down, left, right); if it tries to move off the grid or into an obstacle, it stays in the same cell. If it lands on the -1 square, it receives a reward of -1 and the episode ends; if it lands on the +1 square, it receives a reward of +1 and the episode ends.
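To make these dynamics concrete, here is a minimal sketch of such a grid world as a step function. The specific 4x3 layout and the positions of the obstacle, the +1 square, and the -1 square are illustrative assumptions; any layout works the same way.

```python
# A minimal Grid World step function. The 4x3 layout and the positions of the
# goal (+1), bomb (-1), and obstacle below are illustrative assumptions.
GOAL, BOMB, WALL = (3, 0), (3, 1), (1, 1)     # (col, row) positions
WIDTH, HEIGHT = 4, 3
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Apply one action; returns (next_state, reward, done)."""
    dx, dy = MOVES[action]
    nxt = (state[0] + dx, state[1] + dy)
    # Moving off the grid or into the obstacle leaves the agent in place.
    if not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT) or nxt == WALL:
        nxt = state
    if nxt == GOAL:
        return nxt, +1.0, True    # landing on the +1 square ends the episode
    if nxt == BOMB:
        return nxt, -1.0, True    # landing on the -1 square ends the episode
    return nxt, 0.0, False
```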
Policies and Value Functions
A policy defines the agent's behavior at each state. It is a mapping from states to actions.
- Deterministic Policy ($\pi(s) = a$): Maps each state to a specific action.
- Stochastic Policy ($\pi(a \mid s)$): Assigns probabilities to the actions available in a state.
An example of a deterministic policy for our Grid World is one that, in every non-terminal cell, chooses the movement direction leading toward the +1 square.
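As a minimal sketch (the state and action names below are hypothetical), a deterministic policy can be stored as a plain mapping from states to actions, while a stochastic policy stores a probability for each action:

```python
import random

# Hypothetical grid-cell states and action names for illustration only.
ACTIONS = ["up", "down", "left", "right"]

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {
    (0, 0): "right",
    (1, 0): "right",
    (2, 0): "right",
}

# Stochastic policy: each state maps to a distribution over actions.
stochastic_policy = {
    (0, 0): {"up": 0.1, "down": 0.1, "left": 0.1, "right": 0.7},
}

def act(policy, state):
    """Sample an action, handling both policy representations."""
    choice = policy[state]
    if isinstance(choice, dict):                  # stochastic: sample from pi(a|s)
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs, k=1)[0]
    return choice                                 # deterministic: pi(s) = a
```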
Value functions estimate how good it is for an agent to be in a given state (or state-action pair), considering future rewards.
- State-Value Function ($v_\pi(s)$): The expected return when starting from state $s$ and following policy $\pi$ thereafter: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.
- Action-Value Function ($q_\pi(s, a)$): The expected return when starting from state $s$, taking action $a$, and thereafter following policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$.
Value functions allow agents to evaluate the desirability of states and actions without waiting for long-term outcomes.
Let's imagine three different scenarios in our Grid World:
- In a deterministic environment with no discounting ($\gamma = 1$) and a horizon of 100 steps, we can calculate the value function as follows:
  - The square with the +1 prize: the agent immediately collects the +1 reward, so the value is +1.
  - One step away: the agent moves right to collect the +1 reward in one step, so the value is +1.
  - Two steps away: the agent moves right twice to collect the reward, still yielding a value of +1 due to no discounting.
  - Five steps away: it takes five steps to reach the reward, but the value is still +1.
  - The square with the bomb: the agent is stuck in the bomb and receives a reward of -1, so the value is -1.
The resulting state values, laid out to match the grid (terminal and blocked cells are shown as 0.00):
| 1.00 | 1.00 | 1.00 | 0.00 |
| 1.00 | 0.00 | 1.00 | 0.00 |
| 1.00 | 1.00 | 1.00 | 1.00 |
- Introducing a discount factor $\gamma = 0.9$, with actions still deterministic:
  - The square with the +1 prize: the value remains 1 because the reward is immediate.
  - One step away: the reward is delayed by one step, so the value is $0.9 \times 1 = 0.9$.
  - Two steps away: the reward is delayed by two steps, so the value is $0.9^2 = 0.81$.
  - Five steps away: the value is $0.9^5 \approx 0.59$.
  - The square with the bomb: the value remains -1.
The resulting state values with $\gamma = 0.9$ (terminal and blocked cells are shown as 0.00):
| 0.73 | 0.81 | 0.90 | 0.00 |
| 0.66 | 0.00 | 0.81 | 0.00 |
| 0.59 | 0.66 | 0.73 | 0.66 |
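These numbers are just powers of the discount factor; a quick sketch to reproduce them under the deterministic dynamics assumed in this scenario:

```python
# Reproduce the discounted values above: with deterministic moves, a state
# k steps from the +1 reward has value gamma**k.
gamma = 0.9
for steps in [0, 1, 2, 3, 4, 5]:
    print(steps, round(gamma**steps, 2))   # 1.0, 0.9, 0.81, 0.73, 0.66, 0.59
```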
- Adding stochasticity, where the action success probability drops to 80%:
  - The square with the +1 prize: the value remains 1, since grabbing the prize is certain.
  - One step away: moving right now succeeds only 80% of the time; when it succeeds, the discounted payoff is $0.9 \times 1 = 0.9$. With the remaining 20% probability, the agent may stay in place (e.g., when trying to move up into a wall), move down, or move left, so the value depends recursively on the values of the neighboring squares.
This recursive dependency on neighboring values is captured by the Bellman Equation, which expresses the value of a state in terms of expected rewards and the values of successor states.
Bellman Equations
The Bellman equation expresses the relationship between the value of a state and the values of its successor states:
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma \, v_\pi(s') \big]$$
- In words: the value of state $s$ is the expected reward $R(s, a, s')$ from taking action $a$ and transitioning to state $s'$, plus the discounted value of $s'$, averaged over the policy's action choices and the environment's transitions.
Similarly, for the action-value function:
$$q_\pi(s, a) = \sum_{s'} P(s' \mid s, a) \Big[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \Big]$$
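Because the Bellman equation is linear in the unknown values $v_\pi(s)$, for a small MDP we can evaluate a fixed policy by solving the resulting linear system directly. The sketch below does this for the recycling robot with the illustrative numbers assumed earlier and a fixed "always search" policy; it is an illustration of the equation, not a general algorithm (those come in the next chapter).

```python
import numpy as np

# Evaluate the policy "always search" on the recycling robot by solving the
# Bellman equations v = r_pi + gamma * P_pi v as a linear system.
# The numbers (alpha, beta, rewards, gamma) are illustrative assumptions.
alpha, beta, gamma = 0.9, 0.6, 0.9
r_search, r_rescue = 10.0, -20.0

states = ["high", "low"]
# P_pi[i][j] = probability of moving from state i to state j under the policy.
P_pi = np.array([[alpha, 1 - alpha],          # from high: stay high / drop to low
                 [1 - beta, beta]])           # from low: rescued to high / stay low
# r_pi[i] = expected immediate reward from state i under the policy.
r_pi = np.array([r_search,
                 beta * r_search + (1 - beta) * r_rescue])

# Bellman equation in matrix form: v = r_pi + gamma * P_pi @ v
# => (I - gamma * P_pi) v = r_pi
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
for s, value in zip(states, v):
    print(f"v_pi({s}) = {value:.2f}")
```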
Optimality in MDPs
An optimal policy yields the highest value for all states compared to all other policies.
- Definition: A policy $\pi_*$ is optimal if $v_{\pi_*}(s) \geq v_\pi(s)$ for every state $s$ and every policy $\pi$.
There may be multiple optimal policies, but they all share the same optimal value function.
- Optimal State-Value Function ($v_*$): $v_*(s) = \max_{\pi} v_\pi(s)$ for all $s$.
- Optimal Action-Value Function ($q_*$): $q_*(s, a) = \max_{\pi} q_\pi(s, a)$ for all $s$ and $a$.
Bellman Optimality Equations
These equations define the optimal value functions recursively.
- For $v_*$:
$$v_*(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma \, v_*(s') \big]$$
- For $q_*$:
$$q_*(s, a) = \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma \max_{a'} q_*(s', a') \big]$$
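As a sketch of what the max in these equations looks like in code, here is a single Bellman optimality backup for one state, reusing the tabular `P` format from the recycling-robot sketch above; the value estimates passed in `v` are hypothetical.

```python
# One Bellman optimality backup for a single state s:
# v_new(s) = max_a sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * v(s') ]
# P uses the same table format as the recycling-robot sketch above;
# v maps each state to a (hypothetical) current value estimate.
def optimality_backup(P, v, s, gamma=0.9):
    return max(
        sum(p * (r + gamma * v[s_next]) for p, s_next, r in outcomes)
        for outcomes in P[s].values()
    )
```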
Once we have $v_*$ or $q_*$, we can derive an optimal policy:
- From $v_*$: act greedily with respect to a one-step lookahead using the environment dynamics:
$$\pi_*(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma \, v_*(s') \big]$$
- From $q_*$: simply pick the action with the largest action value, with no model required:
$$\pi_*(s) = \arg\max_{a} q_*(s, a)$$
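A minimal sketch of both extraction rules, assuming the tabular `P` format used earlier, a dictionary `v` of state values, and a dictionary `q` of action values keyed by `(state, action)`; these data structures are assumptions for illustration.

```python
# Extract a greedy (optimal) policy, either from v* plus the model P,
# or directly from q*. Assumed formats: P[s][a] -> list of
# (prob, next_state, reward), v[s] -> state value, q[(s, a)] -> action value.
def greedy_from_v(P, v, s, gamma=0.9):
    def lookahead(a):
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
    return max(P[s], key=lookahead)               # argmax over available actions

def greedy_from_q(q, s, actions):
    return max(actions, key=lambda a: q[(s, a)])  # argmax_a q*(s, a)
```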
Recap
1. What is a Markov Decision Process (MDP)?
2. Which property defines the Markov assumption?
3. What does the transition function represent in an MDP?
4. How is the value of a state defined under a policy $\pi$?
5. What is the Bellman equation used for?
What's Next
In the next chapter, we introduce Dynamic Programming (DP), a set of methods for solving Markov Decision Processes (MDPs) when we have access to the environment's dynamics. You'll learn how to compute value functions, improve policies, and derive optimal policies using concepts like policy evaluation, policy improvement, and Generalized Policy Iteration (GPI).