
Markov Decision Processes


In this chapter, you'll learn about:

  • Markov Decision Processes (MDPs): A mathematical framework for modeling sequential decision-making.
  • Policies and Value Functions: How agents decide on actions and evaluate states.
  • Bellman Equations: Recursive relationships between value functions in MDPs.
  • Optimality: Concepts of optimal policies and value functions.

In the previous chapter, we explored the k-armed bandit problem, which introduced us to decision-making under uncertainty. However, the bandit problem is limited because it doesn't consider situations where different actions are optimal in different states, nor does it account for the long-term consequences of actions. MDPs provide a framework to model such sequential decision-making problems where actions influence not only immediate rewards but also future states and rewards.

Introduction to MDPs

In an MDP, the agent's interaction cycle involves observing the current state, taking an action, and receiving a reward based on the new state. This reward assesses how favorable the new state is. For example, in a video game, the score might serve as the reward, while for a running robot, the distance covered could determine the reward. The agent's objective is to maximize its cumulative reward by learning the optimal action for each situation through trial and error.

In an MDP, the interaction between the agent and the environment is modeled at discrete time steps, $t = 0, 1, 2, 3, \dots$, and defined by the following components:

  1. Set of States ($\mathcal{S}$): The different situations the agent can encounter.
  2. Set of Actions ($\mathcal{A}$): The possible actions the agent can take.
  3. Transition Function ($p(s'|s,a)$): This defines the probability of transitioning to a new state $s'$ given the current state $s$ and action $a$. Also called the environment dynamics.

     $$p(s'|s,a) \doteq \text{Pr}(S_t=s'|S_{t-1}=s,A_{t-1}=a)$$

  4. Reward Function ($r(s,a,s')$): This assigns the expected reward for the transition from state $s$ to state $s'$ via action $a$.

     $$r(s,a,s') \doteq \mathbb{E}[R_t|S_{t-1}=s,A_{t-1}=a,S_t=s']$$

  5. Start State ($s_0$): The initial state, or distribution over start states, where the agent begins.
  6. Horizon ($H$): The length of time over which the agent will operate and make decisions.

The process of repeated agent-environment interactions generates a trajectory of experience:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots, S_{H-1}, A_{H-1}, R_H$$
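To make this loop concrete, here is a minimal sketch in Python. The `env` object (with `reset` and `step` methods) and the `policy` function are illustrative assumptions, not a fixed API from the MDP definition itself.

```python
# Minimal sketch of the agent-environment loop that generates the trajectory
# S0, A0, R1, S1, A1, R2, ... over a horizon H.

def rollout(env, policy, horizon):
    trajectory = []              # will hold (S_t, A_t, R_{t+1}) tuples
    state = env.reset()          # sample the start state S_0
    for t in range(horizon):
        action = policy(state)                  # choose A_t based on S_t
        next_state, reward = env.step(action)   # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state                      # only the current state is carried forward
    return trajectory
```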

An MDP is a formalism that describes an environment in which all states and actions are Markov, meaning the future is independent of the past given the present state. It encapsulates the dynamics of the environment and the agent's interaction with it over discrete time steps.

Markov Property

The future state and reward depend only on the current state and action, not on past states or actions. This property implies that the current state captures all relevant information from the history.

Example: Recycling Robot

Consider a robot that collects empty soda cans in an office environment.

  • States: Battery charge level: High or Low.
  • Actions:
    • Search: Look for cans.
    • Wait: Stay stationary and wait for someone to bring a can.
    • Recharge: Go to the charging station (only allowed in Low state).
  • Rewards:
    • Search: Reward $r_{\text{search}}$, e.g., +10.
    • Wait: Reward $r_{\text{wait}}$, e.g., +1.
    • Recharge: Reward 0.
    • Rescue Needed: If the robot runs out of power, negative reward $r_{\text{rescue}}$, e.g., -20.

Transition Dynamics (see the code sketch after this list):

  • From High State:
    • Search:
      • Remains in High with probability $\alpha$.
      • Transitions to Low with probability $1 - \alpha$.
    • Wait: Remains in High.
  • From Low State:
    • Search:
      • Battery depletes (needs rescue) with probability $1 - \beta$.
        • Transition to High (after rescue).
        • Reward: $r_{\text{rescue}}$ (negative).
      • Remains in Low with probability $\beta$.
        • Reward: $r_{\text{search}}$.
    • Recharge: Transitions to High.
      • Reward: 0.
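These components are enough to write the robot's MDP down as plain data. The sketch below is one possible encoding in Python, mapping each (state, action) pair to a list of (probability, next state, reward) outcomes. The numeric values of $\alpha$, $\beta$, and the rewards are the illustrative numbers from above, and the Wait action in the Low state (which keeps the battery Low) is an added assumption.

```python
# Recycling-robot dynamics as a table:
# (state, action) -> [(probability, next_state, reward), ...]
# alpha, beta, and the reward values are illustrative assumptions.
ALPHA, BETA = 0.9, 0.8
R_SEARCH, R_WAIT, R_RESCUE = 10.0, 1.0, -20.0

P = {
    ("high", "search"):  [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"):    [(1.0, "high", R_WAIT)],
    ("low", "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
    ("low", "wait"):     [(1.0, "low", R_WAIT)],    # assumed: waiting keeps the battery low
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

# Sanity check: outcome probabilities sum to 1 for every (state, action) pair.
for (s, a), outcomes in P.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9
```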

Returns and Episodes

Goal in RL

The goal in reinforcement learning is to maximize the expected cumulative reward, also known as the return.

Episodic Tasks

  • Definition: Tasks that naturally break into episodes, which are sequences that end in a terminal state.
  • Examples:
    • Games: Each game is an episode that starts and ends independently of others.
    • Navigation Tasks: Reaching a destination concludes an episode.

In episodic tasks, the interaction breaks into episodes ending at time $T$. The return is defined as:

$$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T = \sum_{k=0}^{T - t - 1} R_{t + k + 1}$$

Continuing Tasks

  • Definition: Tasks where the agent-environment interaction continues indefinitely without a terminal state.
  • Examples:
    • Job Scheduling: Jobs arrive continuously, and the agent must make decisions at each time step.
    • Stock Trading: The market operates continuously, and trading decisions are made over an indefinite period.

In continuing tasks, where the interaction goes on indefinitely, we introduce a discount factor $\gamma$ ($0 \leq \gamma < 1$) to ensure the cumulative reward is finite:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

  • Effect of $\gamma$:
    • Short-sighted ($\gamma \approx 0$): The agent focuses on immediate rewards.
    • Far-sighted ($\gamma \approx 1$): Future rewards are weighted nearly as much as immediate rewards.

Recursive Property of Return

The return can be defined recursively:

$$G_t = R_{t+1} + \gamma G_{t+1}$$

This property is fundamental in deriving many reinforcement learning algorithms.
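The recursion is easy to verify numerically. The short sketch below computes the discounted return of an arbitrary reward sequence both directly from the sum and by sweeping backwards with $G_t = R_{t+1} + \gamma G_{t+1}$; the two agree.

```python
# Discounted return computed two ways; the reward sequence is arbitrary.
rewards = [1.0, 0.0, -2.0, 5.0]   # R_1, R_2, R_3, R_4
gamma = 0.9

# Direct definition: G_0 = sum_k gamma^k * R_{k+1}
g0_direct = sum(gamma ** k * r for k, r in enumerate(rewards))

# Recursive definition, swept backwards from the end: G_t = R_{t+1} + gamma * G_{t+1}
returns = [0.0] * (len(rewards) + 1)   # G_T = 0 beyond the last reward
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]

assert abs(g0_direct - returns[0]) < 1e-12
print(returns[0])   # 1 + 0.9*0 + 0.81*(-2) + 0.729*5 = 3.025
```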

Grid World

Throughout the course, we'll use a canonical example called Grid World to build intuition about MDPs and RL algorithms. In grid world, an agent navigates a grid to reach a goal while avoiding obstacles and pitfalls, with rewards assigned to reaching specific squares.

In our Grid World, the agent can move in four directions. If it tries to move out of the grid or into an obstacle, it stays in the same cell. If it lands on the minus-one square, it receives a reward of -1 and the episode ends; if it lands on the plus-one square, it receives a reward of +1 and the episode ends.

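A minimal sketch of these rules as a step function is shown below. The 4x3 layout, the wall at cell (2, 2), and the terminal squares at (4, 3) (the +1) and (4, 2) (the -1) are assumptions matching the grid used in the value calculations later in this chapter; cells are addressed as (column, row) with (1, 1) at the bottom-left.

```python
# Deterministic Grid World dynamics as described above (4 columns x 3 rows).
# The wall at (2, 2) and the terminal squares at (4, 3) and (4, 2) are assumptions.
WALLS = {(2, 2)}
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Return (next_state, reward, done) for one deterministic move."""
    col, row = state
    d_col, d_row = MOVES[action]
    candidate = (col + d_col, row + d_row)
    # Moving off the grid or into a wall leaves the agent in the same cell.
    if not (1 <= candidate[0] <= 4 and 1 <= candidate[1] <= 3) or candidate in WALLS:
        candidate = state
    reward = TERMINALS.get(candidate, 0.0)
    done = candidate in TERMINALS
    return candidate, reward, done

# Example: moving right from (3, 3) reaches the +1 square and ends the episode.
assert step((3, 3), "right") == ((4, 3), 1.0, True)
```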

Policies and Value Functions

A policy defines the agent's behavior at each state. It is a mapping from states to actions.

  • Deterministic Policy ($\pi$): Maps each state to a specific action.

    $$\pi(s) = a$$

  • Stochastic Policy ($\pi(a|s)$): Assigns probabilities to each action in a state.

    $$\pi(a|s) \doteq \text{Pr}(A_t = a | S_t = s)$$

Grid World

A deterministic policy for our Grid World assigns a single action (a movement direction) to each non-terminal cell.

Value functions estimate how good it is for an agent to be in a given state (or state-action pair), considering future rewards.

  • State-Value Function ($v_\pi(s)$): Expected return starting from state $s$ and following policy $\pi$.

    $$v_\pi(s) \doteq \mathbb{E}_\pi [G_t | S_t = s] = \mathbb{E}_\pi \bigg[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \big| S_t = s \bigg], \forall s \in \mathcal{S}$$

  • Action-Value Function ($q_\pi(s, a)$): Expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$.

    $$q_\pi(s, a) \doteq \mathbb{E}_\pi [G_t | S_t = s, A_t = a] = \mathbb{E}_\pi \bigg[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \big| S_t = s, A_t = a \bigg]$$

Value functions allow agents to evaluate the desirability of states and actions without waiting for long-term outcomes.
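Because both value functions are expectations of the return, one conceptually simple way to estimate them is to average sampled returns. The sketch below is illustrative only: the tiny `policy` table and the `rollout_return` argument (any function that runs one episode from a state under a policy and returns its discounted return $G$) are hypothetical placeholders.

```python
import random

# A stochastic policy represented as action probabilities per state.
# The state and action names here are placeholders for illustration.
policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(policy, state):
    """Draw A_t ~ pi(.|S_t) from the stochastic policy."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

def estimate_v(state, policy, rollout_return, n_episodes=1000):
    """Monte Carlo estimate of v_pi(state): average the return over many episodes.
    `rollout_return(state, policy)` is a hypothetical helper that runs one episode
    starting from `state` while following `policy` and returns its discounted return."""
    total = sum(rollout_return(state, policy) for _ in range(n_episodes))
    return total / n_episodes
```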

Grid World

Let's imagine three different scenarios in our Grid World:

  1. In a deterministic environment with no discounting ($\gamma = 1$) and a horizon $H$ of 100 steps, we can calculate the value function as follows:

    • $V^*(4,3)$: The agent immediately collects the +1 reward.
    • $V^*(3,3)$: The agent moves right to collect the +1 reward in one step.
    • $V^*(2,3)$: The agent moves right twice to collect the reward, still yielding a value of +1 due to no discounting.
    • $V^*(1,1)$: It takes five steps to reach the reward, but the value is still +1.
    • $V^*(4,2)$: The agent is stuck in the bomb and receives a reward of -1.

    [Value grid (rows from top to bottom): 1.00 1.00 1.00 0.00 | 1.00 0.00 1.00 0.00 | 1.00 1.00 1.00 1.00]


  2. Introducing a discount factor $\gamma$ of 0.9, where actions are still deterministic:

    • $V^*(4,3)$: The value remains 1 because the reward is immediate.
    • $V^*(3,3)$: The reward is delayed by one step, so the value is $0.9^1 \times 1 = 0.9$.
    • $V^*(2,3)$: The reward is delayed by two steps, so the value is $0.9^2 \times 1 = 0.81$.
    • $V^*(1,1)$: The reward is delayed by five steps, so the value is $0.9^5 \times 1 \approx 0.59$.
    • $V^*(4,2)$: The value remains -1 due to the bomb.

    [Value grid (rows from top to bottom): 0.73 0.81 0.90 0.00 | 0.66 0.00 0.81 0.00 | 0.59 0.66 0.73 0.66]


  3. Adding Stochasticity, where the action success probability drops to 80%:

    • $V^*(4,3)$: The value remains 1 since grabbing the prize is certain.
    • $V^*(3,3)$: Moving right has an 80% success rate. If successful, the value is $0.9 \times 1$. With a 20% chance, the agent may stay in place (trying to move up), move down, or move left, affecting the value recursively based on the values of neighboring squares.

This recursive dependency on neighboring values is captured by the Bellman Equation, which expresses the value of a state in terms of expected rewards and the values of successor states.

Bellman Equations

The Bellman equation expresses the relationship between the value of a state and the values of its successor states.

$$v_\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_\pi(s') \right]$$

  • The value of state $s$ is the expectation, over actions $a$ chosen by $\pi$ and transitions to $(s', r)$, of the immediate reward $r$ plus the discounted value of the successor state $s'$.

Similarly, for the action-value function:

$$q_\pi(s, a) = \sum_{s', r} p(s', r | s, a) \left[ r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s', a') \right]$$
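Since $v_\pi$ appears on both sides, the Bellman equation is a set of simultaneous linear equations, one per state. A common way to solve it, previewed here and developed fully in the next chapter, is to apply the right-hand side repeatedly as an update until the values stop changing. The sketch below assumes the (state, action) -> [(probability, next state, reward), ...] dynamics table and the {action: probability} policy format used in the earlier sketches.

```python
# Iterative evaluation of v_pi, using the Bellman equation as an update rule.
# `dynamics` maps (state, action) -> [(prob, next_state, reward), ...] and
# `policy` maps state -> {action: probability}, as in the earlier sketches.

def evaluate_policy(states, dynamics, policy, gamma=0.9, tol=1e-8):
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                prob_a * sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])
                for a, prob_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# Example: evaluate an "always search" policy on the recycling-robot table P.
# v = evaluate_policy(["high", "low"], P,
#                     {"high": {"search": 1.0}, "low": {"search": 1.0}})
```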

Optimality in MDPs

An optimal policy $\pi^*$ yields the highest value for all states compared to all other policies.

  • Definition:

    $$v_{\pi^*}(s) \geq v_\pi(s), \quad \forall s \in \mathcal{S}, \forall \pi$$

    There may be multiple optimal policies that share the same optimal value function.

  • Optimal State-Value Function ($v_*(s)$):

    $$v_*(s) = \max_\pi v_\pi(s)$$

  • Optimal Action-Value Function ($q_*(s, a)$):

    $$q_*(s, a) = \max_\pi q_\pi(s, a)$$

Bellman Optimality Equations

These equations define the optimal value functions recursively.

  • For $v_*(s)$:

    $$v_*(s) = \max_{a} \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_*(s') \right]$$

  • For $q_*(s, a)$:

    $$q_*(s, a) = \sum_{s', r} p(s', r | s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]$$

Once we have $v_*(s)$ or $q_*(s, a)$, we can derive an optimal policy (sketched in code after the two cases below):

  • From $v_*(s)$:

    $$\pi^*(s) = \arg\max_{a} \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_*(s') \right]$$

  • From $q_*(s, a)$:

    $$\pi^*(s) = \arg\max_{a} q_*(s, a)$$
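The two argmax expressions translate directly into code. A minimal sketch, again assuming the (state, action) -> [(probability, next state, reward), ...] dynamics format used earlier and an `actions(s)` helper listing the actions available in state $s$:

```python
# Deriving a greedy (optimal) policy once v* or q* is available.

def greedy_from_v(states, actions, dynamics, v_star, gamma=0.9):
    """One-step lookahead: pi*(s) = argmax_a sum_{s', r} p(s', r | s, a) [r + gamma v*(s')]."""
    policy = {}
    for s in states:
        q = {a: sum(p * (r + gamma * v_star[s2]) for p, s2, r in dynamics[(s, a)])
             for a in actions(s)}
        policy[s] = max(q, key=q.get)
    return policy

def greedy_from_q(states, actions, q_star):
    """Read the optimal action directly off q*: pi*(s) = argmax_a q*(s, a)."""
    return {s: max(actions(s), key=lambda a: q_star[(s, a)]) for s in states}
```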

Recap

1. What is a Markov Decision Process (MDP)?

2. Which property defines the Markov assumption?

3. What does the transition function $p(s'|s,a)$ represent in an MDP?

4. How is the value of a state $v_\pi(s)$ defined under policy $\pi$?

5. What is the Bellman equation used for?

What's Next

In the next chapter, we introduce Dynamic Programming (DP), a set of methods for solving Markov Decision Processes (MDPs) when we have access to the environment's dynamics. You'll learn how to compute value functions, improve policies, and derive optimal policies using concepts like policy evaluation, policy improvement, and Generalized Policy Iteration (GPI).