Monte Carlo Methods: Techniques for learning directly from episodes of experience.
Monte Carlo Prediction and Control: Estimating value functions and finding optimal policies without a model.
Exploration Strategies: Addressing the exploration-exploitation trade-off in Monte Carlo methods.
Off-Policy Learning: Learning about one policy while following another.
Importance Sampling: Adjusting returns to correct for differences between behavior and target policies.
In the previous chapter, we explored how Dynamic Programming (DP) methods can solve Markov Decision Processes when we have a complete model of the environment's dynamics. However, in many real-world situations, we might not know these dynamics or they could be too complex to model accurately. Monte Carlo (MC) methods address this limitation by learning directly from raw experience without requiring a model of the environment. In this chapter, we'll see how MC methods learn state-values and policies from episodes of interaction with the environment.
MC methods involve learning value functions and optimal policies by averaging sample returns from episodes of interaction with the environment. Unlike dynamic programming, which requires complete knowledge of the environment's dynamics, MC methods learn directly from sampled experience.
Key characteristics of Monte Carlo methods:
Model-Free: They do not require a model of the environment's transition probabilities.
Episodic Tasks: They are particularly suited for tasks that naturally break into episodes, which eventually terminate.
Experience-Based Learning: They estimate value functions by averaging over observed returns.
Monte Carlo Prediction focuses on estimating the state-value function vπ(s) for a given policy π by averaging the returns observed after visiting state s.
First-visit MC prediction
    Input: π, the policy to be evaluated
    Initialize:
        V(s) ∈ ℝ, arbitrarily, for all s ∈ S
        Returns(s) ← an empty list, for all s ∈ S
    Loop forever (for each episode):
        Generate an episode following π: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T
        G ← 0
        Loop for each step of episode, t = T-1, T-2, ..., 0:
            G ← γG + R_{t+1}
            Unless S_t appears in S_0, S_1, ..., S_{t-1}:
                Append G to Returns(S_t)
                V(S_t) ← average(Returns(S_t))
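To make the procedure concrete, here is a minimal Python sketch of first-visit MC prediction. It assumes a user-supplied generate_episode(policy) callable that returns a list of (state, action, reward) triples, where the reward in position t is R_{t+1}; that interface, and the function names, are illustrative assumptions rather than part of any particular library.

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate V(s) for `policy` by averaging first-visit returns."""
    returns = defaultdict(list)   # Returns(s): all first-visit returns observed for s
    V = defaultdict(float)        # current value estimates

    for _ in range(num_episodes):
        episode = generate_episode(policy)   # [(S_0, A_0, R_1), (S_1, A_1, R_2), ...]
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return G.
        for t in range(len(episode) - 1, -1, -1):
            state, _, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: record G only if S_t does not appear earlier.
            if all(s != state for s, _, _ in episode[:t]):
                returns[state].append(G)
                V[state] = sum(returns[state]) / len(returns[state])
    return V
```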
There are two main variations of on-policy MC prediction discussed in the book:
First-Visit MC: Estimates the value of a state s by averaging the returns following the first occurrence of s in each episode.
Every-Visit MC: Estimates the value by averaging the returns following every occurrence of s in each episode.
Both methods converge to the true expected values as the number of visits to each state approaches infinity. For first-visit MC, each averaged return is an independent, unbiased estimate of vπ(s), so the standard error of the estimate falls as 1/√n, where n is the number of returns averaged; every-visit MC is biased for any finite n but also converges in the limit.
In Monte Carlo methods, the value estimates for each state are independent of each other. Unlike Dynamic Programming, which bootstraps by updating estimates based on other estimates, MC methods do not rely on other state values. This independence means that the estimation of V(s) depends solely on the returns observed after visiting state s. This leads to updates with higher variance when compared to DP methods.
The computational cost of estimating the value of a single state is independent of the size of the state space. Since MC methods do not require sweeping over all states, we can focus computational resources on states of interest. We can generate many sample episodes starting from specific states and average the returns to estimate their values.
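As a rough sketch of that idea, assuming a hypothetical generate_episode_from(s0, policy) helper that rolls out the policy from a chosen start state and returns the sequence of rewards received, the value of a single state of interest can be estimated without touching the rest of the state space:

```python
def estimate_state_value(s0, policy, generate_episode_from, num_episodes, gamma=1.0):
    """Average the discounted returns of episodes started from the single state s0."""
    total = 0.0
    for _ in range(num_episodes):
        rewards = generate_episode_from(s0, policy)   # [R_1, R_2, ..., R_T]
        G = 0.0
        for r in reversed(rewards):                   # discounted return from s0
            G = gamma * G + r
        total += G
    return total / num_episodes
```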
While estimating state values is useful, we often need to learn the action-value function qπ(s, a) in order to derive a policy: without a model of the environment (unlike in DP), state values alone do not tell us which action to take. This involves estimating the expected return after taking action a in state s and thereafter following policy π.
A challenge in estimating action values with MC methods is ensuring that all state-action pairs are explored sufficiently. Without adequate exploration, some actions may never be tried in certain states, leading to incomplete or biased estimates. Two strategies that try to solve this problem are:
Exploring Starts: Randomly initialize episodes with every possible state-action pair, ensuring all actions are tried.
Stochastic Policies: Use policies that have a non-zero probability of selecting all actions (e.g., ϵ-greedy policies).
By combining policy evaluation and policy improvement steps, we can create a Monte Carlo control algorithm within the Generalized Policy Iteration (GPI) framework.
Monte Carlo Exploring Starts (ES)
    Initialize:
        π(s) ∈ A(s), arbitrarily, for all s ∈ S
        Q(s, a) ∈ ℝ, arbitrarily, for all s ∈ S, a ∈ A(s)
        Returns(s, a) ← an empty list, for all s ∈ S, a ∈ A(s)
    Loop forever (for each episode):
        Choose S_0 ∈ S and A_0 ∈ A(S_0) randomly such that all pairs have probability > 0
        Generate an episode from S_0, A_0, following π: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T
        G ← 0
        Loop for each step of episode, t = T-1, T-2, ..., 0:
            G ← γG + R_{t+1}
            Unless the pair S_t, A_t appears in S_0, A_0, S_1, A_1, ..., S_{t-1}, A_{t-1}:
                Append G to Returns(S_t, A_t)
                Q(S_t, A_t) ← average(Returns(S_t, A_t))
                π(S_t) ← argmax_a Q(S_t, a)
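Below is a hedged Python sketch of Monte Carlo ES. It assumes a generate_episode_es(policy) callable that starts each episode from a randomly chosen state-action pair (the exploring start) and then follows the current policy, and a state_actions mapping from each state to its available actions; both are illustrative assumptions about your environment, not part of the algorithm itself.

```python
import random
from collections import defaultdict

def mc_exploring_starts(generate_episode_es, state_actions, num_episodes, gamma=1.0):
    Q = defaultdict(float)        # action-value estimates Q(s, a)
    returns = defaultdict(list)   # Returns(s, a)
    # Start from an arbitrary deterministic policy.
    policy = {s: random.choice(acts) for s, acts in state_actions.items()}

    for _ in range(num_episodes):
        episode = generate_episode_es(policy)   # exploring start: random (S_0, A_0)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward
            # First-visit check on the (state, action) pair.
            if all((s, a) != (state, action) for s, a, _ in episode[:t]):
                returns[(state, action)].append(G)
                Q[(state, action)] = sum(returns[(state, action)]) / len(returns[(state, action)])
                # Greedy policy improvement at the visited state.
                policy[state] = max(state_actions[state], key=lambda a: Q[(state, a)])
    return policy, Q
```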
While exploring starts ensure that every state-action pair is tried, they are often impractical: in many real-world problems we cannot start an episode from an arbitrary state-action pair. Instead, we can use ϵ-soft policies to maintain exploration.
Definition: An ϵ-soft policy is a stochastic policy in which every action has a probability of at least ϵ/∣A(s)∣ in every state, ensuring continual exploration.
ϵ-Greedy Policy: A common ϵ-soft policy where the agent chooses the best-known (greedy) action with probability 1−ϵ and an action drawn uniformly at random with probability ϵ (a short sketch of the resulting action probabilities appears just after this list).
Limitation: ϵ-soft policies do not converge to the true optimal policy but to the best policy among ϵ-soft policies, since they never stop exploring.
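As a small illustration of these definitions, the sketch below computes the ϵ-greedy action probabilities for one state from a table of action values; q_values is a hypothetical dict mapping each action to its current estimate.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s): every action keeps at least epsilon/|A(s)| probability,
    and the greedy action receives the remaining 1 - epsilon on top of that."""
    n = len(q_values)
    greedy_action = max(q_values, key=q_values.get)
    return {a: epsilon / n + (1.0 - epsilon if a == greedy_action else 0.0)
            for a in q_values}
```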
We can modify our Monte Carlo control algorithm to work with ϵ-soft policies:
On-policy MC control for ϵ-soft policies
    Algorithm parameter: small ϵ > 0
    Initialize:
        π ← an arbitrary ϵ-soft policy
        Q(s, a) ∈ ℝ, arbitrarily, for all s ∈ S, a ∈ A(s)
        Returns(s, a) ← an empty list, for all s ∈ S, a ∈ A(s)
    Loop forever (for each episode):
        Generate an episode following π: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T
        G ← 0
        Loop for each step of episode, t = T-1, T-2, ..., 0:
            G ← γG + R_{t+1}
            Unless the pair S_t, A_t appears in S_0, A_0, S_1, A_1, ..., S_{t-1}, A_{t-1}:
                Append G to Returns(S_t, A_t)
                Q(S_t, A_t) ← average(Returns(S_t, A_t))
                A* ← argmax_a Q(S_t, a)
                For all a ∈ A(S_t):
                    π(a|S_t) ← 1 − ϵ + ϵ/∣A(S_t)∣   if a = A*
                    π(a|S_t) ← ϵ/∣A(S_t)∣           if a ≠ A*
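Here is a minimal Python sketch of that algorithm. Rather than storing π(a|S_t) explicitly, it acts ϵ-greedily with respect to the current Q at every step, which realizes the same ϵ-soft policy update implicitly. It assumes a simple environment object with env.reset() returning the initial state and env.step(action) returning (next_state, reward, done); that interface, like the function name, is an assumption made for illustration.

```python
import random
from collections import defaultdict

def on_policy_mc_control(env, actions, num_episodes, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)        # Q(s, a) estimates
    returns = defaultdict(list)   # Returns(s, a)

    for _ in range(num_episodes):
        # Generate one episode with the current epsilon-greedy (epsilon-soft) policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)                      # explore
            else:
                action = max(actions, key=lambda a: Q[(state, a)])   # exploit
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # First-visit MC updates, walking backwards through the episode.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if all((s_, a_) != (s, a) for s_, a_, _ in episode[:t]):
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q
```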
The methods so far are on-policy: they evaluate and improve the same policy that is used to generate the episodes. Off-policy methods instead learn about a target policy π while following a different behavior policy b, and correct for the mismatch by weighting the observed returns with importance-sampling ratios. We can modify the Monte Carlo prediction algorithm to perform off-policy learning using weighted importance sampling:
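For reference, the correction applied in the algorithms below is the standard importance-sampling ratio: the product, from time t to the end of the episode, of the probability of each chosen action under the target policy divided by its probability under the behavior policy,

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$

The weight W in the pseudocode accumulates this product one factor at a time as the episode is processed backwards, and weighted importance sampling normalizes each update by the cumulative weight C(s, a) rather than by a simple visit count.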
Off-policy MC prediction (policy evaluation)
    Input: an arbitrary target policy π
    Initialize, for all s ∈ S, a ∈ A(s):
        Q(s, a) ∈ ℝ, arbitrarily
        C(s, a) ← 0
    Loop forever (for each episode):
        b ← any policy with coverage of π
        Generate an episode following b: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T
        G ← 0
        W ← 1
        Loop for each step of episode, t = T-1, T-2, ..., 0, while W ≠ 0:
            G ← γG + R_{t+1}
            C(S_t, A_t) ← C(S_t, A_t) + W
            Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
            W ← W · π(A_t|S_t) / b(A_t|S_t)
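A Python sketch of the procedure follows. It assumes three user-supplied callables, all illustrative: target_prob(a, s) returning π(a|s), behavior_prob(a, s) returning b(a|s) (with b covering π), and generate_episode_b() producing one episode under the behavior policy.

```python
from collections import defaultdict

def off_policy_mc_prediction(target_prob, behavior_prob, generate_episode_b,
                             num_episodes, gamma=1.0):
    Q = defaultdict(float)   # Q(s, a) estimates for the target policy pi
    C = defaultdict(float)   # cumulative sums of importance-sampling weights

    for _ in range(num_episodes):
        episode = generate_episode_b()   # [(S_0, A_0, R_1), ...] generated by b
        G, W = 0.0, 1.0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward
            C[(state, action)] += W
            # Weighted-importance-sampling update toward the observed return G.
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            # Multiply in one more factor of the importance-sampling ratio.
            W *= target_prob(action, state) / behavior_prob(action, state)
            if W == 0.0:                 # pi would never take A_t: stop early
                break
    return Q
```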
We can extend off-policy learning to control: the target policy π is the greedy policy with respect to Q, while episodes are generated by a soft behavior policy b.
Off-policy MC control
    Initialize, for all s ∈ S, a ∈ A(s):
        Q(s, a) ∈ ℝ, arbitrarily
        C(s, a) ← 0
        π(s) ← argmax_a Q(s, a)
    Loop forever (for each episode):
        b ← any soft policy
        Generate an episode following b: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T
        G ← 0
        W ← 1
        Loop for each step of episode, t = T-1, T-2, ..., 0:
            G ← γG + R_{t+1}
            C(S_t, A_t) ← C(S_t, A_t) + W
            Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
            π(S_t) ← argmax_a Q(S_t, a)
            If A_t ≠ π(S_t) then exit inner Loop (proceed to next episode)
            W ← W · 1 / b(A_t|S_t)
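Finally, a sketch of off-policy MC control. The behavior policy here is ϵ-greedy with respect to the current Q (any soft policy with coverage works), and the target policy is the greedy one; the per-step behavior probability is recorded at decision time so the weight update stays consistent. The env.reset()/env.step() interface is the same assumed interface as in the earlier sketches.

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, actions, num_episodes, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)   # Q(s, a), estimates for the greedy target policy
    C = defaultdict(float)   # cumulative importance-sampling weights

    def greedy(s):           # the target policy pi: greedy with respect to Q
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        # Generate one episode with the soft (epsilon-greedy) behavior policy b.
        episode, state, done = [], env.reset(), False
        while not done:
            explore = random.random() < epsilon
            action = random.choice(actions) if explore else greedy(state)
            # b(A_t | S_t) for this behavior policy, recorded at decision time.
            b_prob = epsilon / len(actions) + (1.0 - epsilon if action == greedy(state) else 0.0)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward, b_prob))
            state = next_state

        G, W = 0.0, 1.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r, b_prob = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            if a != greedy(s):   # pi(A_t|S_t) = 0: all remaining weights vanish
                break
            W *= 1.0 / b_prob    # pi(A_t|S_t) = 1 for the greedy action
    return Q
```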
While MC methods are model-free and work well with episodic tasks, they rely on complete episodes for updates, limiting their use in continuing tasks. In the next chapter, we'll introduce Temporal-Difference (TD) Learning, a hybrid approach that combines the strengths of Monte Carlo and Dynamic Programming methods. TD methods enable incremental updates after each time step, making them efficient for both episodic and continuing tasks.