Monte Carlo Methods: Techniques for learning directly from episodes of experience.
Monte Carlo Prediction and Control: Estimating value functions and finding optimal policies without a model.
Exploration Strategies: Addressing the exploration-exploitation trade-off in Monte Carlo methods.
Off-Policy Learning: Learning about one policy while following another.
Importance Sampling: Adjusting returns to correct for differences between behavior and target policies.
In the previous chapter, we explored how Dynamic Programming (DP) methods can solve Markov Decision Processes when we have a complete model of the environment's dynamics. However, in many real-world situations, we might not know these dynamics or they could be too complex to model accurately. Monte Carlo (MC) methods address this limitation by learning directly from raw experience without requiring a model of the environment. In this chapter, we'll see how MC methods learn state-values and policies from episodes of interaction with the environment.
MC methods involve learning value functions and optimal policies by averaging sample returns from episodes of interaction with the environment. Unlike dynamic programming, which requires complete knowledge of the environment's dynamics, MC methods learn directly from sampled experience.
Key characteristics of Monte Carlo methods:
Model-Free: They do not require a model of the environment's transition probabilities.
Episodic Tasks: They are particularly suited for tasks that naturally break into episodes, which eventually terminate.
Experience-Based Learning: They estimate value functions by averaging over observed returns.
Monte Carlo Prediction focuses on estimating the state-value function vπ(s) for a given policy π by averaging the returns observed after visiting state s.
First-visit MC prediction
    Input: π, the policy to be evaluated
    Initialize:
        V(s) ∈ ℝ, arbitrarily, for all s ∈ S
        Returns(s) ← an empty list, for all s ∈ S
    Loop forever (for each episode):
        Generate an episode following π: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T
        G ← 0
        Loop for each step of episode, t = T-1, T-2, ..., 0:
            G ← γG + R_{t+1}
            Unless S_t appears in S_0, S_1, ..., S_{t-1}:
                Append G to Returns(S_t)
                V(S_t) ← average(Returns(S_t))
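To make the procedure concrete, here is a minimal Python sketch of first-visit MC prediction. It assumes a user-supplied generate_episode(policy) callable that returns a list of (state, action, reward) triples, where the reward in position t is R_{t+1}; that interface, and the function names, are illustrative assumptions rather than part of any particular library.

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate V(s) for `policy` by averaging first-visit returns."""
    returns = defaultdict(list)   # Returns(s): all first-visit returns observed for s
    V = defaultdict(float)        # current value estimates

    for _ in range(num_episodes):
        episode = generate_episode(policy)   # [(S_0, A_0, R_1), (S_1, A_1, R_2), ...]
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return G.
        for t in range(len(episode) - 1, -1, -1):
            state, _, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: record G only if S_t does not appear earlier.
            if all(s != state for s, _, _ in episode[:t]):
                returns[state].append(G)
                V[state] = sum(returns[state]) / len(returns[state])
    return V
```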
There are two main variations of on-policy MC prediction discussed in the book:
First-Visit MC: Estimates the value of a state s by averaging the returns following the first occurrence of s in each episode.
Every-Visit MC: Estimates the value by averaging the returns following every occurrence of s in each episode.
Both methods converge to the true expected values as the number of visits to each state approaches infinity. For first-visit MC, each averaged return is an independent, unbiased estimate of vπ(s), so the standard error of the estimate falls as 1/√n, where n is the number of returns averaged; every-visit MC is biased for any finite n but also converges in the limit.
In Monte Carlo methods, the value estimates for each state are independent of each other. Unlike Dynamic Programming, which bootstraps by updating estimates based on other estimates, MC methods do not rely on other state values. This independence means that the estimation of V(s) depends solely on the returns observed after visiting state s. This leads to updates with higher variance when compared to DP methods.
The computational cost of estimating the value of a single state is independent of the size of the state space. Since MC methods do not require sweeping over all states, we can focus computational resources on states of interest. We can generate many sample episodes starting from specific states and average the returns to estimate their values.
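As a rough sketch of that idea, assuming a hypothetical generate_episode_from(s0, policy) helper that rolls out the policy from a chosen start state and returns the sequence of rewards received, the value of a single state of interest can be estimated without touching the rest of the state space:

```python
def estimate_state_value(s0, policy, generate_episode_from, num_episodes, gamma=1.0):
    """Average the discounted returns of episodes started from the single state s0."""
    total = 0.0
    for _ in range(num_episodes):
        rewards = generate_episode_from(s0, policy)   # [R_1, R_2, ..., R_T]
        G = 0.0
        for r in reversed(rewards):                   # discounted return from s0
            G = gamma * G + r
        total += G
    return total / num_episodes
```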
While estimating state values is useful, we often need to learn the action-value function qπ(s, a) in order to derive a policy: without a model of the environment (unlike in DP), state values alone do not tell us which action to take. This involves estimating the expected return after taking action a in state s and thereafter following policy π.
A challenge in estimating action values with MC methods is ensuring that all state-action pairs are explored sufficiently. Without adequate exploration, some actions may never be tried in certain states, leading to incomplete or biased estimates. Two strategies that try to solve this problem are:
Exploring Starts: Randomly initialize episodes with every possible state-action pair, ensuring all actions are tried.
Stochastic Policies: Use policies that have a non-zero probability of selecting all actions (e.g., ϵ-greedy policies).
By combining policy evaluation and policy improvement steps, we can create a Monte Carlo control algorithm within the Generalized Policy Iteration (GPI) framework.
Monte Carlo Exploring Starts (ES)
    Initialize:
        π(s) ∈ A(s), arbitrarily, for all s ∈ S
        Q(s, a) ∈ ℝ, arbitrarily, for all s ∈ S, a ∈ A(s)
        Returns(s, a) ← an empty list, for all s ∈ S, a ∈ A(s)
    Loop forever (for each episode):
        Choose S_0 ∈ S and A_0 ∈ A(S_0) randomly such that all pairs have probability > 0
        Generate an episode from S_0, A_0, following π: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T
        G ← 0
        Loop for each step of episode, t = T-1, T-2, ..., 0:
            G ← γG + R_{t+1}
            Unless the pair S_t, A_t appears in S_0, A_0, S_1, A_1, ..., S_{t-1}, A_{t-1}:
                Append G to Returns(S_t, A_t)
                Q(S_t, A_t) ← average(Returns(S_t, A_t))
                π(S_t) ← argmax_a Q(S_t, a)
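Below is a hedged Python sketch of Monte Carlo ES. It assumes a generate_episode_es(policy) callable that starts each episode from a randomly chosen state-action pair (the exploring start) and then follows the current policy, and a state_actions mapping from each state to its available actions; both are illustrative assumptions about your environment, not part of the algorithm itself.

```python
import random
from collections import defaultdict

def mc_exploring_starts(generate_episode_es, state_actions, num_episodes, gamma=1.0):
    Q = defaultdict(float)        # action-value estimates Q(s, a)
    returns = defaultdict(list)   # Returns(s, a)
    # Start from an arbitrary deterministic policy.
    policy = {s: random.choice(acts) for s, acts in state_actions.items()}

    for _ in range(num_episodes):
        episode = generate_episode_es(policy)   # exploring start: random (S_0, A_0)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward
            # First-visit check on the (state, action) pair.
            if all((s, a) != (state, action) for s, a, _ in episode[:t]):
                returns[(state, action)].append(G)
                Q[(state, action)] = sum(returns[(state, action)]) / len(returns[(state, action)])
                # Greedy policy improvement at the visited state.
                policy[state] = max(state_actions[state], key=lambda a: Q[(state, a)])
    return policy, Q
```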
While exploring starts ensure that every state-action pair is tried, they are often impractical: in many real-world problems we cannot start an episode from an arbitrary state-action pair. Instead, we can use ϵ-soft policies to maintain exploration.
Definition: An ϵ-soft policy is a stochastic policy in which every action has a probability of at least ϵ/∣A(s)∣ in every state, ensuring continual exploration.
ϵ-Greedy Policy: A common ϵ-soft policy where the agent chooses the best-known (greedy) action with probability 1−ϵ and an action drawn uniformly at random with probability ϵ (a short sketch of the resulting action probabilities appears just after this list).
Limitation: ϵ-soft policies do not converge to the true optimal policy but to the best policy among ϵ-soft policies, since they never stop exploring.
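As a small illustration of these definitions, the sketch below computes the ϵ-greedy action probabilities for one state from a table of action values; q_values is a hypothetical dict mapping each action to its current estimate.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s): every action keeps at least epsilon/|A(s)| probability,
    and the greedy action receives the remaining 1 - epsilon on top of that."""
    n = len(q_values)
    greedy_action = max(q_values, key=q_values.get)
    return {a: epsilon / n + (1.0 - epsilon if a == greedy_action else 0.0)
            for a in q_values}
```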
We can modify our Monte Carlo control algorithm to work with ϵ-soft policies:
On-policy MC control for ϵ-soft policies
    Algorithm parameter: small ϵ > 0
    Initialize:
        π ← an arbitrary ϵ-soft policy
        Q(s, a) ∈ ℝ, arbitrarily, for all s ∈ S, a ∈ A(s)
        Returns(s, a) ← an empty list, for all s ∈ S, a ∈ A(s)
    Loop forever (for each episode):
        Generate an episode following π: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T
        G ← 0
        Loop for each step of episode, t = T-1, T-2, ..., 0:
            G ← γG + R_{t+1}
            Unless the pair S_t, A_t appears in S_0, A_0, S_1, A_1, ..., S_{t-1}, A_{t-1}:
                Append G to Returns(S_t, A_t)
                Q(S_t, A_t) ← average(Returns(S_t, A_t))
                A* ← argmax_a Q(S_t, a)
                For all a ∈ A(S_t):
                    π(a|S_t) ← 1 − ϵ + ϵ/∣A(S_t)∣   if a = A*
                    π(a|S_t) ← ϵ/∣A(S_t)∣           if a ≠ A*
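Here is a minimal Python sketch of that algorithm. Rather than storing π(a|S_t) explicitly, it acts ϵ-greedily with respect to the current Q at every step, which realizes the same ϵ-soft policy update implicitly. It assumes a simple environment object with env.reset() returning the initial state and env.step(action) returning (next_state, reward, done); that interface, like the function name, is an assumption made for illustration.

```python
import random
from collections import defaultdict

def on_policy_mc_control(env, actions, num_episodes, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)        # Q(s, a) estimates
    returns = defaultdict(list)   # Returns(s, a)

    for _ in range(num_episodes):
        # Generate one episode with the current epsilon-greedy (epsilon-soft) policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)                      # explore
            else:
                action = max(actions, key=lambda a: Q[(state, a)])   # exploit
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # First-visit MC updates, walking backwards through the episode.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if all((s_, a_) != (s, a) for s_, a_, _ in episode[:t]):
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q
```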
The methods so far are on-policy: they evaluate and improve the same policy that is used to generate the episodes. Off-policy methods instead learn about a target policy π while following a different behavior policy b, and correct for the mismatch by weighting the observed returns with importance-sampling ratios. We can modify the Monte Carlo prediction algorithm to perform off-policy learning using weighted importance sampling:
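For reference, the correction applied in the algorithms below is the standard importance-sampling ratio: the product, from time t to the end of the episode, of the probability of each chosen action under the target policy divided by its probability under the behavior policy,

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$

The weight W in the pseudocode accumulates this product one factor at a time as the episode is processed backwards, and weighted importance sampling normalizes each update by the cumulative weight C(s, a) rather than by a simple visit count.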
Off-policy MC prediction (policy evaluation)
    Input: an arbitrary target policy π
    Initialize, for all s ∈ S, a ∈ A(s):
        Q(s, a) ∈ ℝ, arbitrarily
        C(s, a) ← 0
    Loop forever (for each episode):
        b ← any policy with coverage of π
        Generate an episode following b: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T
        G ← 0
        W ← 1
        Loop for each step of episode, t = T-1, T-2, ..., 0, while W ≠ 0:
            G ← γG + R_{t+1}
            C(S_t, A_t) ← C(S_t, A_t) + W
            Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
            W ← W · π(A_t|S_t) / b(A_t|S_t)
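A Python sketch of the procedure follows. It assumes three user-supplied callables, all illustrative: target_prob(a, s) returning π(a|s), behavior_prob(a, s) returning b(a|s) (with b covering π), and generate_episode_b() producing one episode under the behavior policy.

```python
from collections import defaultdict

def off_policy_mc_prediction(target_prob, behavior_prob, generate_episode_b,
                             num_episodes, gamma=1.0):
    Q = defaultdict(float)   # Q(s, a) estimates for the target policy pi
    C = defaultdict(float)   # cumulative sums of importance-sampling weights

    for _ in range(num_episodes):
        episode = generate_episode_b()   # [(S_0, A_0, R_1), ...] generated by b
        G, W = 0.0, 1.0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward
            C[(state, action)] += W
            # Weighted-importance-sampling update toward the observed return G.
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            # Multiply in one more factor of the importance-sampling ratio.
            W *= target_prob(action, state) / behavior_prob(action, state)
            if W == 0.0:                 # pi would never take A_t: stop early
                break
    return Q
```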
We can extend off-policy learning to control: the target policy π is the greedy policy with respect to Q, while episodes are generated by a soft behavior policy b.
Off-policy MC control
    Initialize, for all s ∈ S, a ∈ A(s):
        Q(s, a) ∈ ℝ, arbitrarily
        C(s, a) ← 0
        π(s) ← argmax_a Q(s, a)
    Loop forever (for each episode):
        b ← any soft policy
        Generate an episode following b: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T
        G ← 0
        W ← 1
        Loop for each step of episode, t = T-1, T-2, ..., 0:
            G ← γG + R_{t+1}
            C(S_t, A_t) ← C(S_t, A_t) + W
            Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
            π(S_t) ← argmax_a Q(S_t, a)
            If A_t ≠ π(S_t) then exit inner Loop (proceed to next episode)
            W ← W · 1 / b(A_t|S_t)
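Finally, a sketch of off-policy MC control. The behavior policy here is ϵ-greedy with respect to the current Q (any soft policy with coverage works), and the target policy is the greedy one; the per-step behavior probability is recorded at decision time so the weight update stays consistent. The env.reset()/env.step() interface is the same assumed interface as in the earlier sketches.

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, actions, num_episodes, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)   # Q(s, a), estimates for the greedy target policy
    C = defaultdict(float)   # cumulative importance-sampling weights

    def greedy(s):           # the target policy pi: greedy with respect to Q
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        # Generate one episode with the soft (epsilon-greedy) behavior policy b.
        episode, state, done = [], env.reset(), False
        while not done:
            explore = random.random() < epsilon
            action = random.choice(actions) if explore else greedy(state)
            # b(A_t | S_t) for this behavior policy, recorded at decision time.
            b_prob = epsilon / len(actions) + (1.0 - epsilon if action == greedy(state) else 0.0)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward, b_prob))
            state = next_state

        G, W = 0.0, 1.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r, b_prob = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            if a != greedy(s):   # pi(A_t|S_t) = 0: all remaining weights vanish
                break
            W *= 1.0 / b_prob    # pi(A_t|S_t) = 1 for the greedy action
    return Q
```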
While MC methods are model-free and work well with episodic tasks, they rely on complete episodes for updates, limiting their use in continuing tasks. In the next chapter, we'll introduce Temporal-Difference (TD) Learning, a hybrid approach that combines the strengths of Monte Carlo and Dynamic Programming methods. TD methods enable incremental updates after each time step, making them efficient for both episodic and continuing tasks.