Temporal-Difference (TD) Learning: Combining ideas from Monte Carlo and Dynamic Programming methods.
TD(0) Algorithm: The simplest form of TD learning for policy evaluation.
Sarsa and Expected Sarsa: On-policy TD control algorithms that learn action values under the policy being followed.
Q-Learning and Double Q-Learning: Off-policy TD control and a variant that mitigates maximization bias.
In previous chapters, we've explored Dynamic Programming (DP) methods, which require a complete model of the environment, and Monte Carlo (MC) methods, which learn from complete episodes but don't bootstrap. Temporal-Difference (TD) Learning bridges the gap between these two approaches by learning directly from raw experience without a model and updating estimates based partly on other learned estimates (bootstrapping). In this chapter, we'll see how TD methods learn state values and policies incrementally and online.
Temporal-Difference (TD) Learning is a fundamental idea in reinforcement learning that combines concepts from both Monte Carlo methods and Dynamic Programming. Like Monte Carlo methods, TD learning can learn directly from raw experience without a model of the environment's dynamics. Like Dynamic Programming, TD methods update estimates based on other learned estimates (bootstrapping).
In TD learning, we update our value estimates based on the difference between consecutive estimates. The key idea is to use the TD error, which represents the difference between the predicted value and a better estimate obtained after observing the next reward and state.
The simplest form of TD learning is the TD(0) algorithm, used for policy evaluation (prediction). The update rule for TD(0) is:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
$V(S_t)$ is the current estimate of the value of state $S_t$.
$R_{t+1}$ is the reward received after taking action $A_t$ from state $S_t$.
$\gamma$ is the discount factor.
$\alpha$ is the step-size parameter (learning rate).
The term in brackets is the TD error: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$.
In TD(0), the update at time $t$ uses the reward $R_{t+1}$ and the estimated value $V(S_{t+1})$ of the next state. This allows the agent to learn online, updating its estimates after each time step, without waiting for the end of an episode as in Monte Carlo methods.
The TD error $\delta_t$ measures the discrepancy between the predicted value $V(S_t)$ and the target value $R_{t+1} + \gamma V(S_{t+1})$. By repeatedly reducing this error (under standard step-size conditions), the value estimates converge to the true values of the policy being evaluated.
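To make this concrete, here is a minimal Python sketch of a single TD(0) update for one observed transition; the table size, state indices, reward, and parameter values are made up purely for illustration.

```python
import numpy as np

# Value table for a hypothetical 5-state problem (all values start at zero).
V = np.zeros(5)
alpha, gamma = 0.1, 0.9   # step size and discount factor

# One observed transition: from state 2 we received reward 1.0 and moved to state 3.
s, r, s_next = 2, 1.0, 3

td_error = r + gamma * V[s_next] - V[s]   # delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
V[s] += alpha * td_error                  # move V(S_t) a small step toward the TD target
```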
Tabular TD(0) Prediction
Input: the policy π to be evaluated
Algorithm parameter: step size α ∈ (0, 1]
Initialize: V(s), for all s ∈ S⁺, arbitrarily, except that V(terminal) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        A ← action given by π for S
        Take action A, observe R, S′
        V(S) ← V(S) + α[R + γV(S′) − V(S)]
        S ← S′
    until S is terminal
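Below is a sketch of tabular TD(0) prediction in Python. It assumes a hypothetical environment object exposing `num_states`, `reset()` returning an initial state, and `step(action)` returning `(next_state, reward, done)`, plus a `policy(state)` function; these names are not from any particular library, so adapt the interface to your own setup.

```python
import numpy as np

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation for a given policy (assumed env interface)."""
    # Zero initialization also satisfies V(terminal) = 0, since terminal states are never updated.
    V = np.zeros(env.num_states)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                                        # A <- action given by pi for S
            s_next, r, done = env.step(a)                        # take A, observe R, S'
            V[s] += alpha * (r + gamma * V[s_next] - V[s])       # TD(0) update
            s = s_next                                           # S <- S'
    return V
```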
TD methods have several advantages due to their combination of bootstrapping (like DP) and sampling (like MC):
TD methods do not require a model of the environment's dynamics.
They can learn after each time step, making them suitable for online and continuing tasks.
They often converge faster than Monte Carlo methods because they update estimates based on learned estimates rather than waiting for final outcomes.
Unlike Monte Carlo methods, which require complete episodes to compute returns, TD methods can learn from incomplete sequences. This makes TD methods applicable to non-episodic (continuing) tasks.
Variance: TD targets depend only on the next reward and the next state's estimate, so TD methods typically have lower variance than Monte Carlo methods, which use the full return.
Bias: TD methods may introduce bias due to bootstrapping but often achieve better performance in practice.
Efficiency: TD methods can be more data-efficient, updating estimates more frequently.
Sarsa is an on-policy TD control algorithm that learns the action-value function $Q(s, a)$ for the policy being followed. The name "Sarsa" comes from the sequence $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ used in the update.
Sarsa (on-policy TD control)
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize: Q(s, a), for all s ∈ S⁺, arbitrarily, except that Q(terminal, ·) = 0
Loop for each episode:
    Initialize S
    Choose A from S using policy derived from Q (e.g., ε-greedy)
    Loop for each step of episode:
        Take action A, observe R, S′
        Choose A′ from S′ using policy derived from Q (e.g., ε-greedy)
        Q(S, A) ← Q(S, A) + α[R + γQ(S′, A′) − Q(S, A)]
        S ← S′; A ← A′
    until S is terminal
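A possible Python sketch of Sarsa, under the same assumed environment interface as the TD(0) example (here with an added `num_actions` attribute), using an ε-greedy helper for action selection:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a greedy one for state s."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """On-policy TD control (Sarsa) over a tabular Q, assuming the hypothetical env interface."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.num_states, env.num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)                    # choose A from S
        done = False
        while not done:
            s_next, r, done = env.step(a)                         # take A, observe R, S'
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)      # choose A' from S'
            target = r + gamma * Q[s_next, a_next] * (not done)   # no bootstrap from terminal states
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next                                 # S <- S', A <- A'
    return Q
```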
Characteristics of Sarsa
On-Policy: Learns the value of the policy being followed, including its exploration behavior.
Exploration: Typically uses ϵ-greedy policies to balance exploration and exploitation.
Convergence: Converges to an optimal policy if all state-action pairs are visited infinitely often and the policy converges to the greedy policy.
Q-Learning is an off-policy TD control algorithm that aims to learn the optimal action-value function Q∗(s,a), independent of the agent's policy. It uses the max over actions in the update, allowing it to learn about the optimal policy while following an exploratory policy.
Q-learning (off-policy TD control)
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize: Q(s, a), for all s ∈ S⁺, arbitrarily, except that Q(terminal, ·) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S′
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
        S ← S′
    until S is terminal
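A matching Q-learning sketch under the same assumed interface, reusing the `epsilon_greedy` helper from the Sarsa example. The only change is the target: it bootstraps from max_a Q(S′, a) rather than from the action the agent actually takes next.

```python
import numpy as np

def q_learning(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Off-policy TD control (Q-learning), assuming the hypothetical env interface used above."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.num_states, env.num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, epsilon, rng)                # behave epsilon-greedily in Q
            s_next, r, done = env.step(a)
            target = r + gamma * np.max(Q[s_next]) * (not done)   # bootstrap from max_a Q(S', a)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```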
Characteristics of Q-Learning
Off-Policy: Learns about the optimal policy regardless of the agent's actions.
Exploration Policy: The agent can follow any exploration policy, as long as it continues to explore.
Convergence: Converges to the optimal action-value function Q∗(s,a) under certain conditions (e.g., sufficient exploration, decaying learning rate).
Expected Sarsa is a variation of the Sarsa algorithm that replaces the next action's value in the update with the expected value under the current policy. This can reduce variance and potentially improve learning stability.
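Concretely, the Expected Sarsa update rule is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) \, Q(S_{t+1}, a) - Q(S_t, A_t) \Big]$$

where $\pi$ is the current (typically ε-greedy) policy. Note that if $\pi$ is greedy with respect to $Q$, the expectation reduces to the maximum and the update becomes the Q-learning update.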
Reduced Variance: By averaging over all possible actions, Expected Sarsa can have lower variance compared to Sarsa.
Improved Performance: In some cases, Expected Sarsa can learn faster or achieve better performance than Sarsa.
Computational Cost: Calculating the expected value requires summing over all possible actions, which can be computationally intensive if the action space is large.
Policy Representation: The policy must be explicitly known to compute the expected value.
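As a sketch, here is a single Expected Sarsa update in Python, assuming the target policy is ε-greedy with respect to the current Q table; any other explicitly represented policy would simply change how the action probabilities are computed.

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Apply one Expected Sarsa update in place, assuming an epsilon-greedy target policy.

    Q is a (num_states, num_actions) array; (s, a, r, s_next, done) is one observed transition.
    """
    num_actions = Q.shape[1]
    # Action probabilities of the epsilon-greedy policy in state s_next.
    probs = np.full(num_actions, epsilon / num_actions)
    probs[int(np.argmax(Q[s_next]))] += 1.0 - epsilon
    expected_value = float(np.dot(probs, Q[s_next]))      # sum_a pi(a | S') * Q(S', a)
    target = r + gamma * expected_value * (not done)      # no bootstrap from terminal states
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```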
In Q-Learning, the use of the maximum estimated action value in the update can introduce a positive bias, known as maximization bias. This bias occurs because the maximum of estimated values tends to overestimate the true maximum, especially when estimates are noisy.
Double Learning is a technique to mitigate maximization bias by decoupling the selection of actions from the evaluation of those actions.
Double Q-Learning uses two separate value functions, Q1 and Q2, which are updated alternately. The key idea is to use one set of estimates to choose the best action and the other to evaluate it.
Double Q-learning (off-policy TD control)
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize: Q1(s, a) and Q2(s, a), for all s ∈ S⁺, arbitrarily, except that Q1(terminal, ·) = Q2(terminal, ·) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A from S using the policy ε-greedy in Q1 + Q2
        Take action A, observe R, S′
        With 0.5 probability:
            Q1(S, A) ← Q1(S, A) + α[R + γ Q2(S′, argmax_a Q1(S′, a)) − Q1(S, A)]
        else:
            Q2(S, A) ← Q2(S, A) + α[R + γ Q1(S′, argmax_a Q2(S′, a)) − Q2(S, A)]
        S ← S′
    until S is terminal
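A Python sketch of Double Q-learning under the same assumed environment interface, reusing the `epsilon_greedy` helper from the Sarsa example; actions are selected ε-greedily in Q1 + Q2, and on each step a coin flip decides which table is updated.

```python
import numpy as np

def double_q_learning(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Double Q-learning with two tabular estimates, assuming the hypothetical env interface."""
    rng = np.random.default_rng(seed)
    Q1 = np.zeros((env.num_states, env.num_actions))
    Q2 = np.zeros((env.num_states, env.num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q1 + Q2, s, epsilon, rng)             # behave epsilon-greedily in Q1 + Q2
            s_next, r, done = env.step(a)
            if rng.random() < 0.5:
                best = int(np.argmax(Q1[s_next]))                    # Q1 selects the action...
                target = r + gamma * Q2[s_next, best] * (not done)   # ...Q2 evaluates it
                Q1[s, a] += alpha * (target - Q1[s, a])
            else:
                best = int(np.argmax(Q2[s_next]))                    # Q2 selects the action...
                target = r + gamma * Q1[s_next, best] * (not done)   # ...Q1 evaluates it
                Q2[s, a] += alpha * (target - Q2[s, a])
            s = s_next
    return Q1, Q2
```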
Characteristics of Double Q-Learning
Reduced Bias: By decoupling action selection and evaluation, Double Q-Learning reduces the overestimation bias.
Improved Performance: Often leads to better performance in environments where maximization bias affects learning.
So far, we've seen only model-free methods. In the next chapter, we'll explore Planning with Tabular Methods, where agents can leverage models of the environment to simulate experience. By combining learning from real interactions with learning from simulated ones, agents can improve their policies more efficiently. We'll also explore the Dyna architecture, a unified framework integrating planning, learning, and acting.