
Temporal-Difference Learning

info

In this chapter, you'll learn about:

  • Temporal-Difference (TD) Learning: Combining ideas from Monte Carlo and Dynamic Programming methods.
  • TD(0) Algorithm: The simplest form of TD learning for policy evaluation.
  • Sarsa and Expected Sarsa: On-policy TD control methods for learning action values.
  • Q-Learning and Double Q-Learning: Off-policy TD control, and a technique for reducing maximization bias.

In previous chapters, we've explored Dynamic Programming (DP) methods, which require a complete model of the environment, and Monte Carlo (MC) methods, which learn from complete episodes but don't bootstrap. Temporal-Difference (TD) Learning bridges the gap between these two approaches by learning directly from raw experience without a model and updating estimates based partly on other learned estimates (bootstrapping). In this chapter, we'll see how TD methods learn state-values and policies incrementally and online.

Introduction to Temporal-Difference Learning

Temporal-Difference (TD) Learning is a fundamental idea in reinforcement learning that combines concepts from both Monte Carlo methods and Dynamic Programming. Like Monte Carlo methods, TD learning can learn directly from raw experience without a model of the environment's dynamics. Like Dynamic Programming, TD methods update estimates based on other learned estimates (bootstrapping).

In TD learning, we update our value estimates based on the difference between consecutive estimates. The key idea is to use the TD error, which represents the difference between the predicted value and a better estimate obtained after observing the next reward and state.

TD(0) Algorithm

The simplest form of TD learning is the TD(0) algorithm, used for policy evaluation (prediction). The update rule for TD(0) is:

V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]

where:

  • V(S_t) is the current estimate of the value of state S_t.
  • R_{t+1} is the reward received after taking action A_t from state S_t.
  • \gamma is the discount factor.
  • \alpha is the step-size parameter (learning rate).
  • The term in brackets is the TD error \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t).

In TD(0), the update at time t uses the reward R_{t+1} and the estimated value V(S_{t+1}) of the next state. This allows the agent to learn online, updating its estimates after each time step, without waiting for the end of an episode as in Monte Carlo methods.

The TD error \delta_t measures the discrepancy between the current estimate V(S_t) and the target R_{t+1} + \gamma V(S_{t+1}). By repeatedly reducing this error, the estimates converge to the true state values of the evaluated policy (under standard step-size conditions).
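As a concrete illustration (with made-up numbers): suppose V(S_t) = 0.5, V(S_{t+1}) = 0.8, R_{t+1} = 1, \gamma = 0.9, and \alpha = 0.1. The TD target is 1 + 0.9 \times 0.8 = 1.72, the TD error is \delta_t = 1.72 - 0.5 = 1.22, and the update gives V(S_t) \leftarrow 0.5 + 0.1 \times 1.22 = 0.622.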

Tabular TD(0) Prediction
\begin{align*} &\textbf{Input:}\text{ the policy } \pi \text{ to be evaluated} \\ &\textbf{Algorithm parameter:}\text{ step size } \alpha \in ( 0,1 ] \\ &\textbf{Initialize:}\\ &\qquad V(s), \text{ for all } s \in \mathcal{S}^+ \text{ arbitrarily except that } V(\text{terminal})=0\\[1em] &\textbf{Loop for each episode:} \\ &\qquad \textbf{Initialize } S\\ &\qquad \textbf{Loop for each } \text{step of episode:}\\ &\qquad\qquad A \leftarrow \text{action given } \pi \text{ for } S \\ &\qquad\qquad \text{Take action } A \text{, observe } R,S' \\ &\qquad\qquad V(S) \leftarrow V(S) + \alpha [R+\gamma V(S') - V(S)] \\ &\qquad\qquad S \leftarrow S' \\ &\qquad \textbf{until } S \text{ is terminal} \\ \end{align*}
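To make the pseudocode concrete, here is a minimal Python sketch of tabular TD(0) prediction. It assumes a small Gym-style environment with discrete states, where env.reset() returns a state index and env.step(action) returns (next_state, reward, done, info), plus a policy function mapping a state to an action; these interfaces and names are illustrative assumptions, not part of the algorithm itself.

```python
import numpy as np

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation (illustrative sketch)."""
    # One value estimate per discrete state; terminal states are never
    # bootstrapped from, so their value effectively stays 0.
    V = np.zeros(env.observation_space.n)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # TD target: R + gamma * V(S'), with V(S') treated as 0 at termination
            target = reward + gamma * V[next_state] * (not done)
            # Move V(S) a step alpha toward the target (the bracketed TD error)
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```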

TD methods have several advantages due to their combination of bootstrapping (like DP) and sampling (like MC):

  • TD methods do not require a model of the environment's dynamics.
  • They can learn after each time step, making them suitable for continuing (non-episodic) tasks.
  • They often converge faster than Monte Carlo methods because they update estimates based on learned estimates rather than waiting for final outcomes.

Unlike Monte Carlo methods, which require complete episodes to compute returns, TD methods can learn from incomplete sequences. This makes TD methods applicable to non-episodic (continuing) tasks.

Compared with Monte Carlo methods:

  • Variance: TD methods typically have lower variance because they rely on bootstrapped estimates rather than full returns.
  • Bias: TD methods may introduce bias due to bootstrapping but often achieve better performance in practice.
  • Efficiency: TD methods can be more data-efficient, updating estimates more frequently.

Sarsa: On-policy TD Control

Sarsa is an on-policy TD control algorithm that learns the action-value function Q(s,a) for the policy being followed. The name "Sarsa" comes from the sequence (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) used in the update.

Sarsa (on-policy TD control)
\begin{align*} &\textbf{Algorithm parameter:}\text{ step size } \alpha \in ( 0,1 ], \text{small } \epsilon > 0 \\ &\textbf{Initialize:}\\ &\qquad Q(s,a), \text{ for all } s \in \mathcal{S}^+ \text{ arbitrarily except that } Q(\text{terminal}, \cdot)=0\\[1em] &\textbf{Loop for each episode:} \\ &\qquad \textbf{Initialize } S\\ &\qquad \textbf{Choose } A \text{ from } S \text{ using policy derived from } Q (\text{e.g., } \epsilon \text{-greedy})\\ &\qquad \textbf{Loop for each } \text{step of episode}:\\ &\qquad\qquad \text{Take action } A \text{, observe } R,S' \\ &\qquad\qquad \textbf{Choose } A' \text{ from } S' \text{ using policy derived from } Q (\text{e.g., } \epsilon \text{-greedy})\\ &\qquad\qquad Q(S,A) \leftarrow Q(S,A) + \alpha [R+\gamma Q(S',A') - Q(S,A)] \\ &\qquad\qquad S \leftarrow S'; A \leftarrow A';\\ &\qquad \textbf{until } S \text{ is terminal} \\ \end{align*}
Characteristics of Sarsa
  • On-Policy: Learns the value of the policy being followed, including its exploration behavior.
  • Exploration: Typically uses \epsilon-greedy policies to balance exploration and exploitation.
  • Convergence: Converges to an optimal policy if all state-action pairs are visited infinitely often and the policy converges to the greedy policy.
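The sketch below shows tabular Sarsa in Python, under the same illustrative Gym-style interface as the TD(0) example above (env.reset(), env.step() returning (next_state, reward, done, info)); the epsilon_greedy helper and all parameter names are likewise assumptions made for the example.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa(env, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Sarsa (on-policy TD control), illustrative sketch."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            # On-policy target uses the action actually chosen next: Q(S', A')
            target = reward + gamma * Q[next_state, next_action] * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```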

Q-learning: Off-policy TD Control

Q-Learning is an off-policy TD control algorithm that aims to learn the optimal action-value function Q^*(s,a), independent of the agent's policy. It uses the max over actions in the update, allowing it to learn about the optimal policy while following an exploratory policy.

Q-learning (off-policy TD control)
\begin{align*} &\textbf{Algorithm parameter:}\text{ step size } \alpha \in ( 0,1 ], \text{small } \epsilon > 0 \\ &\textbf{Initialize:}\\ &\qquad Q(s,a), \text{ for all } s \in \mathcal{S}^+ \text{ arbitrarily except that } Q(\text{terminal}, \cdot)=0\\[1em] &\textbf{Loop for each episode:} \\ &\qquad \textbf{Initialize } S\\ &\qquad \textbf{Loop for each } \text{step of episode:}\\ &\qquad\qquad \textbf{Choose } A \text{ from } S \text{ using policy derived from } Q (\text{e.g., } \epsilon \text{-greedy})\\ &\qquad\qquad \text{Take action } A \text{, observe } R,S' \\ &\qquad\qquad Q(S,A) \leftarrow Q(S,A) + \alpha [R+\gamma \max_a Q(S',a) - Q(S,A)] \\ &\qquad\qquad S \leftarrow S'\\ &\qquad \textbf{until } S \text{ is terminal} \\ \end{align*}
Characteristics of Q-Learning
  • Off-Policy: Learns about the optimal policy regardless of the agent's actions.
  • Exploration Policy: The agent can follow any exploration policy, as long as it continues to explore.
  • Convergence: Converges to the optimal action-value function Q^*(s,a) under certain conditions (e.g., sufficient exploration, decaying learning rate).
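A matching tabular Q-learning sketch, under the same illustrative environment assumptions as before, differs from Sarsa only in the target: it bootstraps from \max_a Q(S',a) rather than from the action the behavior policy actually takes next.

```python
import numpy as np

def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning (off-policy TD control), illustrative sketch."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q
            if rng.random() < epsilon:
                action = int(rng.integers(env.action_space.n))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # Off-policy target bootstraps from the greedy action: max_a Q(S', a)
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```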

Expected Sarsa

Expected Sarsa is a variation of the Sarsa algorithm that replaces the next action's value in the update with the expected value under the current policy. This can reduce variance and potentially improve learning stability.

The update rule for Expected Sarsa is:

\begin{align*} Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, \mathbb{E}_{a} [ Q(S_{t+1}, a) \mid S_{t+1} ] - Q(S_t, A_t) \right]\\ &\leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \right] \end{align*}
Characteristics of Expected Sarsa
  • Reduced Variance: By averaging over all possible actions, Expected Sarsa can have lower variance compared to Sarsa.
  • Improved Performance: In some cases, Expected Sarsa can learn faster or achieve better performance than Sarsa.
  • Computational Cost: Calculating the expected value requires summing over all possible actions, which can be computationally intensive if the action space is large.
  • Policy Representation: The policy must be explicitly known to compute the expected value.
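As a small illustration, the function below computes the Expected Sarsa target for a single transition, assuming the policy is \epsilon-greedy with respect to the current Q (so its action probabilities can be written down explicitly). The function and argument names are our own.

```python
import numpy as np

def expected_sarsa_target(q_next, reward, gamma, epsilon, done=False):
    """Expected Sarsa target for one transition under an epsilon-greedy policy.

    q_next is the vector Q(S', .) of action values in the next state.
    """
    if done:
        return reward  # no bootstrapping from a terminal state
    n_actions = len(q_next)
    # epsilon-greedy probabilities: epsilon/|A| for every action,
    # plus the remaining (1 - epsilon) mass on the greedy action
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(q_next))] += 1.0 - epsilon
    expected_value = float(np.dot(probs, q_next))  # sum_a pi(a|S') Q(S', a)
    return reward + gamma * expected_value
```

The surrounding control loop is the same as Sarsa's; this target simply replaces R + \gamma Q(S', A').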

Maximization Bias and Double Learning

In Q-Learning, the use of the maximum estimated action value in the update can introduce a positive bias, known as maximization bias. This bias occurs because the maximum of estimated values tends to overestimate the true maximum, especially when estimates are noisy.

Double Learning is a technique to mitigate maximization bias by decoupling the selection of actions from the evaluation of those actions.

Double Q-Learning uses two separate value functions, Q1Q_1 and Q2Q_2, which are updated alternately. The key idea is to use one set of estimates to choose the best action and the other to evaluate it.

Double Q-learning (off-policy TD control)
\begin{align*} &\textbf{Algorithm parameter:}\text{ step size } \alpha \in ( 0,1 ], \text{small } \epsilon > 0 \\ &\textbf{Initialize:}\\ &\qquad Q_1(s,a) \text{ and } Q_2(s,a),\\ &\qquad \text{for all } s \in \mathcal{S}^+ \text{ arbitrarily except that } Q(\text{terminal}, \cdot)=0\\[1em] &\textbf{Loop for each episode:} \\ &\qquad \textbf{Initialize } S\\ &\qquad \textbf{Loop for each } \text{step of episode:}\\ &\qquad\qquad \textbf{Choose } A \text{ from } S \text{ using the policy } \epsilon \text{-greedy in } Q_1+Q_2\\ &\qquad\qquad \textbf{Take action } A \text{, observe } R, S'\\ &\qquad\qquad \textbf{With } 0.5 \text{ probability:}\\ &\qquad\qquad\qquad Q_1(S,A) \leftarrow Q_1(S,A) + \alpha [R+\gamma Q_2(S', \argmax_a Q_1(S',a)) - Q_1(S,A)] \\ &\qquad\qquad \textbf{else:}\\ &\qquad\qquad\qquad Q_2(S,A) \leftarrow Q_2(S,A) + \alpha [R+\gamma Q_1(S', \argmax_a Q_2(S',a)) - Q_2(S,A)] \\ &\qquad\qquad S \leftarrow S'\\ &\qquad \textbf{until } S \text{ is terminal} \\ \end{align*}
Characteristics of Double Q-Learning
  • Reduced Bias: By decoupling action selection and evaluation, Double Q-Learning reduces the overestimation bias.
  • Improved Performance: Often leads to better performance in environments where maximization bias affects learning.
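The sketch below shows only the core Double Q-learning update for a single transition on two tabular arrays Q1 and Q2; the surrounding episode loop and the \epsilon-greedy action selection (with respect to Q1 + Q2) follow the same pattern as the Q-learning sketch above and are omitted here.

```python
import random
import numpy as np

def double_q_update(Q1, Q2, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.99):
    """One Double Q-learning update on tabular arrays Q1 and Q2 (illustrative)."""
    if random.random() < 0.5:
        # Q1 selects the best next action, Q2 evaluates it
        best = int(np.argmax(Q1[next_state]))
        target = reward + gamma * Q2[next_state, best] * (not done)
        Q1[state, action] += alpha * (target - Q1[state, action])
    else:
        # Q2 selects the best next action, Q1 evaluates it
        best = int(np.argmax(Q2[next_state]))
        target = reward + gamma * Q1[next_state, best] * (not done)
        Q2[state, action] += alpha * (target - Q2[state, action])
```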

Recap

1. What distinguishes Temporal-Difference (TD) Learning from Monte Carlo (MC) methods?

2. What is the TD error in the TD(0) algorithm?

3. What is a key characteristic of the Sarsa algorithm?

4. What is the main distinction between Sarsa and Q-Learning in terms of policy?

5. What is the primary benefit of Double Q-Learning over Q-Learning?

What's Next

So far, we've focused on model-free methods. In the next chapter, we'll explore Planning with Tabular Methods, where agents leverage models of the environment to simulate experience. By combining learning from real interactions with learning from simulated ones, agents can improve their policies more efficiently. We'll also explore the Dyna architecture, a unified framework integrating planning, learning, and acting.