
Off-policy Methods with Function Approximation

In the previous pages, we discussed on-policy prediction and control methods with function approximation. We focused on semi-gradient TD (for state-value prediction) and semi-gradient Sarsa (for action-value control), where the policy generating the data was also the policy being evaluated and improved.

This page explores the off-policy setting, where the policy generating data (behavior policy) may differ from the target policy whose value function (or action-value function) we wish to learn. Off-policy learning can be powerful—it enables learning from demonstrations or replay buffers, and learning about multiple target policies in parallel. However, combining function approximation, bootstrapping, and off-policy data introduces instability risks, known as the “deadly triad.”

We adapt well-known tabular off-policy algorithms (e.g., Q-learning, off-policy Sarsa) to the function-approximation setting. We will also see how importance sampling ratios appear in many off-policy algorithms to correct for discrepancies between the target and behavior policies.

Off-policy Prediction

Semi-Gradient Off-policy TD(0)

The goal of this method is to evaluate a target policy $\pi$ for state-value prediction, given experience generated by a behavior policy $b$. We have an approximate value function $\hat{v}(s, \mathbf{w})$, parameterized by $\mathbf{w}$, and we observe transitions $(S_t, A_t, R_{t+1}, S_{t+1})$ from following $b$. The off-policy TD(0) update is:

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha \,\rho_t\,\delta_t\,\nabla \hat{v}(S_t, \mathbf{w}_t),$$

where:

  • $\rho_t \doteq \dfrac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$ is the importance sampling ratio for action $A_t$.
  • $\delta_t = R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$ is the TD error.

Because the behavior policy $b$ may choose actions that $\pi$ would not (and vice versa), $\rho_t$ adjusts the update so that it is as if we were learning from $\pi$ while following $b$. The same semi-gradient idea applies here: the gradient is taken only with respect to $\hat{v}(S_t, \mathbf{w}_t)$, treating the bootstrapped target $\hat{v}(S_{t+1}, \mathbf{w}_t)$ as a constant.
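
To make the update concrete, here is a minimal sketch of one semi-gradient off-policy TD(0) step, assuming linear function approximation. The transition format and the placeholders `phi` (feature function), `pi`, and `b` (action-probability functions) are hypothetical and not part of the text above.

```python
import numpy as np

def semi_gradient_offpolicy_td0(w, transition, phi, pi, b, alpha=0.01, gamma=0.99):
    """One semi-gradient off-policy TD(0) update with linear features.

    w          : weight vector, v_hat(s, w) = w @ phi(s)
    transition : (s, a, r, s_next, done) observed while following b
    phi        : hypothetical feature function mapping a state to a NumPy vector
    pi, b      : hypothetical functions returning pi(a | s) and b(a | s)
    """
    s, a, r, s_next, done = transition

    rho = pi(a, s) / b(a, s)                       # importance sampling ratio
    v_s = w @ phi(s)
    v_next = 0.0 if done else w @ phi(s_next)      # bootstrap only if not terminal
    delta = r + gamma * v_next - v_s               # TD error

    # Semi-gradient: gradient of v_hat(S_t, w) only, which is phi(s) for linear features
    w += alpha * rho * delta * phi(s)
    return w
```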

Off-policy Control

Q-learning with Function Approximation

Q-learning is a classical off-policy control algorithm for finding an optimal policy. In the tabular case, the Q-learning update is:

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\,\Bigl[ R_{t+1} + \gamma\,\max_{a'} Q(S_{t+1}, a') - Q(S_t,A_t) \Bigr]$$

With function approximation, we represent $Q(s,a) \approx \hat{q}(s,a,\mathbf{w})$. The semi-gradient Q-learning update becomes:

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha\,\delta_t\,\nabla \hat{q}(S_t, A_t, \mathbf{w}_t),$$

where

$$\delta_t = R_{t+1} + \gamma \max_{a'}\,\hat{q}(S_{t+1}, a', \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t).$$

Note that Q-learning is inherently off-policy: it updates action values as though the greedy action will be taken in the next state, even if the behavior policy is more exploratory.
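
A minimal sketch of one semi-gradient Q-learning step with linear state-action features follows. The names `phi_sa` (feature function over state-action pairs), `actions`, and the transition format are illustrative assumptions, not definitions from the text.

```python
import numpy as np

def semi_gradient_q_learning_step(w, transition, phi_sa, actions, alpha=0.01, gamma=0.99):
    """One semi-gradient Q-learning update with linear features.

    w          : weight vector, q_hat(s, a, w) = w @ phi_sa(s, a)
    transition : (s, a, r, s_next, done) collected by any behavior policy
    phi_sa     : hypothetical feature function over (state, action) pairs
    actions    : iterable of the actions available in s_next
    """
    s, a, r, s_next, done = transition

    q_sa = w @ phi_sa(s, a)
    # Greedy (max) bootstrap target, regardless of what the behavior policy does next
    q_next = 0.0 if done else max(w @ phi_sa(s_next, a2) for a2 in actions)
    delta = r + gamma * q_next - q_sa              # TD error with the max target

    # Semi-gradient: only the current estimate's gradient, phi_sa(s, a)
    w += alpha * delta * phi_sa(s, a)
    return w
```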

Divergence Risk

Unlike tabular Q-learning (which converges under usual conditions), semi-gradient Q-learning can diverge in function-approximation settings if the behavior policy differs significantly from the greedy (target) policy. This is one manifestation of the “deadly triad.”

Off-policy Sarsa (Expected Sarsa)

An alternative off-policy control approach is semi-gradient Expected Sarsa, which replaces the $\max_{a'} \hat{q}$ target with an expectation under the target policy. Let $\pi$ be $\epsilon$-greedy with respect to $\hat{q}$. Then:

$$\delta_t = R_{t+1} + \gamma \sum_{a'} \pi(a' \mid S_{t+1})\,\hat{q}(S_{t+1}, a', \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t).$$
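
As an illustration, here is a sketch of how the Expected Sarsa bootstrap target could be computed under an $\epsilon$-greedy target policy, reusing the hypothetical `phi_sa` and `actions` placeholders from the Q-learning sketch above.

```python
import numpy as np

def expected_sarsa_target(w, r, s_next, done, phi_sa, actions, gamma=0.99, epsilon=0.1):
    """Expected Sarsa bootstrap target under an epsilon-greedy target policy.

    Replaces the max in Q-learning with the expectation of q_hat under pi(. | s_next).
    """
    if done:
        return r
    q_next = np.array([w @ phi_sa(s_next, a) for a in actions])
    # Epsilon-greedy probabilities with respect to q_hat
    probs = np.full(len(actions), epsilon / len(actions))
    probs[np.argmax(q_next)] += 1.0 - epsilon
    return r + gamma * probs @ q_next
```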

Emphatic and Gradient-TD Methods

When combining function approximation, bootstrapping, and off-policy data, the distribution mismatch can cause divergence. More advanced algorithms such as:

  • Gradient-TD (e.g., GTD2, TDC): perform true stochastic gradient descent on the projected Bellman error, using an auxiliary parameter vector to keep the updates stable (a sketch follows below).
  • Emphatic TD: reweight updates by a computed “emphasis” $M_t$ so that they effectively match an on-policy distribution.

are designed to guarantee convergence under broader conditions, albeit at the cost of higher complexity or variance.
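
To give a flavor of the Gradient-TD family, here is a minimal sketch of one TDC (TD with gradient correction) step for linear features. The transition format and the placeholders `phi`, `pi`, and `b` are assumptions carried over from the TD(0) sketch; this is an illustration of the auxiliary-vector idea, not a full treatment of the algorithm.

```python
import numpy as np

def tdc_step(w, v, transition, phi, pi, b, alpha=0.01, beta=0.05, gamma=0.99):
    """One off-policy TDC update with linear features.

    w : primary weight vector, v_hat(s, w) = w @ phi(s)
    v : auxiliary weight vector (a linear estimate of the importance-weighted TD error)
    """
    s, a, r, s_next, done = transition

    x = phi(s)
    x_next = np.zeros_like(x) if done else phi(s_next)
    rho = pi(a, s) / b(a, s)                        # importance sampling ratio
    delta = r + gamma * w @ x_next - w @ x          # TD error

    # Primary update: the semi-gradient term plus a correction built from v
    w += alpha * rho * (delta * x - gamma * (x @ v) * x_next)
    # Auxiliary update: regress the importance-weighted TD error onto the features
    v += beta * rho * (delta - x @ v) * x
    return w, v
```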

warning

To be included in a future iteration of this chapter:

  • The Deadly Triad
  • Projected Bellman Error (PBE)
  • Gradient-TD methods for minimizing the PBE
  • Emphatic-TD Methods
  • Variance minimization

Recap

1. What distinguishes off-policy learning from on-policy learning in RL?

2. Which factor often causes higher variance in off-policy methods?

3. What is one purpose of the importance sampling ratio ρ in off-policy algorithms?

What's Next

In the next chapter, we'll explore Policy Gradient and Actor-Critic methods, which offer a way to directly learn policies. These methods are particularly useful for handling continuous action spaces or environments with very complex value functions, where traditional value-based approaches may struggle.