
Off-policy Methods with Function Approximation

In the previous pages, we discussed on-policy prediction and control methods with function approximation. We focused on semi-gradient TD (for state-value prediction) and semi-gradient Sarsa (for action-value control), where the policy generating the data was also the policy being evaluated and improved.

This page explores the off-policy setting, where the policy generating data (behavior policy) may differ from the target policy whose value function (or action-value function) we wish to learn. Off-policy learning can be powerful—it enables learning from demonstrations or replay buffers, and learning about multiple target policies in parallel. However, combining function approximation, bootstrapping, and off-policy data introduces instability risks, known as the “deadly triad.”

We adapt well-known tabular off-policy algorithms (e.g., Q-learning, off-policy Sarsa) to the function-approximation setting. We will also see how importance sampling ratios appear in many off-policy algorithms to correct for discrepancies between the target and behavior policies.

Off-policy Prediction

Semi-Gradient Off-policy TD(0)

The goal of this method is to evaluate a target policy $\pi$ for state-value prediction, given experience generated by a behavior policy $b$. We have an approximate value function $\hat{v}(s, \mathbf{w})$, parameterized by $\mathbf{w}$, and we observe transitions $(S_t, A_t, R_{t+1}, S_{t+1})$ from following $b$. The off-policy TD(0) update is:

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha \,\rho_t\,\delta_t\,\nabla \hat{v}(S_t, \mathbf{w}_t),$$

where:

  • $\rho_t \doteq \dfrac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$ is the importance sampling ratio for action $A_t$.
  • $\delta_t = R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$ is the TD error.

Because the behavior policy $b$ may choose actions that $\pi$ would not (and vice versa), $\rho_t$ adjusts the update so that it is as if we were learning from $\pi$ while following $b$. The same semi-gradient idea applies here: the gradient is taken only with respect to $\hat{v}(S_t, \mathbf{w}_t)$, treating the bootstrapped target $\hat{v}(S_{t+1}, \mathbf{w}_t)$ as a constant.
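
To make the update concrete, here is a minimal sketch of one semi-gradient off-policy TD(0) step, assuming linear function approximation. The transition format and the placeholders `phi` (feature function), `pi`, and `b` (action-probability functions) are hypothetical and not part of the text above.

```python
import numpy as np

def semi_gradient_offpolicy_td0(w, transition, phi, pi, b, alpha=0.01, gamma=0.99):
    """One semi-gradient off-policy TD(0) update with linear features.

    w          : weight vector, v_hat(s, w) = w @ phi(s)
    transition : (s, a, r, s_next, done) observed while following b
    phi        : hypothetical feature function mapping a state to a NumPy vector
    pi, b      : hypothetical functions returning pi(a | s) and b(a | s)
    """
    s, a, r, s_next, done = transition

    rho = pi(a, s) / b(a, s)                       # importance sampling ratio
    v_s = w @ phi(s)
    v_next = 0.0 if done else w @ phi(s_next)      # bootstrap only if not terminal
    delta = r + gamma * v_next - v_s               # TD error

    # Semi-gradient: gradient of v_hat(S_t, w) only, which is phi(s) for linear features
    w += alpha * rho * delta * phi(s)
    return w
```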

Off-policy Control

Q-learning with Function Approximation

Q-learning is a classical off-policy control algorithm for finding an optimal policy. In the tabular case, the Q-learning update is:

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\,\Bigl[ R_{t+1} + \gamma\,\max_{a'} Q(S_{t+1}, a') - Q(S_t,A_t) \Bigr]$$

With function approximation, we represent $Q(s,a) \approx \hat{q}(s,a,\mathbf{w})$. The semi-gradient Q-learning update becomes:

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha\,\delta_t\,\nabla \hat{q}(S_t, A_t, \mathbf{w}_t),$$

where

$$\delta_t = R_{t+1} + \gamma \max_{a'}\,\hat{q}(S_{t+1}, a', \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t).$$

Note that Q-learning is inherently off-policy: it updates action values as though the greedy action will be taken in the next state, even if the behavior policy is more exploratory.
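
A minimal sketch of one semi-gradient Q-learning step with linear state-action features follows. The names `phi_sa` (feature function over state-action pairs), `actions`, and the transition format are illustrative assumptions, not definitions from the text.

```python
import numpy as np

def semi_gradient_q_learning_step(w, transition, phi_sa, actions, alpha=0.01, gamma=0.99):
    """One semi-gradient Q-learning update with linear features.

    w          : weight vector, q_hat(s, a, w) = w @ phi_sa(s, a)
    transition : (s, a, r, s_next, done) collected by any behavior policy
    phi_sa     : hypothetical feature function over (state, action) pairs
    actions    : iterable of the actions available in s_next
    """
    s, a, r, s_next, done = transition

    q_sa = w @ phi_sa(s, a)
    # Greedy (max) bootstrap target, regardless of what the behavior policy does next
    q_next = 0.0 if done else max(w @ phi_sa(s_next, a2) for a2 in actions)
    delta = r + gamma * q_next - q_sa              # TD error with the max target

    # Semi-gradient: only the current estimate's gradient, phi_sa(s, a)
    w += alpha * delta * phi_sa(s, a)
    return w
```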

Divergence Risk

Unlike tabular Q-learning (which converges under usual conditions), semi-gradient Q-learning can diverge in function-approximation settings if the behavior policy differs significantly from the greedy (target) policy. This is one manifestation of the “deadly triad.”

Off-policy Sarsa (Expected Sarsa)

An alternative off-policy control approach is semi-gradient Expected Sarsa, which replaces the $\max_{a'} \hat{q}$ target with an expectation under the target policy. Let $\pi$ be $\epsilon$-greedy with respect to $\hat{q}$. Then:

$$\delta_t = R_{t+1} + \gamma \sum_{a'} \pi(a' \mid S_{t+1})\,\hat{q}(S_{t+1}, a', \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t).$$
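
As an illustration, here is a sketch of how the Expected Sarsa bootstrap target could be computed under an $\epsilon$-greedy target policy, reusing the hypothetical `phi_sa` and `actions` placeholders from the Q-learning sketch above.

```python
import numpy as np

def expected_sarsa_target(w, r, s_next, done, phi_sa, actions, gamma=0.99, epsilon=0.1):
    """Expected Sarsa bootstrap target under an epsilon-greedy target policy.

    Replaces the max in Q-learning with the expectation of q_hat under pi(. | s_next).
    """
    if done:
        return r
    q_next = np.array([w @ phi_sa(s_next, a) for a in actions])
    # Epsilon-greedy probabilities with respect to q_hat
    probs = np.full(len(actions), epsilon / len(actions))
    probs[np.argmax(q_next)] += 1.0 - epsilon
    return r + gamma * probs @ q_next
```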

Emphatic and Gradient-TD Methods

When combining function approximation, bootstrapping, and off-policy data, the distribution mismatch can cause divergence. More advanced algorithms such as:

  • Gradient-TD (e.g., GTD2, TDC): perform true stochastic gradient descent on the projected Bellman error, using an auxiliary parameter vector to keep the updates stable (a sketch follows below).
  • Emphatic TD: reweight updates by a computed “emphasis” $M_t$ so that they effectively match an on-policy distribution.

are designed to guarantee convergence under broader conditions, albeit at the cost of higher complexity or variance.
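
To give a flavor of the Gradient-TD family, here is a minimal sketch of one TDC (TD with gradient correction) step for linear features. The transition format and the placeholders `phi`, `pi`, and `b` are assumptions carried over from the TD(0) sketch; this is an illustration of the auxiliary-vector idea, not a full treatment of the algorithm.

```python
import numpy as np

def tdc_step(w, v, transition, phi, pi, b, alpha=0.01, beta=0.05, gamma=0.99):
    """One off-policy TDC update with linear features.

    w : primary weight vector, v_hat(s, w) = w @ phi(s)
    v : auxiliary weight vector (a linear estimate of the importance-weighted TD error)
    """
    s, a, r, s_next, done = transition

    x = phi(s)
    x_next = np.zeros_like(x) if done else phi(s_next)
    rho = pi(a, s) / b(a, s)                        # importance sampling ratio
    delta = r + gamma * w @ x_next - w @ x          # TD error

    # Primary update: the semi-gradient term plus a correction built from v
    w += alpha * rho * (delta * x - gamma * (x @ v) * x_next)
    # Auxiliary update: regress the importance-weighted TD error onto the features
    v += beta * rho * (delta - x @ v) * x
    return w, v
```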

warning

To be included in a future iteration of this chapter:

  • The Deadly Triad
  • Projected Bellman Error (PBE)
  • Gradient-TD methods for minimizing the PBE
  • Emphatic-TD Methods
  • Variance minimization

Recap

1. What distinguishes off-policy learning from on-policy learning in RL?

2. Which factor often causes higher variance in off-policy methods?

3. What is one purpose of the importance sampling ratio ρ in off-policy algorithms?

What's Next

In the next chapter, we'll explore Policy Gradient and Actor-Critic methods, which offer a way to directly learn policies. These methods are particularly useful for handling continuous action spaces or environments with very complex value functions, where traditional value-based approaches may struggle.