Off-policy Methods with Function Approximation
In the previous pages, we discussed on-policy prediction and control methods with function approximation. We focused on semi-gradient TD (for value functions) and semi-gradient Sarsa (for action-value functions), where the policy generating the data was also the policy being evaluated and improved.
This page explores the off-policy setting, where the policy generating the data (the behavior policy) may differ from the target policy whose value function (or action-value function) we wish to learn. Off-policy learning can be powerful: it enables learning from demonstrations or replay buffers, and it allows learning about multiple target policies in parallel. However, the combination of function approximation, bootstrapping, and off-policy data, known as the “deadly triad,” introduces a risk of instability.
We adapt well-known tabular off-policy algorithms (e.g., Q-learning, off-policy Sarsa) to the function-approximation setting. We will also see how importance sampling ratios appear in many off-policy algorithms to correct for discrepancies between the target and behavior policies.
Off-policy Prediction
Semi-Gradient Off-policy TD(0)
The goal of this method is to evaluate a target policy $\pi$ for state-value prediction, given experience generated by a behavior policy $b$. We have an approximate value function $\hat{v}(s, \mathbf{w})$, parameterized by the weight vector $\mathbf{w}$, and we observe transitions $(S_t, A_t, R_{t+1}, S_{t+1})$ generated by following $b$. The off-policy TD(0) update is:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \, \rho_t \, \delta_t \, \nabla \hat{v}(S_t, \mathbf{w}_t)$$
where:
- $\rho_t = \dfrac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$ is the importance sampling ratio for the action $A_t$.
- $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$ is the TD error.
Because the behavior policy $b$ may choose actions that $\pi$ would not (and vice versa), $\rho_t$ appears in the update to adjust it so that it is as if we were learning about $\pi$ while following $b$. The same semi-gradient idea applies here: the gradient is taken only with respect to the current estimate $\hat{v}(S_t, \mathbf{w}_t)$, treating the bootstrapped target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$ as a constant.
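As a concrete illustration, here is a minimal sketch of this update for a linear value function $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$. The function and argument names are ours, chosen for illustration only.

```python
import numpy as np

def off_policy_td0_step(w, x_t, x_tp1, r_tp1, rho_t, alpha, gamma, terminal):
    """One semi-gradient off-policy TD(0) update with linear features.

    w        -- weight vector; v_hat(s, w) = w @ x(s)
    x_t      -- feature vector of S_t
    x_tp1    -- feature vector of S_{t+1}
    r_tp1    -- reward R_{t+1}
    rho_t    -- importance sampling ratio pi(A_t|S_t) / b(A_t|S_t)
    terminal -- True if S_{t+1} is terminal (its bootstrap value is then 0)
    """
    v_t = w @ x_t
    v_tp1 = 0.0 if terminal else w @ x_tp1
    delta = r_tp1 + gamma * v_tp1 - v_t            # TD error
    # Semi-gradient: only the gradient of v_hat(S_t, w) appears,
    # which is simply x_t for linear features.
    return w + alpha * rho_t * delta * x_t
```

With one-hot (tabular) features this reduces to the familiar tabular off-policy TD(0) update.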
Off-policy Control
Q-learning with Function Approximation
Q-learning is a classical off-policy control algorithm for finding an optimal policy. In the tabular case, the Q-learning update is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \Big]$$
With function approximation, we represent the action-value function as $\hat{q}(s, a, \mathbf{w})$. The semi-gradient Q-learning update becomes:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \, \delta_t \, \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$

where $\delta_t = R_{t+1} + \gamma \max_a \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$.
Note that Q-learning is inherently off-policy: it updates action values as though the greedy action will be taken at the next step, even if the behavior policy is more exploratory.
Unlike tabular Q-learning (which converges under the usual step-size and exploration conditions), semi-gradient Q-learning can diverge in function-approximation settings if the behavior policy differs significantly from the greedy (target) policy. This is one manifestation of the “deadly triad.”
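To make the update concrete, here is a minimal sketch for a linear action-value function $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$; as before, the names are illustrative rather than taken from any particular library.

```python
import numpy as np

def q_learning_step(w, x_sa, x_next_all, r, alpha, gamma, terminal):
    """One semi-gradient Q-learning update with linear state-action features.

    w          -- weight vector; q_hat(s, a, w) = w @ x(s, a)
    x_sa       -- feature vector of the visited pair (S_t, A_t)
    x_next_all -- matrix whose rows are x(S_{t+1}, a) for every action a
    terminal   -- True if S_{t+1} is terminal (the max term is then 0)
    """
    q_sa = w @ x_sa
    q_next_max = 0.0 if terminal else np.max(x_next_all @ w)  # max_a q_hat(S_{t+1}, a, w)
    delta = r + gamma * q_next_max - q_sa                     # TD error with greedy target
    # No importance sampling ratio is needed in the one-step update:
    # the target already uses the greedy action rather than the behavior action.
    return w + alpha * delta * x_sa
```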
Off-policy Sarsa (Expected Sarsa)
An alternative off-policy control approach is semi-gradient Expected Sarsa, which replaces the sampled next action value in the target with an expectation under the target policy. Let $\pi$ be $\varepsilon$-greedy with respect to $\hat{q}$. Then:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \Big[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) \, \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \Big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$
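A minimal sketch of this update, again with linear state-action features and an $\varepsilon$-greedy target policy (names are illustrative):

```python
import numpy as np

def expected_sarsa_step(w, x_sa, x_next_all, r, alpha, gamma, epsilon, terminal):
    """One semi-gradient Expected Sarsa update; the target policy is
    epsilon-greedy with respect to the current q_hat."""
    q_sa = w @ x_sa
    if terminal:
        expected_q = 0.0
    else:
        q_next = x_next_all @ w                 # q_hat(S_{t+1}, a, w) for each action a
        n = len(q_next)
        probs = np.full(n, epsilon / n)         # epsilon-greedy target probabilities
        probs[np.argmax(q_next)] += 1.0 - epsilon
        expected_q = probs @ q_next             # sum_a pi(a|S_{t+1}) q_hat(S_{t+1}, a, w)
    delta = r + gamma * expected_q - q_sa
    return w + alpha * delta * x_sa
```

With $\varepsilon = 0$ the expectation collapses to the maximum and the update reduces to semi-gradient Q-learning.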
Emphatic and Gradient-TD Methods
When combining function approximation, bootstrapping, and off-policy data, the distribution mismatch can cause divergence. More advanced algorithms such as:
- Gradient-TD (e.g., GTD2, TDC): Perform true stochastic gradient descent on the projected Bellman error, requiring an auxiliary parameter vector to keep the updates stable (a minimal sketch of one such update follows this list).
- Emphatic TD: Reweight transitions by a computed “emphasis” (accumulated along the trajectory) so that the updates effectively match an on-policy-like distribution.
are designed to guarantee convergence under broader conditions, albeit at the cost of higher complexity or variance.
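As a rough sketch of the Gradient-TD idea, here is the linear TDC update with importance sampling. The variable names are ours; $\alpha$ is the step size for the primary weights and $\beta$ the step size for the auxiliary weights.

```python
import numpy as np

def tdc_step(w, v, x_t, x_tp1, r_tp1, rho_t, alpha, beta, gamma):
    """One linear TDC (gradient-TD) update with importance sampling.

    w -- primary weights; v_hat(s, w) = w @ x(s)
    v -- auxiliary weights approximating the expected TD error as a
         linear function of the features
    """
    delta = r_tp1 + gamma * (w @ x_tp1) - (w @ x_t)     # TD error
    # Primary update: the usual TD step plus a gradient-correction term
    # that counteracts the bias introduced by off-policy bootstrapping.
    w_new = w + alpha * rho_t * (delta * x_t - gamma * x_tp1 * (x_t @ v))
    # Auxiliary update: tracks the expected TD error given the features.
    v_new = v + beta * rho_t * (delta - x_t @ v) * x_t
    return w_new, v_new
```

The auxiliary vector is what distinguishes these methods from plain semi-gradient TD: it estimates a quantity that the correction term needs but that cannot be sampled directly from a single transition.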
To be included in a future iteration of this chapter:
- The Deadly Triad
- Projected Bellman Error (PBE)
- Gradient-TD methods for minimizing the PBE
- Emphatic-TD Methods
- Variance minimization
Recap
1. What distinguishes off-policy learning from on-policy learning in RL?
2. Which factor often causes higher variance in off-policy methods?
3. What is one purpose of the importance sampling ratio ρ in off-policy algorithms?
What's Next
In the next chapter, we'll explore Policy Gradient and Actor-Critic methods, which offer a way to directly learn policies. These methods are particularly useful for handling continuous action spaces or environments with very complex value functions, where traditional value-based approaches may struggle.