
On-policy Control with Function Approximation

In the previous page, we explored on-policy prediction with function approximation, focusing on how to estimate value functions using Monte Carlo and TD methods. This page extends those ideas to the control problem, where the goal is to optimize the policy while estimating action-value functions $\hat{q}(s, a, \mathbf{w})$.

Episodic Semi-Gradient Sarsa

To adapt function approximation to the control setting, we need to extend value-function updates to action-value functions $\hat{q}(s, a, \mathbf{w})$. The update rule becomes:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t),$$

where the TD error $\delta_t$ is:

$$\delta_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t).$$
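To make the gradient concrete: for a linear approximator $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$, the gradient $\nabla \hat{q}$ is just the feature vector $\mathbf{x}(s, a)$, so the update is a single vector operation. Below is a minimal NumPy sketch of one such update; the `features(state, action)` helper is a hypothetical stand-in for whatever feature construction (e.g., tile coding) is used.

```python
import numpy as np

def semi_gradient_sarsa_update(w, features, s, a, r, s_next, a_next,
                               alpha, gamma, terminal):
    """One semi-gradient Sarsa update for a linear q-hat(s, a, w) = w . x(s, a)."""
    x = features(s, a)                 # for a linear q-hat, this is also the gradient
    target = r if terminal else r + gamma * (w @ features(s_next, a_next))
    delta = target - w @ x             # TD error
    return w + alpha * delta * x       # semi-gradient step
```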

This method, called episodic semi-gradient Sarsa, combines the following components:

  • Action-Dependent Features: Feature vectors now include both states and actions.
  • Policy Improvement: Policies are improved incrementally using $\epsilon$-greedy or other soft action-selection strategies.
Episodic Semi-gradient Sarsa

Input: A differentiable action-value function parameterization $\hat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \rightarrow \mathbb{R}$
Algorithm parameters: Step size $\alpha > 0$, small $\epsilon > 0$
Initialize: Value-function weights $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = 0$)

Loop for each episode:
    $S, A \gets$ initial state and action of the episode (e.g., $\epsilon$-greedy)

    Loop for each step of episode:
        Take action $A$, observe $R, S'$
        If $S'$ is terminal:
            $\mathbf{w} \gets \mathbf{w} + \alpha \left[ R - \hat{q}(S, A, \mathbf{w}) \right] \nabla \hat{q}(S, A, \mathbf{w})$
            Go to next episode

        Choose $A'$ as a function of $\hat{q}(S', \cdot, \mathbf{w})$ (e.g., $\epsilon$-greedy)
        $\mathbf{w} \gets \mathbf{w} + \alpha \left[ R + \gamma \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w}) \right] \nabla \hat{q}(S, A, \mathbf{w})$
        $S \gets S'$
        $A \gets A'$
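As a rough Python translation of the box above (a sketch under stated assumptions, not a tuned implementation): it assumes a hypothetical episodic environment exposing `env.reset()` and `env.step(action)` returning `(next_state, reward, terminal)`, a small discrete action set, and the same hypothetical linear `features(state, action)` helper as before.

```python
import numpy as np

def epsilon_greedy(w, features, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    q_values = [w @ features(state, a) for a in actions]
    return actions[int(np.argmax(q_values))]

def episodic_semi_gradient_sarsa(env, features, actions, num_features,
                                 alpha=0.1, gamma=1.0, epsilon=0.1,
                                 num_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(w, features, state, actions, epsilon, rng)
        while True:
            next_state, reward, terminal = env.step(action)
            x = features(state, action)            # gradient of the linear q-hat
            if terminal:
                w += alpha * (reward - w @ x) * x  # terminal update: no bootstrap term
                break
            next_action = epsilon_greedy(w, features, next_state, actions, epsilon, rng)
            target = reward + gamma * (w @ features(next_state, next_action))
            w += alpha * (target - w @ x) * x      # semi-gradient Sarsa update
            state, action = next_state, next_action
    return w
```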

Semi-Gradient n-step Sarsa

The episodic Sarsa update can be extended to use n-step returns for bootstrapping, allowing the agent to balance short-term and long-term rewards:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}),$$
$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}).$$

By integrating rewards across multiple steps before bootstrapping, n-step methods often speed up learning and improve stability in episodic tasks: they have lower bias than one-step updates and lower variance than full Monte Carlo returns.

Episodic Semi-gradient n-step Sarsa

Input: A differentiable action-value function parameterization $\hat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \rightarrow \mathbb{R}$
Input: A policy $\pi$ (if estimating $q_\pi$)

Algorithm parameters: Step size $\alpha > 0$, small $\epsilon > 0$, a positive integer $n$

Initialize: Value-function weights $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = 0$)
All store and access operations ($S_t$, $A_t$, $R_t$) can take their index mod $n+1$

Loop for each episode:
    Initialize and store $S_0 \neq$ terminal

    Select and store an action $A_0 \sim \pi(\cdot \mid S_0)$ or $\epsilon$-greedy wrt $\hat{q}(S_0, \cdot, \mathbf{w})$

    $T \gets \infty$

    Loop for $t = 0, 1, 2, \dots$:
        If $t < T$, then:
            Take action $A_t$
            Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$

            If $S_{t+1}$ is terminal, then:
                $T \gets t + 1$
            Else:
                Select and store $A_{t+1} \sim \pi(\cdot \mid S_{t+1})$ or $\epsilon$-greedy wrt $\hat{q}(S_{t+1}, \cdot, \mathbf{w})$

        $\tau \gets t - n + 1$ (time whose estimate is being updated)

        If $\tau \geq 0$:
            $G \gets \sum_{i=\tau+1}^{\min(\tau+n, T)} \gamma^{i-\tau-1} R_i$
            If $\tau + n < T$, then $G \gets G + \gamma^n \hat{q}(S_{\tau+n}, A_{\tau+n}, \mathbf{w})$
            $\mathbf{w} \gets \mathbf{w} + \alpha \left[ G - \hat{q}(S_\tau, A_\tau, \mathbf{w}) \right] \nabla \hat{q}(S_\tau, A_\tau, \mathbf{w})$

    Until $\tau = T - 1$
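A sketch of the same loop in Python, reusing the hypothetical `env`/`features` interface and the `epsilon_greedy` helper from the earlier sketch. For brevity it stores plain per-episode lists instead of the mod $n+1$ circular buffer from the pseudocode; the arithmetic is otherwise the same.

```python
import numpy as np

def n_step_sarsa_episode(env, features, actions, w, n=4,
                         alpha=0.1, gamma=1.0, epsilon=0.1, rng=None):
    """Run one episode of episodic semi-gradient n-step Sarsa, updating w in place."""
    rng = rng or np.random.default_rng()
    states, acts, rewards = [env.reset()], [], [0.0]   # rewards[0] unused, keeps indices aligned
    acts.append(epsilon_greedy(w, features, states[0], actions, epsilon, rng))
    T, t = np.inf, 0
    while True:
        if t < T:
            s_next, r, terminal = env.step(acts[t])
            states.append(s_next); rewards.append(r)
            if terminal:
                T = t + 1
            else:
                acts.append(epsilon_greedy(w, features, s_next, actions, epsilon, rng))
        tau = t - n + 1                                 # time whose estimate is updated
        if tau >= 0:
            G = sum(gamma ** (i - tau - 1) * rewards[i]
                    for i in range(tau + 1, int(min(tau + n, T)) + 1))
            if tau + n < T:                             # bootstrap from q-hat if episode not over
                G += gamma ** n * (w @ features(states[tau + n], acts[tau + n]))
            x = features(states[tau], acts[tau])
            w += alpha * (G - w @ x) * x
        if tau == T - 1:
            return w
        t += 1
```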

Average Reward

In continuing tasks (no terminal states), discounted rewards are often problematic with function approximation. Instead, the average reward objective provides an alternative formulation:

Average Reward Definition

Unlike discounted return formulations, the average reward treats immediate and delayed rewards equally. The policy's performance in the average reward setting is defined as the average reward per time step, calculated as:

$$\begin{aligned} r(\pi) &\doteq \lim_{H \to \infty} \frac{1}{H} \sum_{t=1}^{H} \mathbb{E}\left[ R_t \,\middle|\, S_0, A_{0:t-1} \sim \pi \right] \\ &= \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\, r \end{aligned}$$

This formulation provides a steady-state measure of the agent's performance under policy $\pi$, where:

  • $\mu_\pi(s)$ is the steady-state distribution of states under policy $\pi$. It represents the long-run proportion of time the agent spends in each state.
  • $\pi(a \mid s)$ is the probability of selecting action $a$ in state $s$ under policy $\pi$.
  • $p(s', r \mid s, a)$ is the environment's probability of transitioning to state $s'$ with reward $r$, given action $a$ in state $s$.

The existence of the steady-state distribution $\mu_\pi(s)$ requires the ergodicity of the Markov Decision Process (MDP). An MDP is ergodic if:

  1. Irreducibility: It is possible to reach any state from any other state, eventually, under any stationary policy $\pi$.
  2. Aperiodicity: The transitions between states do not follow a fixed periodic cycle.

Under these conditions, $\mu_\pi(s)$ is well-defined and satisfies the steady-state condition:

$$\mu_\pi(s') = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a)$$

This condition means that if the state distribution starts at $\mu_\pi$, it remains $\mu_\pi$ at every subsequent time step. Ergodicity guarantees that the effect of the initial state fades over time, allowing for a stable long-term performance metric.
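These quantities are easy to check numerically on a small example. The sketch below uses an invented two-state, two-action MDP and fixed policy (all numbers are made up for illustration): it builds the induced state-transition matrix $P_\pi$, recovers $\mu_\pi$ as its stationary distribution, evaluates $r(\pi)$ with the formula above, and verifies the steady-state condition.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: p[s, a, s2] = P(s2 | s, a), rew[s, a, s2] = expected reward.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
rew = np.array([[[1.0, 0.0], [0.0, 2.0]],
                [[0.5, 0.5], [0.0, 3.0]]])
pi = np.array([[0.7, 0.3],      # pi[s, a] = probability of action a in state s
               [0.4, 0.6]])

# State-to-state transition matrix under pi: P_pi[s, s2] = sum_a pi(a|s) p(s2|s, a)
P_pi = np.einsum('sa,sat->st', pi, p)

# Stationary distribution mu_pi: left eigenvector of P_pi for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu = mu / mu.sum()

# Average reward: r(pi) = sum_s mu(s) sum_a pi(a|s) sum_s2 p(s2|s,a) rew(s,a,s2)
r_pi = np.einsum('s,sa,sat,sat->', mu, pi, p, rew)

print("mu_pi:", mu, "  r(pi):", r_pi)
print("steady-state check:", np.allclose(mu @ P_pi, mu))
```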

Differential Return

In the average reward setting, the differential return measures the relative reward accumulated over time compared to the average reward. It is defined as:

$$G_t \doteq \sum_{k=0}^{\infty} \left( R_{t+k+1} - r(\pi) \right),$$

where:

  • $R_{t+k+1}$ is the reward received $k+1$ steps after time $t$ (so the sum starts with $R_{t+1}$).
  • $r(\pi)$ is the average reward under policy $\pi$.

The differential return removes the bias introduced by the constant average reward $r(\pi)$, focusing on deviations from this steady-state baseline.
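To make the indexing concrete (the sum starts at $R_{t+1}$, one step after $t$), here is a trivial snippet that evaluates a truncated version of the differential return for an invented reward sequence with an assumed known $r(\pi)$; it only illustrates the definition, not an estimator you would use in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
r_pi = 0.5                                      # assumed known average reward (invented)
rewards = rng.binomial(1, r_pi, size=10_000)    # invented reward stream with mean r_pi;
                                                # rewards[i] plays the role of R_i, R_0 unused

def truncated_differential_return(rewards, t, r_pi, K):
    """Sum_{k=0}^{K-1} (R_{t+k+1} - r_pi): reward deviations starting one step after t."""
    return float(np.sum(rewards[t + 1 : t + 1 + K] - r_pi))

print(truncated_differential_return(rewards, t=0, r_pi=r_pi, K=100))
```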

Differential Value Function

The differential value function extends the concept of the differential return to quantify the relative desirability of states or state-action pairs under the average reward setting:

  1. State Value Function:
    The differential state value function is defined as the expected differential return when starting in state $s$ under policy $\pi$:

    $$v_\pi(s) \doteq \mathbb{E}_\pi \left[ G_t \mid S_t = s \right].$$

    This represents the long-term relative value of being in state $s$ compared to the average reward.

  2. Action-Value Function:
    The differential action-value function quantifies the expected differential return starting from state $s$ and taking action $a$:

    $$q_\pi(s, a) \doteq \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right].$$

    This function evaluates the relative value of taking action $a$ in state $s$ under policy $\pi$.

Bellman Equations for Differential Value Functions

The differential value functions satisfy modified Bellman equations. For a given policy $\pi$, the equations are:

  1. State Value Function:

    $$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \left[ r - r(\pi) + v_\pi(s') \right].$$
  2. Action-Value Function:

    $$q_\pi(s, a) = \sum_{s',r} p(s', r \mid s, a) \left[ r - r(\pi) + \sum_{a'} \pi(a' \mid s') q_\pi(s', a') \right].$$
  3. Optimal Differential Value Functions:
    For optimal policies, the equations are:

    $$v_*(s) = \max_a \sum_{s',r} p(s', r \mid s, a) \left[ r - \max_\pi r(\pi) + v_*(s') \right],$$
    $$q_*(s, a) = \sum_{s',r} p(s', r \mid s, a) \left[ r - \max_\pi r(\pi) + \max_{a'} q_*(s', a') \right].$$
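For a fixed policy, the first equation together with the fact that $v_\pi$ is only defined up to an additive constant gives a small linear system in $(v_\pi, r(\pi))$: pin one state's value to zero and solve. The sketch below does this for the same invented two-state MDP used in the average-reward example and checks the result against the Bellman equation; again, all numbers are made up.

```python
import numpy as np

# Same invented 2-state, 2-action MDP and policy as in the average-reward sketch above.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
rew = np.array([[[1.0, 0.0], [0.0, 2.0]],
                [[0.5, 0.5], [0.0, 3.0]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

P_pi = np.einsum('sa,sat->st', pi, p)          # P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
r_s = np.einsum('sa,sat,sat->s', pi, p, rew)   # expected one-step reward in each state

# Unknowns x = [v(0), v(1), r(pi)].  For each s:  v(s) - sum_s' P_pi[s, s'] v(s') + r = r_s[s].
# v is only determined up to a constant, so add the normalization v(0) = 0 as a third equation.
n = P_pi.shape[0]
A = np.zeros((n + 1, n + 1))
A[:n, :n] = np.eye(n) - P_pi
A[:n, n] = 1.0
A[n, 0] = 1.0
b = np.concatenate([r_s, [0.0]])
x = np.linalg.solve(A, b)
v, r_pi = x[:n], x[n]

# Check the differential Bellman equation:  v = r_s - r(pi) + P_pi v
print("v_pi:", v, "  r(pi):", r_pi)
print("Bellman check:", np.allclose(v, r_s - r_pi + P_pi @ v))
```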

Recap

1. What is the primary goal of on-policy control with function approximation?

2. How does episodic semi-gradient Sarsa update the value function weights?

3. What advantage does n-step Sarsa offer over one-step Sarsa?

4. What is the main motivation for using the average reward objective in continuing tasks?

5. What does the differential return represent in the average reward setting?

What's Next

In the next page, we’ll explore off-policy control methods, including Q-learning, with function approximation.