
On-policy Control with Function Approximation

In the previous page, we explored on-policy prediction with function approximation, focusing on how to estimate value functions using Monte Carlo and TD methods. This page extends those ideas to the control problem, where the goal is to optimize the policy while estimating action-value functions $\hat{q}(s, a, \mathbf{w})$.

Episodic Semi-Gradient Sarsa

To adapt function approximation to the control setting, we need to extend value-function updates to action-value functions $\hat{q}(s, a, \mathbf{w})$. The update rule becomes:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t),$$

where the TD error $\delta_t$ is:

$$\delta_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t).$$
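To make the gradient concrete: for a linear approximator $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$, the gradient $\nabla \hat{q}$ is just the feature vector $\mathbf{x}(s, a)$, so the update is a single vector operation. Below is a minimal NumPy sketch of one such update; the `features(state, action)` helper is a hypothetical stand-in for whatever feature construction (e.g., tile coding) is used.

```python
import numpy as np

def semi_gradient_sarsa_update(w, features, s, a, r, s_next, a_next,
                               alpha, gamma, terminal):
    """One semi-gradient Sarsa update for a linear q-hat(s, a, w) = w . x(s, a)."""
    x = features(s, a)                 # for a linear q-hat, this is also the gradient
    target = r if terminal else r + gamma * (w @ features(s_next, a_next))
    delta = target - w @ x             # TD error
    return w + alpha * delta * x       # semi-gradient step
```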

This method, called episodic semi-gradient Sarsa, combines the following components:

  • Action-Dependent Features: Feature vectors now include both states and actions.
  • Policy Improvement: Policies are improved incrementally using $\epsilon$-greedy or other soft action-selection strategies.
Episodic Semi-gradient Sarsa

Input: A differentiable action-value function parameterization $\hat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \rightarrow \mathbb{R}$
Algorithm parameters: Step size $\alpha > 0$, small $\epsilon > 0$
Initialize: Value-function weights $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = 0$)

Loop for each episode:
    $S, A \gets$ initial state and action of the episode (e.g., $\epsilon$-greedy)

    Loop for each step of episode:
        Take action $A$, observe $R, S'$
        If $S'$ is terminal:
            $\mathbf{w} \gets \mathbf{w} + \alpha \left[ R - \hat{q}(S, A, \mathbf{w}) \right] \nabla \hat{q}(S, A, \mathbf{w})$
            Go to next episode

        Choose $A'$ as a function of $\hat{q}(S', \cdot, \mathbf{w})$ (e.g., $\epsilon$-greedy)
        $\mathbf{w} \gets \mathbf{w} + \alpha \left[ R + \gamma \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w}) \right] \nabla \hat{q}(S, A, \mathbf{w})$
        $S \gets S'$
        $A \gets A'$
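As a rough Python translation of the box above (a sketch under stated assumptions, not a tuned implementation): it assumes a hypothetical episodic environment exposing `env.reset()` and `env.step(action)` returning `(next_state, reward, terminal)`, a small discrete action set, and the same hypothetical linear `features(state, action)` helper as before.

```python
import numpy as np

def epsilon_greedy(w, features, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    q_values = [w @ features(state, a) for a in actions]
    return actions[int(np.argmax(q_values))]

def episodic_semi_gradient_sarsa(env, features, actions, num_features,
                                 alpha=0.1, gamma=1.0, epsilon=0.1,
                                 num_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(w, features, state, actions, epsilon, rng)
        while True:
            next_state, reward, terminal = env.step(action)
            x = features(state, action)            # gradient of the linear q-hat
            if terminal:
                w += alpha * (reward - w @ x) * x  # terminal update: no bootstrap term
                break
            next_action = epsilon_greedy(w, features, next_state, actions, epsilon, rng)
            target = reward + gamma * (w @ features(next_state, next_action))
            w += alpha * (target - w @ x) * x      # semi-gradient Sarsa update
            state, action = next_state, next_action
    return w
```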

Semi-Gradient n-step Sarsa

The episodic Sarsa update can be extended to use n-step returns for bootstrapping, allowing the agent to balance short-term and long-term rewards:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}),$$
$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}).$$

By integrating rewards across multiple steps before bootstrapping, n-step methods often speed up learning and improve stability in episodic tasks: they have lower bias than one-step updates and lower variance than full Monte Carlo returns.

Episodic Semi-gradient n-step Sarsa

Input: A differentiable action-value function parameterization $\hat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \rightarrow \mathbb{R}$
Input: A policy $\pi$ (if estimating $q_\pi$)

Algorithm parameters: Step size $\alpha > 0$, small $\epsilon > 0$, a positive integer $n$

Initialize: Value-function weights $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = 0$)
All store and access operations ($S_t$, $A_t$, $R_t$) can take their index mod $n+1$

Loop for each episode:
    Initialize and store $S_0 \neq$ terminal

    Select and store an action $A_0 \sim \pi(\cdot \mid S_0)$ or $\epsilon$-greedy wrt $\hat{q}(S_0, \cdot, \mathbf{w})$

    $T \gets \infty$

    Loop for $t = 0, 1, 2, \dots$:
        If $t < T$, then:
            Take action $A_t$
            Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$

            If $S_{t+1}$ is terminal, then:
                $T \gets t + 1$
            Else:
                Select and store $A_{t+1} \sim \pi(\cdot \mid S_{t+1})$ or $\epsilon$-greedy wrt $\hat{q}(S_{t+1}, \cdot, \mathbf{w})$

        $\tau \gets t - n + 1$ (time whose estimate is being updated)

        If $\tau \geq 0$:
            $G \gets \sum_{i=\tau+1}^{\min(\tau+n, T)} \gamma^{i-\tau-1} R_i$
            If $\tau + n < T$, then $G \gets G + \gamma^n \hat{q}(S_{\tau+n}, A_{\tau+n}, \mathbf{w})$
            $\mathbf{w} \gets \mathbf{w} + \alpha \left[ G - \hat{q}(S_\tau, A_\tau, \mathbf{w}) \right] \nabla \hat{q}(S_\tau, A_\tau, \mathbf{w})$

    Until $\tau = T - 1$
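A sketch of the same loop in Python, reusing the hypothetical `env`/`features` interface and the `epsilon_greedy` helper from the earlier sketch. For brevity it stores plain per-episode lists instead of the mod $n+1$ circular buffer from the pseudocode; the arithmetic is otherwise the same.

```python
import numpy as np

def n_step_sarsa_episode(env, features, actions, w, n=4,
                         alpha=0.1, gamma=1.0, epsilon=0.1, rng=None):
    """Run one episode of episodic semi-gradient n-step Sarsa, updating w in place."""
    rng = rng or np.random.default_rng()
    states, acts, rewards = [env.reset()], [], [0.0]   # rewards[0] unused, keeps indices aligned
    acts.append(epsilon_greedy(w, features, states[0], actions, epsilon, rng))
    T, t = np.inf, 0
    while True:
        if t < T:
            s_next, r, terminal = env.step(acts[t])
            states.append(s_next); rewards.append(r)
            if terminal:
                T = t + 1
            else:
                acts.append(epsilon_greedy(w, features, s_next, actions, epsilon, rng))
        tau = t - n + 1                                 # time whose estimate is updated
        if tau >= 0:
            G = sum(gamma ** (i - tau - 1) * rewards[i]
                    for i in range(tau + 1, int(min(tau + n, T)) + 1))
            if tau + n < T:                             # bootstrap from q-hat if episode not over
                G += gamma ** n * (w @ features(states[tau + n], acts[tau + n]))
            x = features(states[tau], acts[tau])
            w += alpha * (G - w @ x) * x
        if tau == T - 1:
            return w
        t += 1
```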

Average Reward

In continuing tasks (no terminal states), discounted rewards are often problematic with function approximation. Instead, the average reward objective provides an alternative formulation:

Average Reward Definition

Unlike discounted return formulations, the average reward treats immediate and delayed rewards equally. The policy's performance in the average reward setting is defined as the average reward per time step, calculated as:

$$\begin{aligned} r(\pi) &\doteq \lim_{H \to \infty} \frac{1}{H} \sum_{t=1}^{H} \mathbb{E}\left[ R_t \,\middle|\, S_0, A_{0:t-1} \sim \pi \right] \\ &= \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\, r \end{aligned}$$

This formulation provides a steady-state measure of the agent's performance under policy $\pi$, where:

  • $\mu_\pi(s)$ is the steady-state distribution of states under policy $\pi$. It represents the long-run proportion of time the agent spends in each state.
  • $\pi(a \mid s)$ is the probability of selecting action $a$ in state $s$ under policy $\pi$.
  • $p(s', r \mid s, a)$ is the environment's probability of transitioning to state $s'$ with reward $r$, given action $a$ in state $s$.

The existence of the steady-state distribution $\mu_\pi(s)$ requires the ergodicity of the Markov Decision Process (MDP). An MDP is ergodic if:

  1. Irreducibility: It is possible to reach any state from any other state, eventually, under any stationary policy $\pi$.
  2. Aperiodicity: The transitions between states do not follow a fixed periodic cycle.

Under these conditions, $\mu_\pi(s)$ is well-defined and satisfies the steady-state condition:

$$\mu_\pi(s') = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a)$$

This condition means that if the state distribution starts at $\mu_\pi$, it remains $\mu_\pi$ at every subsequent time step. Ergodicity guarantees that the effect of the initial state fades over time, allowing for a stable long-term performance metric.
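These quantities are easy to check numerically on a small example. The sketch below uses an invented two-state, two-action MDP and fixed policy (all numbers are made up for illustration): it builds the induced state-transition matrix $P_\pi$, recovers $\mu_\pi$ as its stationary distribution, evaluates $r(\pi)$ with the formula above, and verifies the steady-state condition.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: p[s, a, s2] = P(s2 | s, a), rew[s, a, s2] = expected reward.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
rew = np.array([[[1.0, 0.0], [0.0, 2.0]],
                [[0.5, 0.5], [0.0, 3.0]]])
pi = np.array([[0.7, 0.3],      # pi[s, a] = probability of action a in state s
               [0.4, 0.6]])

# State-to-state transition matrix under pi: P_pi[s, s2] = sum_a pi(a|s) p(s2|s, a)
P_pi = np.einsum('sa,sat->st', pi, p)

# Stationary distribution mu_pi: left eigenvector of P_pi for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu = mu / mu.sum()

# Average reward: r(pi) = sum_s mu(s) sum_a pi(a|s) sum_s2 p(s2|s,a) rew(s,a,s2)
r_pi = np.einsum('s,sa,sat,sat->', mu, pi, p, rew)

print("mu_pi:", mu, "  r(pi):", r_pi)
print("steady-state check:", np.allclose(mu @ P_pi, mu))
```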

Differential Return

In the average reward setting, the differential return measures the relative reward accumulated over time compared to the average reward. It is defined as:

$$G_t \doteq \sum_{k=0}^{\infty} \left( R_{t+k+1} - r(\pi) \right),$$

where:

  • $R_{t+k+1}$ is the reward received $k+1$ steps after time $t$ (so the sum starts with $R_{t+1}$).
  • $r(\pi)$ is the average reward under policy $\pi$.

The differential return removes the bias introduced by the constant average reward $r(\pi)$, focusing on deviations from this steady-state baseline.
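To make the indexing concrete (the sum starts at $R_{t+1}$, one step after $t$), here is a trivial snippet that evaluates a truncated version of the differential return for an invented reward sequence with an assumed known $r(\pi)$; it only illustrates the definition, not an estimator you would use in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
r_pi = 0.5                                      # assumed known average reward (invented)
rewards = rng.binomial(1, r_pi, size=10_000)    # invented reward stream with mean r_pi;
                                                # rewards[i] plays the role of R_i, R_0 unused

def truncated_differential_return(rewards, t, r_pi, K):
    """Sum_{k=0}^{K-1} (R_{t+k+1} - r_pi): reward deviations starting one step after t."""
    return float(np.sum(rewards[t + 1 : t + 1 + K] - r_pi))

print(truncated_differential_return(rewards, t=0, r_pi=r_pi, K=100))
```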

Differential Value Function

The differential value function extends the concept of the differential return to quantify the relative desirability of states or state-action pairs under the average reward setting:

  1. State Value Function:
    The differential state value function is defined as the expected differential return when starting in state $s$ under policy $\pi$:

    $$v_\pi(s) \doteq \mathbb{E}_\pi \left[ G_t \mid S_t = s \right].$$

    This represents the long-term relative value of being in state $s$ compared to the average reward.

  2. Action-Value Function:
    The differential action-value function quantifies the expected differential return starting from state $s$ and taking action $a$:

    $$q_\pi(s, a) \doteq \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right].$$

    This function evaluates the relative value of taking action $a$ in state $s$ under policy $\pi$.

Bellman Equations for Differential Value Functions

The differential value functions satisfy modified Bellman equations. For a given policy $\pi$, the equations are:

  1. State Value Function:

    $$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \left[ r - r(\pi) + v_\pi(s') \right].$$
  2. Action-Value Function:

    $$q_\pi(s, a) = \sum_{s',r} p(s', r \mid s, a) \left[ r - r(\pi) + \sum_{a'} \pi(a' \mid s') q_\pi(s', a') \right].$$
  3. Optimal Differential Value Functions:
    For optimal policies, the equations are:

    $$v_*(s) = \max_a \sum_{s',r} p(s', r \mid s, a) \left[ r - \max_\pi r(\pi) + v_*(s') \right],$$
    $$q_*(s, a) = \sum_{s',r} p(s', r \mid s, a) \left[ r - \max_\pi r(\pi) + \max_{a'} q_*(s', a') \right].$$
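For a fixed policy, the first equation together with the fact that $v_\pi$ is only defined up to an additive constant gives a small linear system in $(v_\pi, r(\pi))$: pin one state's value to zero and solve. The sketch below does this for the same invented two-state MDP used in the average-reward example and checks the result against the Bellman equation; again, all numbers are made up.

```python
import numpy as np

# Same invented 2-state, 2-action MDP and policy as in the average-reward sketch above.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
rew = np.array([[[1.0, 0.0], [0.0, 2.0]],
                [[0.5, 0.5], [0.0, 3.0]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

P_pi = np.einsum('sa,sat->st', pi, p)          # P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
r_s = np.einsum('sa,sat,sat->s', pi, p, rew)   # expected one-step reward in each state

# Unknowns x = [v(0), v(1), r(pi)].  For each s:  v(s) - sum_s' P_pi[s, s'] v(s') + r = r_s[s].
# v is only determined up to a constant, so add the normalization v(0) = 0 as a third equation.
n = P_pi.shape[0]
A = np.zeros((n + 1, n + 1))
A[:n, :n] = np.eye(n) - P_pi
A[:n, n] = 1.0
A[n, 0] = 1.0
b = np.concatenate([r_s, [0.0]])
x = np.linalg.solve(A, b)
v, r_pi = x[:n], x[n]

# Check the differential Bellman equation:  v = r_s - r(pi) + P_pi v
print("v_pi:", v, "  r(pi):", r_pi)
print("Bellman check:", np.allclose(v, r_s - r_pi + P_pi @ v))
```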

Recap

1. What is the primary goal of on-policy control with function approximation?

2. How does episodic semi-gradient Sarsa update the value function weights?

3. What advantage does n-step Sarsa offer over one-step Sarsa?

4. What is the main motivation for using the average reward objective in continuing tasks?

5. What does the differential return represent in the average reward setting?

What's Next

In the next page, we’ll explore off-policy control methods, including Q-learning, with function approximation.