On-policy Control with Function Approximation
In the previous page, we explored on-policy prediction with function approximation, focusing on how to estimate value functions using Monte Carlo and TD methods. This page extends those ideas to the control problem, where the goal is to optimize the policy while estimating action-value functions $\hat{q}(s, a, \mathbf{w}) \approx q_*(s, a)$.
Episodic Semi-Gradient Sarsa
To adapt function approximation to the control setting, we need to extend the value-function updates to action-value functions $\hat{q}(s, a, \mathbf{w})$. The update rule becomes:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \, \delta_t \, \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$

where the TD error is:

$$\delta_t \doteq R_{t+1} + \gamma \, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$$
This method, called episodic semi-gradient Sarsa, combines the following components:
- Action-Dependent Features: Feature vectors $\mathbf{x}(s, a)$ now depend on both the state and the action (a minimal construction is sketched after this list).
- Policy Improvement: Policies are improved incrementally using $\varepsilon$-greedy or other soft action-selection strategies.
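As a concrete illustration of action-dependent features, the sketch below builds one-hot feature vectors over (state, action) pairs for a small discrete problem, so that a linear $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$ has gradient $\nabla \hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)$. The state and action counts are hypothetical, and this is only one possible construction (tile coding is common in practice).

```python
import numpy as np

N_STATES, N_ACTIONS = 6, 3          # hypothetical sizes for a small discrete task

def x(s, a):
    """One-hot feature vector over (state, action) pairs."""
    phi = np.zeros(N_STATES * N_ACTIONS)
    phi[s * N_ACTIONS + a] = 1.0
    return phi

def q_hat(s, a, w):
    """Linear action-value estimate: q̂(s, a, w) = wᵀ x(s, a)."""
    return w @ x(s, a)

# For this linear parameterization, the gradient of q̂ with respect to w
# is simply x(s, a), which is what the semi-gradient updates below use.
w = np.zeros(N_STATES * N_ACTIONS)
print(q_hat(2, 1, w))               # 0.0 before any learning
```

The boxed pseudocode that follows gives the complete episodic semi-gradient Sarsa algorithm.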
Input: A differentiable action-value function parameterization $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$
Algorithm parameters: Step size $\alpha > 0$, small $\varepsilon > 0$
Initialize: Value-function weights $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$)
Loop for each episode:
    $S, A \leftarrow$ initial state and action of the episode (e.g., $\varepsilon$-greedy)
    Loop for each step of episode:
        Take action $A$, observe $R, S'$
        If $S'$ is terminal:
            $\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R - \hat{q}(S, A, \mathbf{w}) \right] \nabla \hat{q}(S, A, \mathbf{w})$
            Go to next episode
        Choose $A'$ as a function of $\hat{q}(S', \cdot, \mathbf{w})$ (e.g., $\varepsilon$-greedy)
        $\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R + \gamma \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w}) \right] \nabla \hat{q}(S, A, \mathbf{w})$
        $S \leftarrow S'$
        $A \leftarrow A'$
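Below is a minimal Python sketch of the boxed algorithm for a linear parameterization. It assumes the `x` and `q_hat` helpers from the earlier feature sketch, plus a hypothetical `env` object whose `reset()` returns a state and whose `step(a)` returns `(next_state, reward, terminal)`; these names are illustrative, not a fixed API.

```python
import numpy as np

def epsilon_greedy(s, w, n_actions, eps, rng):
    """Greedy action wrt q̂(s, ·, w) with probability 1 − eps, else a random action."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax([q_hat(s, a, w) for a in range(n_actions)]))

def episodic_semi_gradient_sarsa(env, n_features, n_actions,
                                 alpha=0.1, gamma=1.0, eps=0.1,
                                 n_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)                      # w = 0, as in the box
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(s, w, n_actions, eps, rng)
        while True:
            s_next, r, terminal = env.step(a)     # take A, observe R, S'
            if terminal:
                # w ← w + α [R − q̂(S,A,w)] ∇q̂(S,A,w); ∇q̂ = x(S,A) for a linear q̂
                w += alpha * (r - q_hat(s, a, w)) * x(s, a)
                break
            a_next = epsilon_greedy(s_next, w, n_actions, eps, rng)
            td_error = r + gamma * q_hat(s_next, a_next, w) - q_hat(s, a, w)
            w += alpha * td_error * x(s, a)       # semi-gradient Sarsa update
            s, a = s_next, a_next
    return w
```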
Semi-Gradient n-step Sarsa
The episodic Sarsa update can be extended to use n-step returns for bootstrapping, allowing the agent to balance short-term and long-term information. The n-step return is:

$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \quad t + n < T,$$

with $G_{t:t+n} \doteq G_t$ when $t + n \ge T$, and the weight update becomes:

$$\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \left[ G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}), \quad 0 \le t < T.$$

n-step methods often learn faster and more reliably than one-step updates in episodic tasks: integrating rewards across multiple steps reduces the bias introduced by bootstrapping from a single estimate, at the cost of somewhat higher variance, and an intermediate value of n typically works best.
Input: A differentiable action-value function parameterization $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$
Input: A policy $\pi$ (if estimating $q_\pi$)
Algorithm parameters: Step size $\alpha > 0$, small $\varepsilon > 0$, a positive integer $n$
Initialize: Value-function weights $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$)
All store and access operations ($S_t$, $A_t$, and $R_t$) can take their index mod $n + 1$
Loop for each episode:
    Initialize and store $S_0 \ne$ terminal
    Select and store an action $A_0 \sim \pi(\cdot \mid S_0)$ or $\varepsilon$-greedy wrt $\hat{q}(S_0, \cdot, \mathbf{w})$
    $T \leftarrow \infty$
    Loop for $t = 0, 1, 2, \ldots$:
        If $t < T$, then:
            Take action $A_t$
            Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
            If $S_{t+1}$ is terminal, then:
                $T \leftarrow t + 1$
            Else:
                Select and store $A_{t+1} \sim \pi(\cdot \mid S_{t+1})$ or $\varepsilon$-greedy wrt $\hat{q}(S_{t+1}, \cdot, \mathbf{w})$
        $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
        If $\tau \ge 0$:
            $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n,\, T)} \gamma^{\,i-\tau-1} R_i$
            If $\tau + n < T$, then $G \leftarrow G + \gamma^n \hat{q}(S_{\tau+n}, A_{\tau+n}, \mathbf{w})$
            $\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ G - \hat{q}(S_\tau, A_\tau, \mathbf{w}) \right] \nabla \hat{q}(S_\tau, A_\tau, \mathbf{w})$
    Until $\tau = T - 1$
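The following is a minimal Python sketch of the n-step update loop, under the same assumptions as the episodic sketch above (the `x`, `q_hat`, and `epsilon_greedy` helpers, and a hypothetical `env` with `reset()`/`step()`). Stored quantities are indexed mod $n + 1$, exactly as the box suggests.

```python
import numpy as np

def n_step_semi_gradient_sarsa(env, n_features, n_actions, n=4,
                               alpha=0.1, gamma=1.0, eps=0.1,
                               n_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        # Circular buffers of length n + 1 for S_t, A_t, R_t.
        S = [None] * (n + 1)
        A = [None] * (n + 1)
        R = [0.0] * (n + 1)
        S[0] = env.reset()
        A[0] = epsilon_greedy(S[0], w, n_actions, eps, rng)
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, terminal = env.step(A[t % (n + 1)])
                S[(t + 1) % (n + 1)] = s_next
                R[(t + 1) % (n + 1)] = r
                if terminal:
                    T = t + 1
                else:
                    A[(t + 1) % (n + 1)] = epsilon_greedy(s_next, w, n_actions, eps, rng)
            tau = t - n + 1                      # time whose estimate is updated
            if tau >= 0:
                # n-step return: discounted rewards, plus a bootstrap if the episode continues.
                G = sum(gamma ** (i - tau - 1) * R[i % (n + 1)]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    s_b, a_b = S[(tau + n) % (n + 1)], A[(tau + n) % (n + 1)]
                    G += gamma ** n * q_hat(s_b, a_b, w)
                s_tau, a_tau = S[tau % (n + 1)], A[tau % (n + 1)]
                w += alpha * (G - q_hat(s_tau, a_tau, w)) * x(s_tau, a_tau)
            if tau == T - 1:
                break
            t += 1
    return w
```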
Average Reward
In continuing tasks (no terminal states), the discounted-reward objective is often problematic when combined with function approximation. Instead, the average reward objective provides an alternative formulation.
Average Reward Definition
Unlike discounted return formulations, the average reward treats immediate and delayed rewards equally. The policy's performance in the average reward setting is defined as the average reward per time step, calculated as:

$$r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\left[ R_t \mid S_0, A_{0:t-1} \sim \pi \right] = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$$

This formulation provides a steady-state measure of the agent's performance under policy $\pi$, where:
- $\mu_\pi(s)$ is the steady-state distribution of states under policy $\pi$. It represents the long-run proportion of time the agent spends in each state.
- $\pi(a \mid s)$ is the probability of selecting action $a$ in state $s$ under policy $\pi$.
- $p(s', r \mid s, a)$ is the environment's probability of transitioning to state $s'$ with reward $r$, given action $a$ in state $s$.
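To make the definition concrete, the sketch below computes $r(\pi)$ for a small tabular MDP, assuming the steady-state distribution `mu[s]`, policy `pi[s, a]`, and expected-reward table `R[s, a]` are given as NumPy arrays (all hypothetical). With an expected-reward table, the inner sum over $s'$ and $r$ collapses to `R[s, a]`.

```python
import numpy as np

def average_reward(mu, pi, R):
    """r(pi) = sum_s mu(s) sum_a pi(a|s) * E[r | s, a]."""
    # mu: (S,), pi: (S, A), R: (S, A) expected immediate reward
    return float(np.einsum('s,sa,sa->', mu, pi, R))

# Hypothetical 2-state, 2-action example.
mu = np.array([0.6, 0.4])
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(average_reward(mu, pi, R))   # 0.6*(0.5*1.0) + 0.4*(0.8*2.0) ≈ 0.94
```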
The existence of the steady-state distribution $\mu_\pi$ requires the ergodicity of the Markov Decision Process (MDP). An MDP is ergodic if:
- Irreducibility: It is possible to reach any state from any other state, eventually, under any stationary policy $\pi$.
- Aperiodicity: The transitions between states do not follow a fixed periodic cycle.
Under these conditions, $\mu_\pi$ is well-defined and satisfies the steady-state condition:

$$\sum_s \mu_\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a) = \mu_\pi(s')$$

This condition ensures that if the agent's state distribution is $\mu_\pi$, it remains $\mu_\pi$ after every transition. Ergodicity guarantees that the effect of the initial state fades over time, allowing for a stable long-term performance metric.
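For a small ergodic MDP with known dynamics, the steady-state distribution can be computed numerically. The sketch below forms the policy-induced transition matrix $P_\pi(s' \mid s) = \sum_a \pi(a \mid s)\, p(s' \mid s, a)$ and finds $\mu_\pi$ by power iteration; the arrays `P` and `pi` are hypothetical inputs, and an eigenvector solve would work equally well.

```python
import numpy as np

def stationary_distribution(P, pi, n_iters=10_000, tol=1e-12):
    """Return mu such that mu P_pi = mu, given P[s, a, s'] and pi[s, a]."""
    P_pi = np.einsum('sa,sat->st', pi, P)     # policy-induced transition matrix
    n_states = P_pi.shape[0]
    mu = np.full(n_states, 1.0 / n_states)    # start from the uniform distribution
    for _ in range(n_iters):
        mu_next = mu @ P_pi                   # one application of the steady-state condition
        if np.max(np.abs(mu_next - mu)) < tol:
            break
        mu = mu_next
    return mu_next

# Hypothetical 2-state, 2-action dynamics (each P[s, a] sums to 1 over s').
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])
mu = stationary_distribution(P, pi)
print(mu, mu @ np.einsum('sa,sat->st', pi, P))   # mu and mu P_pi should match
```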
Differential Return
In the average reward setting, the differential return measures the reward accumulated over time relative to the average reward. It is defined as:

$$G_t \doteq \left( R_{t+1} - r(\pi) \right) + \left( R_{t+2} - r(\pi) \right) + \left( R_{t+3} - r(\pi) \right) + \cdots$$
where:
- $R_{t+k}$ is the reward received $k$ steps after time $t$.
- $r(\pi)$ is the average reward under policy $\pi$.

The differential return removes the constant average reward $r(\pi)$ from each term, focusing on deviations from this steady-state baseline.
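As a quick illustration, a truncated differential return can be approximated from a sampled reward sequence by subtracting $r(\pi)$ from each reward before summing; the reward list and average reward below are made-up numbers.

```python
# Truncated differential return: sum_k (R_{t+k} - r(pi)) over the observed rewards.
rewards = [1.0, 0.0, 2.0, 1.0, 1.0]   # hypothetical rewards R_{t+1}, R_{t+2}, ...
r_pi = 0.94                            # hypothetical average reward r(pi)

diff_return = sum(r - r_pi for r in rewards)
print(diff_return)                     # 5.0 - 5*0.94 ≈ 0.3 (deviation from the baseline)
```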
Differential Value Function
The differential value function extends the concept of the differential return to quantify the relative desirability of states or state-action pairs under the average reward setting:
- State Value Function: The differential state value function is defined as the expected differential return when starting in state $s$ under policy $\pi$:

  $$v_\pi(s) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]$$

  This represents the long-term relative value of being in state $s$ compared to the average reward.
- Action-Value Function: The differential action-value function quantifies the expected differential return starting from state $s$ and taking action $a$:

  $$q_\pi(s, a) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$$

  This function evaluates the relative value of taking action $a$ in state $s$ under policy $\pi$.
Bellman Equations for Differential Value Functions
The differential value functions satisfy modified Bellman equations. For a given policy $\pi$, the equations are as follows (a small numerical sketch follows the list):
- State Value Function:

  $$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r - r(\pi) + v_\pi(s') \right]$$

- Action-Value Function:

  $$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r - r(\pi) + \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \right]$$

- Optimal Differential Value Functions: For optimal policies, the equations are:

  $$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r - \max_\pi r(\pi) + v_*(s') \right]$$

  $$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r - \max_\pi r(\pi) + \max_{a'} q_*(s', a') \right]$$
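For a small tabular MDP with known dynamics, the evaluation equations above can be solved numerically. The sketch below computes $r(\pi)$ from the steady-state distribution and then runs relative value iteration for $v_\pi$, anchoring the value of a reference state at zero since differential values are only determined up to an additive constant. The arrays follow the same hypothetical conventions as the earlier sketches (`P[s, a, s']`, `R[s, a]`, `pi[s, a]`), and this is only one way to solve the system (a direct linear solve also works).

```python
import numpy as np

def differential_values(P, R, pi, r_pi, n_iters=10_000, tol=1e-10):
    """Solve v(s) = sum_a pi(a|s) sum_s' P[s,a,s'] * (R[s,a] - r_pi + v(s'))."""
    n_states = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)        # policy-induced transitions
    r_vec = np.einsum('sa,sa->s', pi, R)         # expected one-step reward per state
    v = np.zeros(n_states)
    for _ in range(n_iters):
        v_new = r_vec - r_pi + P_pi @ v          # one differential Bellman backup
        v_new -= v_new[0]                        # anchor v at state 0 (values are relative)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return v_new

# Reusing the hypothetical 2-state MDP and policy from the earlier sketches.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])
mu = np.full(2, 0.5)
for _ in range(5_000):                           # power iteration for mu_pi
    mu = mu @ np.einsum('sa,sat->st', pi, P)
r_pi = float(np.einsum('s,sa,sa->', mu, pi, R))  # average reward r(pi)
v = differential_values(P, R, pi, r_pi)
print(r_pi, v)
```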
Recap
1. What is the primary goal of on-policy control with function approximation?
2. How does episodic semi-gradient Sarsa update the value function weights?
3. What advantage does n-step Sarsa offer over one-step Sarsa?
4. What is the main motivation for using the average reward objective in continuing tasks?
5. What does the differential return represent in the average reward setting?
What's Next
In the next page, we’ll explore off-policy control methods, including Q-learning, with function approximation.