On-policy Control with Function Approximation
In the previous page, we explored on-policy prediction with function approximation, focusing on how to estimate value functions using Monte Carlo and TD methods. This page extends those ideas to the control problem, where the goal is to optimize the policy while estimating the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_*(s, a)$.
Episodic Semi-Gradient Sarsa
To adapt function approximation to the control setting, we need to extend the value-function updates to the action-value function $\hat{q}(s, a, \mathbf{w})$. The update rule becomes:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \, \delta_t \, \nabla \hat{q}(S_t, A_t, \mathbf{w}_t),$$

where the TD error is:

$$\delta_t \doteq R_{t+1} + \gamma \, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t).$$
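For a linear approximator, $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$, the gradient is just the feature vector $\mathbf{x}(s, a)$, so a single update step is only a few lines of code. The sketch below assumes NumPy and uses illustrative names (`w`, `x_sa`, `x_next`, `alpha`, `gamma`) that are not tied to any particular library:

```python
import numpy as np

def semi_gradient_sarsa_step(w, x_sa, reward, x_next, alpha, gamma, terminal=False):
    """One semi-gradient Sarsa update for a linear q-hat: q(s, a, w) = w @ x(s, a)."""
    q_sa = w @ x_sa                                                  # current estimate q(S, A, w)
    target = reward if terminal else reward + gamma * (w @ x_next)   # R + gamma * q(S', A', w)
    td_error = target - q_sa                                         # delta_t
    return w + alpha * td_error * x_sa                               # gradient of a linear q-hat is x(S, A)
```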
This method, called episodic semi-gradient Sarsa, combines the following components:
- Action-Dependent Features: Feature vectors $\mathbf{x}(s, a)$ now depend on both the state and the action.
- Policy Improvement: The policy is improved incrementally using $\varepsilon$-greedy or another soft action-selection strategy; a small sketch of both pieces follows this list.
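One common (but not the only) way to build action-dependent features is to stack the state features into a per-action block; $\varepsilon$-greedy selection then just compares $\hat{q}(s, a, \mathbf{w})$ across actions. The helper names below (`action_features`, `epsilon_greedy`) are illustrative:

```python
import numpy as np

def action_features(state_features, action, n_actions):
    """One-hot-by-action stacking: x(s, a) is zero except in the block for `action`."""
    d = len(state_features)
    x = np.zeros(n_actions * d)
    x[action * d:(action + 1) * d] = state_features
    return x

def epsilon_greedy(w, state_features, n_actions, epsilon, rng):
    """Choose an action epsilon-greedily with respect to q(s, a, w) = w @ x(s, a)."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    q_values = [w @ action_features(state_features, a, n_actions) for a in range(n_actions)]
    return int(np.argmax(q_values))
```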
Input: A differentiable action-value function parameterization $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$
Algorithm parameters: Step size $\alpha > 0$, small $\varepsilon > 0$
Initialize: Value-function weights $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$)
Loop for each episode:
  $S, A \leftarrow$ initial state and action of the episode (e.g., $\varepsilon$-greedy)
  Loop for each step of episode:
    Take action $A$, observe $R, S'$
    If $S'$ is terminal:
      $\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R - \hat{q}(S, A, \mathbf{w}) \right] \nabla \hat{q}(S, A, \mathbf{w})$
      Go to next episode
    Choose $A'$ as a function of $\hat{q}(S', \cdot, \mathbf{w})$ (e.g., $\varepsilon$-greedy)
    $\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R + \gamma \, \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w}) \right] \nabla \hat{q}(S, A, \mathbf{w})$
    $S \leftarrow S'$
    $A \leftarrow A'$
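Putting the pieces together, here is a minimal sketch of the full algorithm for a linear $\hat{q}$, assuming a user-supplied `features(state, action)` function and an environment with a Gym-style `reset()`/`step()` interface; all names are illustrative rather than taken from a specific library:

```python
import numpy as np

def episodic_semi_gradient_sarsa(env, features, n_actions, n_episodes,
                                 alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Episodic semi-gradient Sarsa with a linear q-hat: q(s, a, w) = w @ features(s, a)."""
    rng = np.random.default_rng(seed)
    state, _ = env.reset()
    w = np.zeros(len(features(state, 0)))          # initialize weights (e.g., w = 0)

    def policy(s):
        # epsilon-greedy with respect to the current q-hat
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([w @ features(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        state, _ = env.reset()
        action = policy(state)
        while True:
            next_state, reward, terminated, truncated, _ = env.step(action)
            x = features(state, action)
            q = w @ x
            if terminated:                         # S' is terminal: the target is just R
                w = w + alpha * (reward - q) * x
                break                              # go to next episode
            next_action = policy(next_state)
            target = reward + gamma * (w @ features(next_state, next_action))
            w = w + alpha * (target - q) * x       # semi-gradient Sarsa update
            state, action = next_state, next_action
            if truncated:                          # episode cut short before a terminal state
                break
    return w
```

Note that the target $R + \gamma \, \hat{q}(S', A', \mathbf{w})$ is treated as a constant when computing the gradient, which is exactly what makes this a semi-gradient rather than a full-gradient method.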