7.2 On-policy Control
On-policy Control with Function Approximation
Episodic Sarsa with Function Approximation
To perform control with function approximation, we need to estimate action-value functions, $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$, rather than state-value functions:
- Action-Dependent Features: Extend state features to include actions, often by stacking a copy of the state features for each action so that only the selected action's block is active.
- Episodic Sarsa: Uses the semi-gradient method to update the weights from the TD error, $\delta_t = R_{t+1} + \gamma\,\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w})$; see the sketch after this list.
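A minimal sketch of these two ideas with a linear approximator: state features are stacked into one block per action, and the weights are updated with the semi-gradient Sarsa rule. The function names, step size, and the made-up numbers in the usage example are illustrative assumptions, not details from the notes above.

```python
import numpy as np

def action_features(state_features, action, num_actions):
    """Stack the state features into one block per action; only the
    chosen action's block is nonzero (action-dependent features)."""
    n = len(state_features)
    x = np.zeros(n * num_actions)
    x[action * n:(action + 1) * n] = state_features
    return x

def q_hat(w, state_features, action, num_actions):
    """Linear action-value estimate: q(s, a, w) = w . x(s, a)."""
    return w @ action_features(state_features, action, num_actions)

def semi_gradient_sarsa_update(w, alpha, gamma, phi_s, a, r, phi_s_next, a_next,
                               num_actions, terminal=False):
    """One episodic semi-gradient Sarsa update of the weight vector w."""
    x = action_features(phi_s, a, num_actions)
    target = r if terminal else r + gamma * q_hat(w, phi_s_next, a_next, num_actions)
    delta = target - w @ x          # TD error
    return w + alpha * delta * x    # gradient of a linear q-hat is just x

# Illustrative usage with made-up numbers.
num_actions, n_features = 3, 4
w = np.zeros(num_actions * n_features)
phi_s, phi_s_next = np.array([1., 0., 0., 1.]), np.array([0., 1., 1., 0.])
w = semi_gradient_sarsa_update(w, alpha=0.1, gamma=0.99,
                               phi_s=phi_s, a=0, r=1.0,
                               phi_s_next=phi_s_next, a_next=2,
                               num_actions=num_actions)
```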
Expected Sarsa and Q-Learning with Function Approximation
- Expected Sarsa: Updates action-value estimates using the expected next action value under the policy, $\sum_a \pi(a \mid S_{t+1})\,\hat{q}(S_{t+1}, a, \mathbf{w})$, in the TD target.
- Q-Learning: Uses the maximum estimated action value for the next state, $\max_a \hat{q}(S_{t+1}, a, \mathbf{w})$, in its update, which makes it an off-policy method; both targets are sketched below.
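A sketch showing that only the TD target changes between the two methods, again assuming a linear approximator with the stacked action features from the previous sketch; `pi_probs` (the policy's action probabilities in the next state) and the numbers in the usage lines are assumptions for illustration.

```python
import numpy as np

def q_values(w, phi_s, num_actions):
    """Estimated action values for every action in a state, assuming the
    stacked action-feature construction from the previous sketch."""
    n = len(phi_s)
    return np.array([w[a * n:(a + 1) * n] @ phi_s for a in range(num_actions)])

def expected_sarsa_target(w, r, gamma, phi_s_next, pi_probs, num_actions):
    """Expected Sarsa: expectation of the next action value under the policy."""
    return r + gamma * pi_probs @ q_values(w, phi_s_next, num_actions)

def q_learning_target(w, r, gamma, phi_s_next, num_actions):
    """Q-learning: maximum estimated action value in the next state (off-policy)."""
    return r + gamma * np.max(q_values(w, phi_s_next, num_actions))

# Illustrative usage with made-up numbers.
num_actions, n = 3, 4
w = np.arange(num_actions * n, dtype=float) * 0.01
phi_s_next = np.array([0., 1., 1., 0.])
pi_probs = np.array([0.1, 0.8, 0.1])   # assumed policy probabilities in S'
print(expected_sarsa_target(w, r=1.0, gamma=0.99, phi_s_next=phi_s_next,
                            pi_probs=pi_probs, num_actions=num_actions))
print(q_learning_target(w, r=1.0, gamma=0.99, phi_s_next=phi_s_next,
                        num_actions=num_actions))
```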
Exploration under Function Approximation
Exploration strategies need to be adapted for function approximation:
- Optimistic Initialization: Can be difficult because generalization spreads updates across many states, so the initial optimism may be washed out quickly.
- Epsilon-Greedy: Still applicable; selects a random action with probability $\varepsilon$ (and a greedy action otherwise) to encourage exploration, as in the sketch below.
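A small sketch of epsilon-greedy selection over approximate action values; the tie-breaking choice, the epsilon value, and the made-up action values are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(q_vals, epsilon, rng):
    """With probability epsilon pick a uniformly random action,
    otherwise pick a greedy action (breaking ties randomly)."""
    if rng.random() < epsilon:
        return rng.integers(len(q_vals))
    best = np.flatnonzero(q_vals == q_vals.max())
    return rng.choice(best)

# Illustrative usage with made-up action-value estimates.
rng = np.random.default_rng(0)
q_vals = np.array([0.2, 0.5, 0.5])
action = epsilon_greedy(q_vals, epsilon=0.1, rng=rng)
```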
Average Reward in Control Problems
Average Reward Objective
In continuing tasks, instead of maximizing the discounted return, we can aim to maximize the average reward per time step, $r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\bigl[R_t \mid A_{0:t-1} \sim \pi\bigr]$.
This formulation weights rewards in the near and distant future equally, avoiding the need for a discount factor.
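For reference, a sketch of the standard identity used with this objective, assuming the usual ergodicity condition so the long-run average does not depend on the start state; $\mu_\pi$ denotes the steady-state distribution of states under $\pi$ (a symbol not used in the notes above).

$$
r(\pi) \;=\; \sum_{s} \mu_\pi(s) \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r .
$$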
Differential Value Functions
To apply RL methods in the average-reward setting, we define differential value functions in terms of the differential return, $G_t = \sum_{k=1}^{\infty} \bigl(R_{t+k} - r(\pi)\bigr)$.
Differential value functions measure the relative value of states (and state-action pairs) compared to the average reward.
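For completeness, a sketch of the differential Bellman equation for action values in this setting (a standard identity, not spelled out in the notes above), with $r(\pi)$ subtracted from each reward in place of discounting:

$$
q_\pi(s, a) \;=\; \sum_{s', r} p(s', r \mid s, a)\Bigl[\, r - r(\pi) + \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \,\Bigr].
$$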
Differential Sarsa Algorithm
An example control algorithm in this setting is Differential Sarsa:
- Estimate Average Reward: Maintain an estimate $\bar{R}$ of the average reward $r(\pi)$.
- TD Error: Compute $\delta_t = R_{t+1} - \bar{R} + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w})$; note that no discount factor appears.
- Update Weights: Adjust the weights $\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\delta_t\,\nabla \hat{q}(S_t, A_t, \mathbf{w})$ using the TD error.
- Update Average Reward Estimate: Incrementally update $\bar{R}$ based on the observed rewards (commonly via the TD error, $\bar{R} \leftarrow \bar{R} + \beta\,\delta_t$); a sketch follows this list.
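A minimal sketch that puts the four steps together for a linear approximator, reusing the stacked action features from the earlier sketch. The step sizes `alpha` and `beta`, the choice to update $\bar{R}$ with the TD error, and the usage numbers are assumptions following one common variant, not details fixed by the notes above.

```python
import numpy as np

def action_features(phi_s, a, num_actions):
    """Stacked action-dependent features (one block per action)."""
    n = len(phi_s)
    x = np.zeros(n * num_actions)
    x[a * n:(a + 1) * n] = phi_s
    return x

def differential_sarsa_update(w, avg_reward, alpha, beta,
                              phi_s, a, r, phi_s_next, a_next, num_actions):
    """One differential semi-gradient Sarsa update.

    Returns the new weights and the new average-reward estimate.
    No discount factor is used; the average reward is subtracted instead.
    """
    x = action_features(phi_s, a, num_actions)
    x_next = action_features(phi_s_next, a_next, num_actions)
    # Differential TD error: delta = R - avg_reward + q(S', A') - q(S, A)
    delta = r - avg_reward + w @ x_next - w @ x
    # Average-reward update via the TD error (beta * (r - avg_reward) is another option)
    avg_reward = avg_reward + beta * delta
    # Semi-gradient weight update; gradient of a linear q-hat is just x
    w = w + alpha * delta * x
    return w, avg_reward

# Illustrative usage with made-up numbers.
num_actions, n = 2, 3
w = np.zeros(num_actions * n)
avg_reward = 0.0
phi_s, phi_s_next = np.array([1., 0., 1.]), np.array([0., 1., 0.])
w, avg_reward = differential_sarsa_update(w, avg_reward, alpha=0.1, beta=0.01,
                                          phi_s=phi_s, a=0, r=0.5,
                                          phi_s_next=phi_s_next, a_next=1,
                                          num_actions=num_actions)
```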