Policy Gradient Methods: Techniques for learning policies directly by optimizing parameterized functions.
The Objective for Learning Policies: Defining objectives for policy optimization.
The Policy Gradient Theorem: A key result for computing policy gradients.
Actor-Critic Methods: Combining policy gradients with value function approximation.
Policy Parameterizations: Implementations using softmax policies for discrete actions and Gaussian policies for continuous actions.
In previous chapters, we've focused on methods that learn value functions to derive optimal policies. Algorithms like Q-Learning and Sarsa estimate action values and then use these estimates to make decisions. In this chapter, we'll explore a different approach: Policy Gradient Methods, which directly parameterize and optimize the policy without the need to estimate value functions for action selection.
In policy gradient methods, we aim to learn a policy that maps states directly to actions. Instead of deriving the policy indirectly from estimated value functions, we parameterize the policy itself and adjust its parameters to maximize an objective function, typically the expected return.
For example, consider the Mountain Car problem, where an underpowered car must drive up a steep hill. Previously, we used value-based methods with ε-greedy policies derived from estimated action values. However, we can also define a policy that directly maps the car's state (position and velocity) to an action (accelerate left, accelerate right, or do nothing) and adjust the policy parameters to maximize performance.
We represent the policy π(a∣s,θ) using a parameterized function with parameters θ. The policy outputs the probability of taking action a in state s. Our goal is to find the parameters θ that maximize the expected return.
To ensure that π(a∣s,θ) is a valid probability distribution over actions in each state, it must satisfy:
π(a∣s,θ) ≥ 0 for all a ∈ A(s)
∑_{a∈A(s)} π(a∣s,θ) = 1
One common way to parameterize policies over discrete actions is to apply the softmax function to action preferences h(s,a,θ):
π(a∣s,θ) = e^{h(s,a,θ)} / ∑_{b∈A(s)} e^{h(s,b,θ)}
The preferences h(s,a,θ) can be linear in state-action features or computed by a neural network.
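As a concrete illustration, here is a minimal sketch of a softmax policy with linear action preferences h(s,a,θ) = θ[a]·x(s). The feature vector x(s), the action and feature counts, and the use of NumPy are assumptions made for the example, not details from the text.

import numpy as np

def softmax_policy(theta, x):
    """Return the action probabilities pi(.|s, theta) for feature vector x = x(s)."""
    preferences = theta @ x                              # h(s, a, theta) = theta[a] . x(s), one per action
    preferences = preferences - np.max(preferences)      # shift for numerical stability
    exp_prefs = np.exp(preferences)
    return exp_prefs / exp_prefs.sum()                   # non-negative and sums to 1, as required

num_actions, num_features = 3, 4                         # illustrative sizes
rng = np.random.default_rng(0)
theta = rng.normal(size=(num_actions, num_features))     # policy parameters
x = rng.normal(size=num_features)                        # features for the current state

probs = softmax_policy(theta, x)
action = rng.choice(num_actions, p=probs)                # sample A_t ~ pi(.|S_t, theta)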
For continuous action spaces, we can parameterize the policy as a Gaussian distribution.
The policy outputs mean μ(s,θ) and standard deviation σ(s,θ):
π(a∣s,θ) = (1 / (σ(s,θ)√(2π))) exp( −(a − μ(s,θ))² / (2σ(s,θ)²) )
μ(s,θ) and σ(s,θ) can be parameterized using function approximators (e.g., linear functions, neural networks).
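Below is a minimal sketch of such a Gaussian policy for a single continuous action, assuming μ(s,θ) and σ(s,θ) are linear in a feature vector x(s) and that σ is kept positive by exponentiating a linear output. These modeling choices are illustrative, not prescribed by the text.

import numpy as np

def gaussian_policy(theta_mu, theta_sigma, x, rng):
    """Sample a scalar action from N(mu(s, theta), sigma(s, theta)^2)."""
    mu = theta_mu @ x                          # mean mu(s, theta), linear in the features
    sigma = np.exp(theta_sigma @ x)            # std sigma(s, theta), kept positive via exp
    return rng.normal(mu, sigma), mu, sigma

num_features = 4
rng = np.random.default_rng(0)
theta_mu = np.zeros(num_features)
theta_sigma = np.zeros(num_features)           # exp(0) = 1, so sigma starts at 1
x = rng.normal(size=num_features)              # features for the current state

action, mu, sigma = gaussian_policy(theta_mu, theta_sigma, x, rng)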
Advantages of Policy Parameterization
Policy gradient methods offer several advantages over value-based methods:
Continuous Actions: They naturally handle continuous action spaces by parameterizing the policy over continuous variables.
Stochastic Policies: They can represent stochastic policies, which may be optimal in certain settings where deterministic policies cannot achieve the best performance due to function approximation limitations.
Autonomous Exploration: The agent can adjust its policy to reduce exploration over time, converging to a deterministic policy if appropriate.
Simpler Policies: In some problems, the optimal policy may be simpler than the optimal value function, making direct policy learning more efficient.
To learn a parameterized policy π(a∣s,θ), we need to define an objective function that quantifies the performance of the policy. A common objective is the expected return; in the episodic setting, we write ρ(θ) for the expected return obtained by following π(·∣·,θ) from the start of an episode.
The policy gradient theorem tells us that ∇_θ ρ(θ) ∝ ∑_s μ(s) ∑_a q_π(s,a) ∇_θ π(a∣s,θ), where μ(s) is the distribution over states encountered while following π. Since μ(s) and π(a∣s,θ) together describe the states and actions the agent visits under π, we can estimate this gradient from samples (S_t, A_t, R_{t+1}):
∇̂_θ ρ(θ) = ∇_θ ln π(A_t ∣ S_t, θ) q̂(S_t, A_t)
q̂(S_t, A_t) is an estimate of q_π(S_t, A_t).
We can use Monte Carlo returns or bootstrapped estimates to approximate q_π(S_t, A_t).
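To make the sample-based update concrete, the sketch below applies one update θ ← θ + α q̂(S_t,A_t) ∇_θ ln π(A_t∣S_t,θ) for the softmax-linear policy sketched earlier; the features, step size, and the stand-in value of q̂ are illustrative assumptions.

import numpy as np

def grad_log_softmax(theta, x, action):
    """grad_theta ln pi(action | s, theta) for linear preferences h(s, a, theta) = theta[a] . x."""
    prefs = theta @ x
    prefs = prefs - prefs.max()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    grad = -np.outer(probs, x)                 # row b gets -pi(b|s, theta) x(s)
    grad[action] += x                          # plus x(s) in the row of the taken action
    return grad

alpha = 0.1
theta = np.zeros((3, 4))                       # 3 actions, 4 features (illustrative sizes)
x = np.ones(4)                                 # x(S_t), assumed features
action, q_hat = 1, 2.5                         # A_t and a stand-in estimate q_hat(S_t, A_t)

theta += alpha * q_hat * grad_log_softmax(theta, x, action)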
REINFORCE: Monte-Carlo Policy-Gradient Control (episodic)
Input: a differentiable policy parameterization π(a∣s,θ)
Algorithm parameter: step size α > 0
Initialize: policy parameter θ ∈ ℝ^{d′} (e.g., to 0)
Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, …, S_{T−1}, A_{T−1}, R_T, following π(·∣·,θ)
    Loop for each step of the episode, t = 0, 1, …, T−1:
        G ← ∑_{k=t+1}^{T} γ^{k−t−1} R_k
        θ ← θ + α γ^t G ∇_θ ln π(A_t ∣ S_t, θ)
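Here is a minimal sketch of the inner update of this pseudocode, written as a function that processes one already-generated episode. The environment interaction is left out, and grad_log_pi stands for any routine returning ∇_θ ln π(a∣s,θ), such as the grad_log_softmax helper sketched above.

def reinforce_update(theta, features, actions, rewards, grad_log_pi, alpha, gamma):
    """One REINFORCE sweep over a finished episode.

    features, actions, rewards hold x(S_0)..x(S_{T-1}), A_0..A_{T-1}, and R_1..R_T.
    grad_log_pi(theta, x, a) must return grad_theta ln pi(a | s, theta).
    """
    T = len(rewards)
    for t in range(T):
        # G <- sum_{k=t+1}^{T} gamma^{k-t-1} R_k
        G = sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, T + 1))
        # theta <- theta + alpha * gamma^t * G * grad ln pi(A_t | S_t, theta)
        theta = theta + alpha * gamma ** t * G * grad_log_pi(theta, features[t], actions[t])
    return theta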
Subtracting a baseline from the estimated action-value does not introduce bias but can reduce variance:
∇̂_θ ρ(θ) = ∇_θ ln π(A_t ∣ S_t, θ) [q̂(S_t, A_t) − b(S_t)]
b(S_t) can be an estimate of the state value v_π(S_t), such as a learned v̂(S_t, w).
Replacing the Monte Carlo return with a bootstrapped target yields the TD error δ_t = R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w), which can be used in place of q̂(S_t, A_t) − b(S_t) and simplifies the computation; this is the basis of actor-critic methods.
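As a sketch of how the TD error ties the actor and critic together, here is a simplified one-step actor-critic update with a linear state-value function v̂(s,w) = w·x(s). The linear critic is an illustrative choice, and the γ^t factor that appears in the full episodic pseudocode is omitted here for brevity.

def actor_critic_step(theta, w, x, action, reward, x_next, done,
                      grad_log_pi, alpha_theta, alpha_w, gamma):
    """One actor-critic update with a linear state-value critic v_hat(s, w) = w . x(s)."""
    v = w @ x
    v_next = 0.0 if done else w @ x_next
    delta = reward + gamma * v_next - v                                  # TD error delta_t
    w = w + alpha_w * delta * x                                          # critic: semi-gradient TD(0) update
    theta = theta + alpha_theta * delta * grad_log_pi(theta, x, action)  # actor: policy-gradient update
    return theta, w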
REINFORCE with Baseline (episodic)
Input: a differentiable policy parameterization π(a∣s,θ)
Input: a differentiable state-value function parameterization v̂(s,w)
Algorithm parameters: step sizes α^θ > 0, α^w > 0
Initialize: policy parameter θ ∈ ℝ^{d′} and state-value weights w ∈ ℝ^d (e.g., to 0)
Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, …, S_{T−1}, A_{T−1}, R_T, following π(·∣·,θ)
    Loop for each step of the episode, t = 0, 1, …, T−1:
        G ← ∑_{k=t+1}^{T} γ^{k−t−1} R_k
        δ ← G − v̂(S_t, w)
        w ← w + α^w δ ∇_w v̂(S_t, w)
        θ ← θ + α^θ γ^t δ ∇_θ ln π(A_t ∣ S_t, θ)
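Finally, here is a minimal sketch of one sweep of this pseudocode over a finished episode, assuming a linear state-value baseline v̂(s,w) = w·x(s) and the same episode and grad_log_pi conventions as the earlier REINFORCE sketch.

def reinforce_with_baseline(theta, w, features, actions, rewards,
                            grad_log_pi, alpha_theta, alpha_w, gamma):
    """One sweep over a finished episode with a linear state-value baseline v_hat(s, w) = w . x(s)."""
    T = len(rewards)
    for t in range(T):
        # G <- sum_{k=t+1}^{T} gamma^{k-t-1} R_k
        G = sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, T + 1))
        delta = G - w @ features[t]                       # delta <- G - v_hat(S_t, w)
        w = w + alpha_w * delta * features[t]             # w <- w + alpha_w * delta * grad v_hat(S_t, w)
        theta = theta + alpha_theta * gamma ** t * delta * grad_log_pi(theta, features[t], actions[t])
    return theta, w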