Policy Gradient Methods: Techniques for learning policies directly by optimizing parameterized functions.
The Objective for Learning Policies: Defining objectives for policy optimization.
The Policy Gradient Theorem: A key result for computing policy gradients.
Actor-Critic Methods: Combining policy gradients with value function approximation.
Policy Parameterizations: Implementations using softmax policies for discrete actions and Gaussian policies for continuous actions.
In previous chapters, we've focused on methods that learn value functions to derive optimal policies. Algorithms like Q-Learning and Sarsa estimate action values and then use these estimates to make decisions. In this chapter, we'll explore a different approach: Policy Gradient Methods, which directly parameterize and optimize the policy without the need to estimate value functions for action selection.
In policy gradient methods, we aim to learn a policy that maps states directly to action probabilities. Instead of deriving the policy indirectly from estimated value functions, we parameterize the policy itself and adjust its parameters to maximize some objective function, typically the expected return.
For example, consider the Mountain Car problem, where an underpowered car must drive up a steep hill. Previously, we used value-based methods with ε-greedy policies derived from estimated action values. However, we can also define a policy that directly maps the car's state (position and velocity) to an action (accelerate left, accelerate right, or do nothing) and adjust the policy parameters to maximize performance.
We represent the policy π(a∣s,θ) using a parameterized function with parameters θ. The policy outputs the probability of taking action a in state s. Our goal is to find the parameters θ that maximize the expected return.
To ensure that π(a∣s,θ) is a valid probability distribution over actions in each state, it must satisfy:
π(a∣s,θ) ≥ 0 for all a ∈ A(s)
∑_{a∈A(s)} π(a∣s,θ) = 1
One common way to parameterize policies over discrete actions is the softmax over action preferences h(s,a,θ): π(a∣s,θ) = e^{h(s,a,θ)} / ∑_{b∈A(s)} e^{h(s,b,θ)}, where the preferences themselves can be computed by a linear function or a neural network.
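As a minimal sketch (the one-hot feature map, dimensions, and helper names below are hypothetical, assuming linear action preferences h(s,a,θ) = θᵀx(s,a)), a softmax policy might look like this:

```python
import numpy as np

def softmax_policy(theta, x, s, actions):
    """Probabilities pi(a|s, theta) for a softmax policy with linear preferences h(s,a,theta) = theta^T x(s,a)."""
    prefs = np.array([theta @ x(s, a) for a in actions])   # action preferences
    prefs -= prefs.max()                                   # shift for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

# Example with a made-up one-hot feature map over (state, action) pairs
n_states, n_actions = 3, 2
d = n_states * n_actions
x = lambda s, a: np.eye(d)[s * n_actions + a]   # hypothetical feature map
theta = np.zeros(d)                             # all-zero preferences -> uniform policy
print(softmax_policy(theta, x, s=0, actions=[0, 1]))   # -> [0.5 0.5]
```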
For continuous action spaces, we can parameterize the policy as a Gaussian distribution.
The policy outputs mean μ(s,θ) and standard deviation σ(s,θ):
π(a∣s,θ) = (1 / (σ(s,θ)√(2π))) exp( −(a − μ(s,θ))² / (2σ(s,θ)²) )
μ(s,θ) and σ(s,θ) can be parameterized using function approximators (e.g., linear functions, neural networks).
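As a rough sketch (assuming, purely for illustration, a linear mean μ(s,θ) = θ_μᵀ x(s) and a log-linear standard deviation σ(s,θ) = exp(θ_σᵀ x(s)) over a hypothetical feature vector x(s)):

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x_s, rng=None):
    """Sample a continuous action from a Gaussian policy.

    theta_mu    : parameters of the mean,      mu(s, theta)    = theta_mu^T x(s)
    theta_sigma : parameters of the std. dev., sigma(s, theta) = exp(theta_sigma^T x(s))
    x_s         : feature vector x(s) for the current state (hypothetical features)
    """
    rng = rng or np.random.default_rng()
    mu = theta_mu @ x_s
    sigma = np.exp(theta_sigma @ x_s)     # the exponential keeps sigma strictly positive
    action = rng.normal(mu, sigma)
    return action, mu, sigma

def gaussian_log_prob_grads(action, mu, sigma, x_s):
    """Gradients of ln pi(a|s, theta) w.r.t. theta_mu and theta_sigma (useful for policy-gradient updates)."""
    grad_mu = (action - mu) / sigma**2 * x_s
    grad_sigma = ((action - mu)**2 / sigma**2 - 1.0) * x_s
    return grad_mu, grad_sigma
```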
Advantages of Policy Parameterization
Policy gradient methods offer several advantages over value-based methods:
Continuous Actions: They naturally handle continuous action spaces by parameterizing the policy over continuous variables.
Stochastic Policies: They can represent stochastic policies, which may be optimal in certain settings where deterministic policies cannot achieve the best performance due to function approximation limitations.
Autonomous Exploration: The agent can adjust its policy to reduce exploration over time, converging to a deterministic policy if appropriate.
Simpler Policies: In some problems, the optimal policy may be simpler than the optimal value function, making direct policy learning more efficient.
To learn a parameterized policy π(a∣s,θ), we need to define an objective function J(θ) that quantifies the performance of the policy. A common objective is the expected return: in the episodic case, J(θ) ≐ v_{π_θ}(s_0), the true value of the start state s_0 under the policy determined by θ.
The Policy Gradient Theorem provides a way to compute the gradient ∇_θ J(θ) without differentiating the on-policy state distribution μ(s), which depends on both the policy and the (typically unknown) environment dynamics.
For any differentiable policy π(a∣s,θ), the gradient of the expected return is:
∇_θ J(θ) ∝ ∑_{s∈S} μ(s) ∑_{a∈A(s)} q_π(s,a) ∇_θ π(a∣s,θ)
where μ(s) is the on-policy distribution over states and q_π(s,a) is the action-value function under policy π.
In the episodic case, the constant of proportionality (denoted by the symbol ∝) is the average length of an episode; in the continuing case the relationship holds with equality.
Moreover, using the identity ∇_θ π(a∣s,θ) = π(a∣s,θ) ∇_θ ln π(a∣s,θ), we can write the gradient as an expectation under the policy:
∇_θ J(θ) ∝ E_π[ q_π(S_t, A_t) ∇_θ ln π(A_t∣S_t, θ) ]
Since μ(s) and π(a∣s,θ) are exactly the distributions over states and actions encountered while following policy π, we can estimate this gradient from samples (S_t, A_t, R_{t+1}):
∇̂_θ J(θ) = ∇_θ ln π(A_t∣S_t, θ) q̂(S_t, A_t)
where q̂(S_t, A_t) is an estimate of q_π(S_t, A_t).
We can use Monte Carlo returns to approximate q_π(S_t, A_t), which gives the REINFORCE algorithm:
REINFORCE: Monte Carlo Policy-Gradient Control (episodic)
Input: a differentiable policy parameterization π(a∣s,θ)
Algorithm parameter: step size α > 0
Initialize: policy parameter θ ∈ ℝ^{d′} (e.g., to 0)
Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, …, S_{T−1}, A_{T−1}, R_T, following π(·∣·,θ)
    Loop for each step of the episode, t = 0, 1, …, T−1:
        G ← ∑_{k=t+1}^{T} γ^{k−t−1} R_k
        θ ← θ + α γ^t G ∇ ln π(A_t∣S_t, θ)
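Below is a self-contained toy sketch of REINFORCE with a tabular softmax policy. The corridor environment, step sizes, and episode count are made up purely for illustration; only the update rule mirrors the pseudocode above.

```python
import numpy as np

n_states, n_actions = 5, 2                 # corridor states 0..4; actions: 0 = left, 1 = right
gamma, alpha = 0.99, 0.1
theta = np.zeros((n_states, n_actions))    # one action preference per (state, action) pair
rng = np.random.default_rng(0)

def pi(s):
    """Softmax policy pi(.|s, theta) over the action preferences theta[s]."""
    prefs = theta[s] - theta[s].max()      # subtract max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def step(s, a):
    """Toy dynamics: move left or right; reaching the right end gives reward +1 and ends the episode."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = (s_next == n_states - 1)
    return s_next, (1.0 if done else 0.0), done

for episode in range(500):
    # Generate an episode S0, A0, R1, ... following pi(.|., theta)
    s, done, trajectory = 0, False, []
    while not done:
        a = rng.choice(n_actions, p=pi(s))
        s_next, r, done = step(s, a)
        trajectory.append((s, a, r))
        s = s_next

    # Compute returns G_t = sum_{k=t+1}^{T} gamma^{k-t-1} R_k
    returns, G = [], 0.0
    for _, _, r in reversed(trajectory):
        G = gamma * G + r
        returns.append(G)
    returns.reverse()

    # REINFORCE update for each step: theta <- theta + alpha * gamma^t * G_t * grad ln pi(A_t|S_t, theta)
    for t, (s_t, a_t, _) in enumerate(trajectory):
        grad_ln_pi = -pi(s_t)              # softmax score: one_hot(a_t) - pi(.|s_t, theta)
        grad_ln_pi[a_t] += 1.0
        theta[s_t] += alpha * (gamma ** t) * returns[t] * grad_ln_pi
```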
However, REINFORCE can learn slowly because Monte Carlo returns can have high variance. A common remedy is to subtract a learned state-value baseline v̂(s,w) from the return, which reduces the variance of the updates without biasing the gradient estimate. This yields REINFORCE with baseline:
Input: a differentiable policy parameterization π(a∣s,θ)
Input: a differentiable state-value function parameterization v̂(s,w)
Algorithm parameters: step sizes α_θ > 0, α_w > 0
Initialize: policy parameter θ ∈ ℝ^{d′} and state-value weights w ∈ ℝ^d (e.g., to 0)
Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, …, S_{T−1}, A_{T−1}, R_T, following π(·∣·,θ)
    Loop for each step of the episode, t = 0, 1, …, T−1:
        G ← ∑_{k=t+1}^{T} γ^{k−t−1} R_k
        δ ← G − v̂(S_t, w)
        w ← w + α_w δ ∇ v̂(S_t, w)
        θ ← θ + α_θ γ^t δ ∇ ln π(A_t∣S_t, θ)
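Relative to the plain REINFORCE sketch above, only the inner update changes. Here is a hedged sketch of the per-step update, assuming (as illustrative choices, not requirements) a tabular softmax actor theta[s, a], a linear critic v̂(s,w) = wᵀx(s) with a hypothetical feature map x, and a helper pi(s) returning action probabilities:

```python
import numpy as np

def reinforce_baseline_update(theta, w, G, t, s_t, a_t, x, pi,
                              alpha_theta=0.01, alpha_w=0.01, gamma=0.99):
    """One time-step of the REINFORCE-with-baseline update shown above."""
    x_s = x(s_t)
    delta = G - w @ x_s                        # delta <- G - v_hat(S_t, w)
    w = w + alpha_w * delta * x_s              # w <- w + alpha_w * delta * grad_w v_hat(S_t, w)
    grad_ln_pi = -pi(s_t)                      # softmax score: one_hot(a_t) - pi(.|s_t, theta)
    grad_ln_pi[a_t] += 1.0
    theta[s_t] = theta[s_t] + alpha_theta * (gamma ** t) * delta * grad_ln_pi
    return theta, w
```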
Actor-Critic methods combine policy gradient updates (the actor) with value function approximation (the critic) to estimate qπ(s,a) or vπ(s).
Actor: The policy π(a∣s,θ) that selects actions and is updated in the direction suggested by the critic.
Critic: Estimates the value function vπ(s) or qπ(s,a), used to evaluate the actions taken by the actor.
Advantages
Stability: The critic provides a baseline for the actor, reducing variance in the updates.
Efficiency: Learning both the policy and value function can lead to faster convergence.
One-step Actor-Critic (episodic)
Input: a differentiable policy parameterization π(a∣s,θ)
Input: a differentiable state-value function parameterization v̂(s,w)
Algorithm parameters: step sizes α_θ > 0, α_w > 0
Initialize: policy parameter θ ∈ ℝ^{d′} and state-value weights w ∈ ℝ^d (e.g., to 0)
Loop forever (for each episode):
    Initialize S (first state of episode)
    I ← 1
    Loop while S is not terminal (for each time step):
        A ∼ π(·∣S, θ)
        Take action A, observe S′, R
        δ ← R + γ v̂(S′, w) − v̂(S, w)    (if S′ is terminal, then v̂(S′, w) ≐ 0)
        w ← w + α_w δ ∇ v̂(S, w)
        θ ← θ + α_θ I δ ∇ ln π(A∣S, θ)
        I ← γ I
        S ← S′
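A minimal sketch of the per-step actor-critic update follows, again assuming a tabular softmax actor and a linear critic v̂(s,w) = wᵀx(s); these representations and helper names are illustrative assumptions, not prescribed by the pseudocode:

```python
import numpy as np

def actor_critic_step(theta, w, s, a, r, s_next, terminal, I, x, pi,
                      alpha_theta=0.01, alpha_w=0.01, gamma=0.99):
    """One time-step of one-step actor-critic, mirroring the pseudocode above.

    theta : tabular actor parameters, one preference per (state, action) pair
    w     : linear critic weights, v_hat(s, w) = w^T x(s)
    I     : the accumulating discount gamma^t carried across steps
    x, pi : hypothetical feature-map and softmax-policy helpers (as in earlier sketches)
    """
    v_s = w @ x(s)
    v_next = 0.0 if terminal else w @ x(s_next)       # v_hat(S', w) = 0 if S' is terminal
    delta = r + gamma * v_next - v_s                  # one-step TD error

    w = w + alpha_w * delta * x(s)                    # critic update
    grad_ln_pi = -pi(s)                               # softmax score: one_hot(a) - pi(.|s, theta)
    grad_ln_pi[a] += 1.0
    theta[s] = theta[s] + alpha_theta * I * delta * grad_ln_pi   # actor update

    return theta, w, gamma * I                        # caller sets I <- gamma * I and S <- S'
```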
In this introduction, we covered the foundations of Reinforcement Learning (RL)—from multi-armed bandits and Markov Decision Processes to dynamic programming, Monte Carlo methods, temporal-difference learning, planning, and policy gradients. With these we set the stage for more advanced methods that tackle complex, real-world problems.
Next, we will explore:
(SOON) Multi-step bootstrapping:
Covering n-step returns and eligibility traces, we'll see how multi-step methods bridge the gap between Monte Carlo and one-step temporal-difference approaches, offering a flexible trade-off between bias and variance.
(SOON) Temporal abstraction:
Through the options and eigenoptions frameworks, we’ll expand an agent’s decision-making capabilities by introducing higher-level actions (or “subpolicies”), which enable more efficient planning and hierarchical learning.
(SOON) Deep Reinforcement Learning:
By combining RL with neural networks, we’ll explore how agents can scale to high-dimensional inputs, discussing seminal advances like Deep Q-Networks (DQN) as well as emerging methods in deep RL.