Skip to main content

Policy Gradient Methods

info

In this chapter, you'll learn about:

  • Policy Gradient Methods: Techniques for learning policies directly by optimizing parameterized functions.
  • The Objective for Learning Policies: Defining objectives for policy optimization.
  • The Policy Gradient Theorem: A key result for computing policy gradients.
  • Actor-Critic Methods: Combining policy gradients with value function approximation.
  • Policy Parameterizations: Implementations using softmax policies for discrete actions and Gaussian policies for continuous actions.

In previous chapters, we've focused on methods that learn value functions to derive optimal policies. Algorithms like Q-Learning and Sarsa estimate action values and then use these estimates to make decisions. In this chapter, we'll explore a different approach: Policy Gradient Methods, which directly parameterize and optimize the policy without the need to estimate value functions for action selection.

Introduction to Policy Gradient Methods

In policy gradient methods, we aim to learn a policy that maps states directly to actions without relying on value functions. Instead of deriving the policy indirectly from estimated value functions, we parameterize the policy itself and adjust its parameters to maximize some objective function, typically the expected return.

For example, consider the Mountain Car problem, where an underpowered car must drive up a steep hill. Previously, we used value-based methods with ε-greedy policies derived from estimated action values. However, we can also define a policy that directly maps the car's state (position and velocity) to an action (accelerate left, accelerate right, or do nothing) and adjust the policy parameters to maximize performance.

Parameterizing Policies

We represent the policy π(as,θ)\pi(a | s, \boldsymbol{\theta}) using a parameterized function with parameters θ\boldsymbol{\theta}. The policy outputs the probability of taking action aa in state ss. Our goal is to find the parameters θ\boldsymbol{\theta} that maximize the expected return.

To ensure that π(as,θ)\pi(a | s, \boldsymbol{\theta}) is a valid probability distribution over actions in each state, it must satisfy:

  • π(as,θ)0\pi(a | s, \boldsymbol{\theta}) \geq 0 for all aA(s)a \in \mathcal{A}(s)
  • aA(s)π(as,θ)=1\sum_{a \in \mathcal{A}(s)} \pi(a | s, \boldsymbol{\theta}) = 1

One common way to parameterize policies is using the softmax function over action preferences.

Softmax Policy

The softmax policy assigns probabilities to actions based on their preferences h(s,a,θ)h(s, a, \boldsymbol{\theta}):

π(as,θ)=eh(s,a,θ)bA(s)eh(s,b,θ)\pi(a | s, \boldsymbol{\theta}) = \frac{e^{h(s, a, \boldsymbol{\theta})}}{\sum_{b \in \mathcal{A}(s)} e^{h(s, b, \boldsymbol{\theta})}}
  • The preferences h(s,a,θ)h(s, a, \boldsymbol{\theta}) can be any real-valued function parameterized by θ\boldsymbol{\theta}.
  • The exponential function ensures that all probabilities are positive.
  • The denominator normalizes the probabilities so they sum to one.

Gaussian Policy

For continuous action spaces, we can parameterize the policy as a Gaussian distribution.

The policy outputs mean μ(s,θ)\mu(s, \boldsymbol{\theta}) and standard deviation σ(s,θ)\sigma(s, \boldsymbol{\theta}):

π(as,θ)=12πσ(s,θ)2exp((aμ(s,θ))22σ(s,θ)2)\pi(a | s, \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi \sigma(s, \boldsymbol{\theta})^2}} \exp\left( -\frac{(a - \mu(s, \boldsymbol{\theta}))^2}{2\sigma(s, \boldsymbol{\theta})^2} \right)
  • μ(s,θ)\mu(s, \boldsymbol{\theta}) and σ(s,θ)\sigma(s, \boldsymbol{\theta}) can be parameterized using function approximators (e.g., linear functions, neural networks).
Advantages of Policy Parameterization

Policy gradient methods offer several advantages over value-based methods:

  1. Continuous Actions: They naturally handle continuous action spaces by parameterizing the policy over continuous variables.
  2. Stochastic Policies: They can represent stochastic policies, which may be optimal in certain settings where deterministic policies cannot achieve the best performance due to function approximation limitations.
  3. Autonomous Exploration: The agent can adjust its policy to reduce exploration over time, converging to a deterministic policy if appropriate.
  4. Simpler Policies: In some problems, the optimal policy may be simpler than the optimal value function, making direct policy learning more efficient.

The Objective for Learning Policies

To learn a parameterized policy π(as,θ)\pi(a | s, \boldsymbol{\theta}), we need to define an objective function that quantifies the performance of the policy. A common objective is the expected return.

Expected Return Objective

For continuing tasks, we can define the average reward per time step under policy π\pi:

J(θ)r(π)sSμ(s)aA(s)π(as,θ)sSp(ss,a)  r(s,a,s)J(\boldsymbol{\theta}) \doteq r(\pi) \doteq \sum_{s \in \mathcal{S}} \mu(s) \sum_{a \in \mathcal{A}(s)} \pi(a | s, \boldsymbol{\theta}) \sum_{s' \in \mathcal{S}} p(s' | s, a) \; r(s, a, s')
  • μ(s)\mu(s) is the stationary distribution over states induced by policy π\pi.
  • p(ss,a)p(s' | s, a) is the transition probability from state ss to ss' given action aa.
  • r(s,a,s)r(s, a, s') is the expected reward when transitioning from ss to ss' using action aa.

Our goal is to find θ\boldsymbol{\theta}^* that maximizes J(θ)J(\boldsymbol{\theta}):

θ=argmaxθJ(θ)\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} J(\boldsymbol{\theta})

Policy Gradient Approach

We can use gradient ascent to find θ\boldsymbol{\theta}^*:

θk+1=θk+αθJ(θk)\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \alpha \nabla_\theta J(\boldsymbol{\theta}_k)
  • α\alpha is the learning rate.
  • θJ(θ)\nabla_\theta J(\boldsymbol{\theta}) is the gradient of the expected return with respect to the policy parameters.

However, computing θJ(θ)\nabla_\theta J(\boldsymbol{\theta}) directly is challenging due to the dependence of μ(s)\mu(s) on θ\boldsymbol{\theta}.

The Policy Gradient Theorem

The Policy Gradient Theorem provides a way to compute the gradient θJ(θ)\nabla_\theta J(\boldsymbol{\theta}) without needing to differentiate μ(s)\mu(s).

For any differentiable policy π(as,θ)\pi(a | s, \boldsymbol{\theta}), the gradient of the expected return is:

θJ(θ)sSμ(s)aA(s)qπ(s,a)θπ(as,θ)\nabla_\theta J(\boldsymbol{\theta}) \propto \sum_{s \in \mathcal{S}} \mu(s) \sum_{a \in \mathcal{A}(s)} q_\pi(s, a) \nabla_\theta \pi(a | s, \boldsymbol{\theta})
  • qπ(s,a)q_\pi(s, a) is the action-value function under policy π\pi.
  • In the episodic case, the gradient is proportional (denoted by the symbol \propto) to the average length of the episode. In the continuing case, the relationship is an equality.

Moreover, using the identity θπ(as,θ)=π(as,θ)θlnπ(as,θ)\nabla_\theta \pi(a | s, \boldsymbol{\theta}) = \pi(a | s, \boldsymbol{\theta}) \nabla_\theta \ln \pi(a | s, \boldsymbol{\theta}), we can write:

θJ(θ)=sSμ(s)aA(s)π(as,θ)θlnπ(as,θ)qπ(s,a)\nabla_\theta J(\boldsymbol{\theta}) = \sum_{s \in \mathcal{S}} \mu(s) \sum_{a \in \mathcal{A}(s)} \pi(a | s, \boldsymbol{\theta}) \nabla_\theta \ln \pi(a | s, \boldsymbol{\theta}) q_\pi(s, a)
  • This expression allows us to estimate the gradient using samples collected while following policy π\pi.
  • The dependence of μ(s)\mu(s) on θ\boldsymbol{\theta} cancels out, simplifying the computation.
  • The gradient consists of an expectation over states and actions encountered under the policy.

Estimating the Policy Gradient

Since μ(s)\mu(s) and π(as,θ)\pi(a | s, \boldsymbol{\theta}) define the distribution over states and actions encountered while following policy π\pi, we can estimate the gradient using samples (St,At,Rt+1)(S_t, A_t, R_{t+1}):

^θJ(θ)=θlnπ(AtSt,θ)q^(St,At)\hat{\nabla}_\theta J(\boldsymbol{\theta}) = \nabla_\theta \ln \pi(A_t | S_t, \boldsymbol{\theta}) \hat{q}(S_t, A_t)
  • q^(St,At)\hat{q}(S_t, A_t) is an estimate of qπ(St,At)q_\pi(S_t, A_t).

We can use Monte Carlo returns approximate qπ(St,At)q_\pi(S_t, A_t):

Reinforce: Monte-Carlo Policy-Gradient Control (episodic)
Input: a differentiable policy parameterization π(as,θ)Algorithm parameter: step size α>0Initialize: policy parameter θRd(e.g., to 0)Loop forever (for each episode):Generate an episode S0,A0,R1,,ST1,AT1,RT,following π(,θ)Loop for each step of episode t=0,1,,T1:Gk=t+1Tγkt1Rkθθ+αγtGlnπ(AtSt,θ)\begin{align*} &\textbf{Input:}\text{ a differentiable policy parameterization } \pi(a|s,\bm{\theta}) \\ &\textbf{Algorithm parameter:}\text{ step size } \alpha > 0 \\ &\textbf{Initialize:} \text{ policy parameter } \bm{\theta} \in \mathbb{R}^{d'} \text{(e.g., to \textbf{0})}\\[1em] &\textbf{Loop forever (for each episode):} \\ &\qquad \textbf{Generate an episode }\\ &\qquad\quad S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_T, \text{following } \pi(\cdot|\cdot,\bm{\theta}) \\ &\qquad \textbf{Loop for each } \text{step of episode } t=0,1, \cdots, T-1:\\ &\qquad\qquad G \leftarrow \textstyle \sum^T_{k=t+1} \gamma^{k-t-1}R_k\\ &\qquad\qquad \bm{\theta} \leftarrow \bm{\theta} + \alpha \gamma^t G \nabla \ln \pi(A_t|S_t, \bm{\theta}) \end{align*}

However, REINFORCE may have a slow learning due to the potencial high variance of Monte Carlo methods.

Baselines to Reduce Variance

Subtracting a baseline from the estimated action-value does not introduce bias but can reduce variance:

^θJ(θ)=θlnπ(AtSt,θ)[q^(St,At)b(St)]\hat{\nabla}_\theta J(\boldsymbol{\theta}) = \nabla_\theta \ln \pi(A_t | S_t, \boldsymbol{\theta}) [\hat{q}(S_t, A_t) - b(S_t)]
  • b(St)b(S_t) can be the estimated state-value vπ(St)v_\pi(S_t), or any other function independent of aa. This ensures we don't introduce bias:
ab(s)θπ(AtSt,θ)=b(s)θaπ(AtSt,θ)=b(s)θ1=0\sum_a b(s) \nabla_{\theta} \pi(A_t | S_t, \boldsymbol{\theta}) = b(s) \nabla_{\theta} \sum_a \pi(A_t | S_t, \boldsymbol{\theta}) = b(s) \nabla_{\theta} 1 = 0
Reinforce with Baseline (episodic)
Input: a differentiable policy parameterization π(as,θ)Input: a differentiable state-value function parameterization v^(s,w)Algorithm parameter: step size αθ>0,αw>0Initialize: policy parameter θRd and state-value weights wRd(e.g., to 0)Loop forever (for each episode):Generate an episode S0,A0,R1,,ST1,AT1,RT,following π(,θ)Loop for each step of episode t=0,1,,T1:Gk=t+1Tγkt1RkδGv^(St,w)ww+αwδv^(St,w)θθ+αθγtδlnπ(AtSt,θ)\begin{align*} &\textbf{Input:}\text{ a differentiable policy parameterization } \pi(a|s,\bm{\theta}) \\ &\textbf{Input:}\text{ a differentiable state-value function parameterization } \hat{v}(s,\bm{\textbf{w}}) \\ &\textbf{Algorithm parameter:}\text{ step size } \alpha^{\bm{\theta}} > 0, \alpha^{\bm{\textbf{w}}} > 0 \\ &\textbf{Initialize:} \text{ policy parameter } \bm{\theta} \in \mathbb{R}^{d'} \text{ and} \\ &\quad \text{ state-value weights } \textbf{w} \in \mathbb{R}^d \text{(e.g., to \textbf{0})}\\[1em] &\textbf{Loop forever (for each episode):} \\ &\qquad \textbf{Generate an episode }\\ &\qquad\quad S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_T, \text{following } \pi(\cdot|\cdot,\bm{\theta}) \\ &\qquad \textbf{Loop for each } \text{step of episode } t=0,1, \cdots, T-1:\\ &\qquad\qquad G \leftarrow \textstyle \sum^T_{k=t+1} \gamma^{k-t-1}R_k\\ &\qquad\qquad \delta \leftarrow G - \hat{v}(S_t, \textbf{w})\\ &\qquad\qquad \textbf{w} \leftarrow \textbf{w} + \alpha^{\textbf{w}} \delta \nabla \hat{v}(S_t, \textbf{w})\\ &\qquad\qquad \bm{\theta} \leftarrow \bm{\theta} + \alpha^{\bm{\theta}} \gamma^t \delta \nabla \ln \pi(A_t|S_t, \bm{\theta}) \end{align*}

Actor-Critic Methods

Actor-Critic methods combine policy gradient updates (the actor) with value function approximation (the critic) to estimate qπ(s,a)q_\pi(s, a) or vπ(s)v_\pi(s).

  • Actor: The policy π(as,θ)\pi(a | s, \boldsymbol{\theta}) that selects actions and is updated in the direction suggested by the critic.
  • Critic: Estimates the value function vπ(s)v_\pi(s) or qπ(s,a)q_\pi(s, a), used to evaluate the actions taken by the actor.
Advantages
  • Stability: The critic provides a baseline for the actor, reducing variance in the updates.
  • Efficiency: Learning both the policy and value function can lead to faster convergence.
One-step Actor-Critic (episodic)
Input: a differentiable policy parameterization π(as,θ)Input: a differentiable state-value function parameterization v^(s,w)Algorithm parameter: step size αθ>0,αw>0Initialize: policy parameter θRd and state-value weights wRd(e.g., to 0)Loop forever (for each episode):Initialize S (first state of episode)I1Loop while S is not terminal (for each time step):Aπ(S,θ)Take action A, observe S,RδR+γv^(S,w)v^(S,w)(if S is terminal, then v^(S,w)0)ww+αwδv^(S,w)θθ+αθγtIδlnπ(AS,θ)IγISS\begin{align*} &\textbf{Input:}\text{ a differentiable policy parameterization } \pi(a|s,\bm{\theta}) \\ &\textbf{Input:}\text{ a differentiable state-value function parameterization } \hat{v}(s,\bm{\textbf{w}}) \\ &\textbf{Algorithm parameter:}\text{ step size } \alpha^{\bm{\theta}} > 0, \alpha^{\bm{\textbf{w}}} > 0 \\ &\textbf{Initialize:} \text{ policy parameter } \bm{\theta} \in \mathbb{R}^{d'} \text{ and} \\ &\quad \text{ state-value weights } \textbf{w} \in \mathbb{R}^d \text{(e.g., to \textbf{0})}\\[1em] &\textbf{Loop forever (for each episode):} \\ &\qquad \textbf{Initialize } S \text{ (first state of episode)} \\ &\qquad I \leftarrow 1 \\ &\qquad \textbf{Loop while \(S\) is not terminal (for each time step):} \\ &\qquad\qquad A \sim \pi(\cdot|S,\boldsymbol{\theta})\\ &\qquad\qquad \text{Take action } A, \text{ observe } S', R\\ &\qquad\qquad \delta \leftarrow R + \gamma \hat{v}(S', \textbf{w}) - \hat{v}(S, \textbf{w}) \qquad \text{(if \(S'\) is terminal, then \(\hat{v}(S', \textbf{w}) \doteq 0\))}\\ &\qquad\qquad \textbf{w} \leftarrow \textbf{w} + \alpha^{\textbf{w}} \delta \nabla \hat{v}(S, \textbf{w})\\ &\qquad\qquad \bm{\theta} \leftarrow \bm{\theta} + \alpha^{\bm{\theta}} \gamma^t I \delta \nabla \ln \pi(A|S, \bm{\theta})\\ &\qquad\qquad I \leftarrow \gamma I\\ &\qquad\qquad S \leftarrow S' \end{align*}

Recap

1. What is the main difference between policy gradient methods and value-based methods?

2. What does the softmax function ensure in policy parameterization?

3. What is the primary role of the policy gradient theorem?

4. Why is a baseline used in the policy gradient update?

What's Next

In this introduction, we covered the foundations of Reinforcement Learning (RL)—from multi-armed bandits and Markov Decision Processes to dynamic programming, Monte Carlo methods, temporal-difference learning, planning, and policy gradients. With these we set the stage for more advanced methods that tackle complex, real-world problems.

Next, we will explore:

  • (SOON) Multi-step bootstrapping:
    Including n-step returns and eligibility traces, we’ll see how multi-step methods bridge the gap between Monte Carlo and single-step temporal-difference approaches, offering a flexible trade-off between bias and variance.

  • (SOON) Temporal abstraction:
    Through the options and eigenoptions frameworks, we’ll expand an agent’s decision-making capabilities by introducing higher-level actions (or “subpolicies”), which enable more efficient planning and hierarchical learning.

  • (SOON) Deep Reinforcement Learning:
    By combining RL with neural networks, we’ll explore how agents can scale to high-dimensional inputs, discussing seminal advances like Deep Q-Networks (DQN) as well as emerging methods in deep RL.