In policy gradient methods, the agent is really a policy πθ(u∣s) that chooses action u given the current input s. For us, it will be a neural network, with parameters θ.
Figure 1: Actor-environment interaction in policy gradient methods.
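As a concrete illustration, here is a minimal sketch of such a policy network in PyTorch. The class name, the layer sizes, and the discrete 2-action space are placeholder assumptions for the example, not anything fixed by the method:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """pi_theta(u | s): maps a state s to a distribution over discrete actions u."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        logits = self.net(s)          # unnormalized log-probabilities over actions
        return Categorical(logits=logits)

policy = PolicyNet(state_dim=4, n_actions=2)
s = torch.randn(4)                    # some state s (placeholder)
dist = policy(s)
u = dist.sample()                     # action u ~ pi_theta(. | s)
log_prob = dist.log_prob(u)           # log pi_theta(u | s), differentiable w.r.t. theta
```

The log-probability at the end is the quantity the policy gradient will differentiate.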
Control policy parameterized by parameter vector θ:
$$\max_\theta \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \;\middle|\; \pi_\theta\right]$$
We use a stochastic policy class π_θ(u|s), which smooths out the optimization problem.
We let τ denote a state-action sequence s_0, u_0, …, s_H, u_H, and we overload notation: R(τ) = ∑_{t=0}^{H} R(s_t, u_t). That is, the reward for trajectory τ is the sum of the rewards for each state-action pair.
We optimize the utility U(θ) as a function of the parameter vector θ:
$$U(\theta) = \mathbb{E}\left[\sum_{t=0}^{H} R(s_t, u_t) \;\middle|\; \pi_\theta\right] = \sum_{\tau} P(\tau;\theta)\, R(\tau)$$
That is, we sum over every possible trajectory the trajectory's reward weighted by its probability under the parameters θ.
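As a toy illustration (with made-up numbers), suppose there are only two possible trajectories, with P(τ_1;θ) = 0.7, R(τ_1) = 10 and P(τ_2;θ) = 0.3, R(τ_2) = 2. Then

$$U(\theta) = 0.7 \cdot 10 + 0.3 \cdot 2 = 7.6,$$

and changing θ shifts probability mass between the two trajectories, which changes U(θ).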
Our goal is to find θ:
$$\max_\theta U(\theta) = \max_\theta \sum_\tau P(\tau;\theta)\, R(\tau)$$
Different parameter settings favor different trajectories, so we have to find the setting that maximizes the expected reward.
Taking the gradient of U(θ) with respect to θ, and using the likelihood-ratio trick ∇_θ P = P ∇_θ log P, gives:

$$\nabla_\theta U(\theta) = \nabla_\theta \sum_\tau P(\tau;\theta)\, R(\tau) = \sum_\tau \nabla_\theta P(\tau;\theta)\, R(\tau) = \sum_\tau P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)$$
We have an expectation so we can compute a sample-based approximation. We don't have to enumerate all possible trajectories, which would be impossible in realistic problems.
Approximate with the empirical estimate for m sample paths under policy πθ:
$$\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta \log P\big(\tau^{(i)};\theta\big)\, R\big(\tau^{(i)}\big)$$
So, to get our gradient estimate, we're going to use our current setting of the parameters, do a bunch of rollouts, and then for each rollout we compute the grad log probability of the trajectory under the current parameter setting times the reward collected along that trajectory.
The intuition is that by controlling the parameter setting θ we can increase the probability of high-reward trajectories and decrease the probability of low-reward trajectories.
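A small self-contained sanity check of this estimator, for the simplest possible case: a one-step problem with a softmax policy over three actions (the logits and rewards are made-up numbers). The sample-based estimate converges to the exact gradient of the expected reward:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.5, -0.3, 0.1])      # logits of a softmax policy over 3 actions
r = np.array([1.0, 0.0, 0.5])           # reward of each action (one-step problem)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)

# Exact gradient of E[R] = sum_u pi(u) r(u) with respect to the logits theta.
exact_grad = pi * (r - pi @ r)

# Score-function (likelihood-ratio) estimate:
#   g_hat = (1/m) sum_i grad_theta log pi(u_i) * r(u_i),
# where grad_theta log pi(u) = one_hot(u) - pi for a softmax policy.
m = 200_000
actions = rng.choice(len(pi), size=m, p=pi)
grad_log_pi = np.eye(len(pi))[actions] - pi   # shape (m, 3)
g_hat = (grad_log_pi * r[actions, None]).mean(axis=0)

print("exact:   ", exact_grad)
print("estimate:", g_hat)                      # close to exact for large m
```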
Under the hood, the probability of a trajectory under the current parameter vector θ is a product of probabilities: the probability of the next state given the current state and action (the dynamics), and the probability of the action given the state (the policy). Taking the grad log of this product, the dynamics terms drop out:
$$\begin{aligned}
\nabla_\theta \log P\big(\tau^{(i)};\theta\big) &= \nabla_\theta \log \left[\prod_{t=0}^{H} \underbrace{P\big(s_{t+1}^{(i)} \mid s_t^{(i)}, u_t^{(i)}\big)}_{\text{dynamics model}} \cdot \underbrace{\pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big)}_{\text{policy}}\right] \\
&= \nabla_\theta \left[\sum_{t=0}^{H} \log P\big(s_{t+1}^{(i)} \mid s_t^{(i)}, u_t^{(i)}\big) + \sum_{t=0}^{H} \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big)\right] \\
&= \nabla_\theta \sum_{t=0}^{H} \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big) \\
&= \sum_{t=0}^{H} \underbrace{\nabla_\theta \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big)}_{\text{no dynamics model required!}}
\end{aligned}$$
Because the first sum (the dynamics terms) doesn't depend on θ, the gradient has no dynamics-model contribution. The remaining part can be computed without access to a dynamics model, directly from the neural network that represents our policy:
$$\nabla_\theta \log P\big(\tau^{(i)};\theta\big) = \sum_{t=0}^{H} \underbrace{\nabla_\theta \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big)}_{\text{no dynamics model required!}}$$
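In code this means we only ever need the sum of log π_θ(u_t|s_t) along a rollout; automatic differentiation then gives its gradient with respect to θ, with no dynamics model in sight. A minimal sketch, using a hypothetical linear softmax policy and random placeholder states and actions:

```python
import torch
from torch.distributions import Categorical

# Hypothetical linear softmax policy over 2 actions for 4-dimensional states.
theta = torch.zeros(4, 2, requires_grad=True)

# States and actions recorded along one rollout (random placeholders here).
states = torch.randn(10, 4)                  # s_0 ... s_H
actions = torch.randint(0, 2, (10,))         # u_0 ... u_H

# sum_t log pi_theta(u_t | s_t); the dynamics terms never appear.
log_prob_traj = Categorical(logits=states @ theta).log_prob(actions).sum()
log_prob_traj.backward()
print(theta.grad)                            # = grad_theta log P(tau; theta)
```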
We can roll out the current policy, collect the rewards along each trajectory, and record the states and actions along these trajectories. For each state-action pair experienced, we compute the grad log probability of the action given the state, accumulate these into the grad log probability of the trajectory, multiply by the reward collected along that trajectory, and average over rollouts to get the gradient estimate.
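A sketch of that procedure as one REINFORCE update in PyTorch. The env object is a stand-in (assumed to expose reset() returning a state and step(u) returning (next_state, reward, done)), and the network sizes, m, and horizon are placeholder choices:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical policy network pi_theta(u | s); 4 state dims and 2 actions are placeholders.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

def policy_gradient_step(env, policy, optimizer, m=16, horizon=100):
    """One REINFORCE update: g_hat = (1/m) sum_i [sum_t grad log pi(u_t|s_t)] * R(tau_i).

    `env` is assumed to behave like: reset() -> state, step(u) -> (next_state, reward, done).
    """
    surrogate = 0.0
    for _ in range(m):
        s = env.reset()
        log_probs, traj_reward = [], 0.0
        for _ in range(horizon):
            dist = Categorical(logits=policy(torch.as_tensor(s, dtype=torch.float32)))
            u = dist.sample()
            log_probs.append(dist.log_prob(u))
            s, r, done = env.step(u.item())
            traj_reward += r
            if done:
                break
        # [sum_t log pi_theta(u_t | s_t)] * R(tau); differentiating this gives the estimator.
        surrogate = surrogate + torch.stack(log_probs).sum() * traj_reward
    loss = -surrogate / m            # minimizing this loss performs gradient ascent on U(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The surrogate is built so that differentiating it reproduces ĝ; minimizing its negative with a standard optimizer therefore ascends U(θ).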
An unbiased estimate of the gradient:
$$\mathbb{E}[\hat{g}] = \nabla_\theta U(\theta)$$
As formulated thus far, the estimate is unbiased but very noisy: it's sample-based, and if you don't have enough samples it's not going to be very precise. Two fixes lead to real-world practicality: subtracting a baseline b and exploiting temporal structure. First, note that subtracting a baseline does not bias the gradient, because the extra term vanishes in expectation:
$$\begin{aligned}
\mathbb{E}\big[\nabla_\theta \log P(\tau;\theta)\, b\big] &= \sum_\tau P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, b \\
&= \sum_\tau P(\tau;\theta)\, \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\, b \\
&= \sum_\tau \nabla_\theta P(\tau;\theta)\, b \\
&= \nabla_\theta \Big(\sum_\tau P(\tau;\theta)\, b\Big) = b\, \nabla_\theta \Big(\sum_\tau P(\tau;\theta)\Big) \quad \text{(as long as $b$ doesn't depend on the action inside $\log \pi_\theta(u \mid s)$)} \\
&= b \times 0 = 0 \quad \text{(since } \textstyle\sum_\tau P(\tau;\theta) = 1)
\end{aligned}$$
In expectation, the extra term is zero. So why bother adding the term b? With finite samples, subtracting a baseline reduces the variance of the estimate we're accumulating.
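Concretely, the baseline is just subtracted from the trajectory reward inside the estimator; a common choice is the average return over the m rollouts:

$$\hat{g} = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta \log P\big(\tau^{(i)};\theta\big)\,\Big(R\big(\tau^{(i)}\big) - b\Big), \qquad b = \frac{1}{m}\sum_{i=1}^{m} R\big(\tau^{(i)}\big).$$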
Splitting the reward into reward from the past and reward from the future, the rewards from the past are not relevant to the action at time t, and doing the math carefully we see that the expected value of those terms is zero. Hence, we remove terms that don't contribute, which removes variance, and we get the practical equation for policy gradient (with a baseline b, whose possible choices are discussed below):

$$\hat{g} = \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big)\left(\sum_{k=t}^{H-1} R\big(s_k^{(i)}, u_k^{(i)}\big) - b\right)$$
One choice of baseline is a weighted average of the rewards, weighted by the squared grad log probability; this weights the high-probability trajectories more heavily. It is not seen that often in practice.
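This weighted-average baseline is, presumably, the standard minimum-variance ("optimal") constant baseline, which per gradient component can be written as:

$$b = \frac{\mathbb{E}\big[\big(\nabla_\theta \log P(\tau;\theta)\big)^2\, R(\tau)\big]}{\mathbb{E}\big[\big(\nabla_\theta \log P(\tau;\theta)\big)^2\big]}$$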
Time-dependent baseline:
$$b_t = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=t}^{H-1} R\big(s_k^{(i)}, u_k^{(i)}\big)$$
It captures the fact that near the end of the trajectory there may be fewer rewards left to collect than at the beginning.
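A small sketch of computing this time-dependent baseline from a batch of rollouts (the rewards array is placeholder data):

```python
import numpy as np

rewards = np.random.rand(16, 100)    # rewards[i, t] = R(s_t, u_t) on rollout i (placeholder data)

# Reward-to-go on each rollout: sum_{k=t}^{H-1} R(s_k, u_k), via a reversed cumulative sum.
reward_to_go = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]

# Time-dependent baseline: average reward-to-go over the m rollouts, one value per time step t.
b_t = reward_to_go.mean(axis=0)      # shape (H,)
```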
State-dependent expected return:
$$b(s_t) = \mathbb{E}\big[r_t + r_{t+1} + r_{t+2} + \cdots + r_{H-1}\big] = V^\pi(s_t)$$
This increases the log probability of an action in proportion to how much its return is better than the expected return under the current policy.
To fit V^π we can use Monte Carlo estimation (regress V_φ(s_t) onto the empirical returns observed from s_t) or a bootstrap estimate (regress onto r_t + V(s_{t+1}), with a regularization term that penalizes the parameters from jumping too far from the previous ones). The bootstrap version is typically more sample-efficient, but less stable, so a common practice is to start with the Monte Carlo estimation approach.
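A minimal sketch of the Monte Carlo estimation approach for the baseline: regress a value network V_φ(s) onto the empirical returns-to-go observed from each visited state. The network sizes and the placeholder data are assumptions; the bootstrap variant would instead regress onto r_t + V(s_{t+1}) with the regularization term mentioned above:

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

# Placeholder data: states visited during rollouts and the empirical
# return-to-go observed from each of them (the Monte Carlo targets).
states = torch.randn(512, 4)
returns = torch.randn(512, 1)

for _ in range(50):                                  # a few regression epochs
    loss = ((value_net(states) - returns) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```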
The advantage is the difference between the reward experienced and the expected reward. But the reward experienced is still a single-sample estimate: it is an estimate of Q from a single rollout:
$$Q^\pi(s,u) = \mathbb{E}\big[r_0 + r_1 + r_2 + \cdots \mid s_0 = s,\, u_0 = u\big]$$
In this way, each sample-based estimate has high variance. We can reduce the variance in both of the following ways:
Reduce variance by discounting.
Reduce variance by function approximation (critic).
Introduce a discount factor γ as a hyperparameter to improve the estimate of Q:
$$Q^{\pi,\gamma}(s,u) = \mathbb{E}\big[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s,\, u_0 = u\big]$$
The discount factor weights the importance of immediate versus long-term rewards. We also have to discount the baseline accordingly, because the effect of an action is felt mostly in the short term.
When we use the value function to approximate part of the Q-value, we introduce some error, some bias, but we reduce the variance, because the value estimate is based on many past experiences. This is a bias-variance trade-off, and the advantage is that with less variance, training runs faster.
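To make the trade-off concrete, here is a sketch of both advantage estimates for a single rollout: the single-sample discounted return-to-go (unbiased but high variance) versus the bootstrapped one-step estimate r_t + γV(s_{t+1}) (lower variance, some bias from the value approximation). The rewards and value estimates are placeholder data:

```python
import numpy as np

gamma = 0.99
rewards = np.random.rand(100)        # r_0 ... r_{H-1} from one rollout (placeholder)
values = np.random.rand(101)         # V(s_0) ... V(s_H) from the critic (placeholder)

# Monte Carlo estimate: discounted return-to-go, a single-sample estimate of Q(s_t, u_t).
returns = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running
mc_advantage = returns - values[:-1]

# Bootstrapped (critic) estimate: r_t + gamma * V(s_{t+1}) approximates Q(s_t, u_t).
td_advantage = rewards + gamma * values[1:] - values[:-1]
```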