
Policy Gradients and Advantage Estimation

Work in Progress

These are my rough notes.

Policy Gradient derivation

In policy gradient methods, the agent is really a policy $\pi_\theta(u|s)$ that chooses action $u$ given the current input $s$. For us, it will be a neural network with parameters $\theta$.

Figure 1: Actor-environment interaction in policy gradient methods.


  • Control policy parameterized by parameter vector $\theta$:
$$\max_\theta \ \mathbb{E}\Big[\sum_{t=0}^H R(s_t \mid \pi_\theta)\Big]$$
  • Stochastic policy class (smooths out the problem):
$$\pi_\theta(u \mid s): \text{probability of action } u \text{ in state } s$$

Why Policy Optimization

  • Often $\pi$ can be simpler than $Q$ or $V$.
    • E.g., robotic grasping.
  • $V$: doesn't prescribe actions.
    • Would need a dynamics model to look ahead against the value function.
  • $Q$: need to be able to efficiently solve $\arg\max_u Q_\theta(s,u)$.
    • This is a challenge for continuous/high-dimensional action spaces.

Likelihood Ratio Policy Gradient

We let $\tau$ denote a state-action sequence $s_0, u_0, \ldots, s_H, u_H$, and we overload notation: $R(\tau)=\sum_{t=0}^H R(s_t, u_t)$. That is, the reward for trajectory $\tau$ is the sum of the rewards for each state-action pair.

We optimize the utility as a function of the parameter vector $\theta$:

$$U(\theta)=\mathbb{E}\Big[\sum_{t=0}^H R(s_t,u_t;\pi_\theta)\Big] = \sum_{\tau}P(\tau;\theta)R(\tau)$$

For every possible trajectory, the sum of rewards is weighted by the trajectory's probability under the parameters $\theta$.

Our goal is to find the $\theta$ that maximizes this utility:

$$\max_\theta \ U(\theta)=\max_\theta \sum_{\tau}P(\tau;\theta)R(\tau)$$

Different parameter settings favor different trajectories, so we have to find the setting that maximizes the expected reward.

Taking the gradient of $U(\theta)$ with respect to $\theta$ gives:

$$\begin{align*}
\nabla_\theta U(\theta) &= \nabla_\theta \sum_{\tau}P(\tau;\theta)R(\tau) \\
&= \sum_{\tau}\nabla_\theta P(\tau;\theta)R(\tau) \\
&= \sum_{\tau}\frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)R(\tau) \\
&= \sum_{\tau}P(\tau;\theta)\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}R(\tau) \\
&= \sum_{\tau}P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)\,R(\tau)
\end{align*}$$

This is an expectation, so we can compute a sample-based approximation; we don't have to enumerate all possible trajectories, which would be impossible in realistic problems.

Approximate with the empirical estimate for $m$ sample paths under policy $\pi_\theta$:

$$\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)};\theta)\,R(\tau^{(i)})$$

So, to get our gradient estimate, we use the current parameter setting, perform a number of rollouts, and for each rollout compute the gradient of the log probability of the trajectory under the current parameters, multiplied by the reward collected along that trajectory.

The intuition is that by adjusting the parameter setting $\theta$ we can increase the probability of high-reward trajectories and decrease the probability of low-reward trajectories.


Temporal Decomposition

Let's decompose the path probability into its per-step dynamics and policy terms:

$$\nabla_\theta\log P(\tau^{(i)}; \theta)=\nabla_\theta \log \bigg[\prod_{t=0}^H \underbrace{P(s_{t+1}^{(i)} \mid s_t^{(i)}, u_t^{(i)})}_{\text{dynamics model}} \cdot \underbrace{\pi_\theta(u_t^{(i)} \mid s_t^{(i)})}_{\text{policy}}\bigg]$$

Under the hood, the probability of a trajectory under the current parameter vector $\theta$ is a product of probabilities: the probability of the next state given the current state and action (the dynamics model), and the probability of the action given the state (the policy).

$$\begin{align*}
\nabla_\theta\log P(\tau^{(i)}; \theta) &= \nabla_\theta \log \bigg[\prod_{t=0}^H \underbrace{P(s_{t+1}^{(i)} \mid s_t^{(i)}, u_t^{(i)})}_{\text{dynamics model}} \cdot \underbrace{\pi_\theta(u_t^{(i)} \mid s_t^{(i)})}_{\text{policy}}\bigg] \\
&= \nabla_\theta \bigg[\sum_{t=0}^H \log P(s_{t+1}^{(i)} \mid s_t^{(i)}, u_t^{(i)}) + \sum_{t=0}^H \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)})\bigg] \\
&= \nabla_\theta \sum_{t=0}^H \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \\
&= \sum_{t=0}^H \underbrace{\nabla_\theta\log \pi_\theta(u_t^{(i)} \mid s_t^{(i)})}_{\text{no dynamics model required!}}
\end{align*}$$

Because the first sum doesn't depend on $\theta$, the gradient has no dynamics-model contribution.

Likelihood Ratio Gradient Estimate

Estimate:

$$\hat{g} = \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)};\theta)\,R(\tau^{(i)})$$

We showed above that the first factor, the gradient of the trajectory log probability, can be computed without access to a dynamics model, directly from the neural network that represents our policy.

$$\nabla_\theta\log P(\tau^{(i)}; \theta) = \sum_{t=0}^H \underbrace{\nabla_\theta\log \pi_\theta(u_t^{(i)} \mid s_t^{(i)})}_{\text{no dynamics model required!}}$$

We roll out the current policy, collect the rewards along each trajectory, and record the states and actions visited. For each state-action pair experienced, we compute the gradient of the log probability of the action given the state, accumulate these into the gradient of the trajectory's log probability, multiply by the reward collected along that trajectory, and average over rollouts to obtain the gradient estimate.
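
As a concrete illustration, here is a minimal sketch of this estimator in PyTorch. The environment interface (`reset()` returning a state, `step(u)` returning `(next_state, reward, done)`) and the small categorical policy network are illustrative assumptions, not something specified in these notes:

```python
import torch
import torch.nn as nn

# Illustrative categorical policy for a 4-dimensional state and 2 discrete actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

def likelihood_ratio_gradient(env, m=16, horizon=200):
    """Estimate g_hat = (1/m) * sum_i grad_theta log P(tau^(i); theta) * R(tau^(i))."""
    policy.zero_grad()
    surrogate = 0.0
    for _ in range(m):
        s = env.reset()                      # hypothetical env API
        log_prob_traj, traj_reward = 0.0, 0.0
        for _ in range(horizon):
            logits = policy(torch.as_tensor(s, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            u = dist.sample()
            log_prob_traj = log_prob_traj + dist.log_prob(u)   # accumulates sum_t log pi(u_t | s_t)
            s, r, done = env.step(u.item())                    # hypothetical env API
            traj_reward += r
            if done:
                break
        # R(tau) is a plain number, so this term's gradient is grad log P(tau) * R(tau).
        surrogate = surrogate + log_prob_traj * traj_reward
    (surrogate / m).backward()               # g_hat now sits in each parameter's .grad field
    return [p.grad.clone() for p in policy.parameters()]
```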

An unbiased estimate of the gradient:

$$\mathbb{E}[\hat{g}] = \nabla_\theta U(\theta)$$

  • As formulated so far, the estimate is unbiased but very noisy: it is sample-based, and without enough samples it is not very precise. Fixes that lead to real-world practicality:
    • Baseline
    • Temporal Structure
    • And next lecture: Trust region / natural gradient

Likelihood Ratio Gradient: Intuition

The gradient tries to:

  • Increase the log probability of trajectories that are above average with respect to $R$
  • Decrease the log probability of trajectories that are below average with respect to $R$


Baseline Subtraction

We have the gradient:

$$\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)};\theta)\,R(\tau^{(i)})$$

Now consider subtracting a baseline $b$:

$$\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)};\theta)\Big(R(\tau^{(i)}) - \underbrace{b}_{\text{still unbiased [Williams 1992]}}\Big)$$

For example, we can subtract the average reward over all rollout trajectories as the baseline $b$.

Can we do that?

$$\begin{align*}
\mathbb{E}\big[\nabla_\theta \log P(\tau;\theta)\,b\big] &= \sum_\tau P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)\,b \\
&= \sum_\tau P(\tau;\theta)\,\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\,b \\
&= \sum_\tau \nabla_\theta P(\tau;\theta)\,b \\
&= \nabla_\theta \Big(\sum_\tau P(\tau;\theta)\,b\Big) \\
&= b\,\nabla_\theta \Big(\sum_\tau P(\tau;\theta)\Big) \\
&= b \times 0
\end{align*}$$

(This holds as long as the baseline $b$ does not depend on the action inside the log probability; the last step uses $\sum_\tau P(\tau;\theta) = 1$, whose gradient with respect to $\theta$ is zero.)

In expectation, the extra term is zero. So why bother subtracting $b$? With a finite number of samples, the baseline reduces the variance of the estimate we accumulate.
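
Here is a tiny numerical illustration of this effect, using a toy one-step problem with a Gaussian "policy" over a scalar action (the setup is my own, chosen only to make the variance difference visible):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.0, 100_000

# One-step toy problem: u ~ N(theta, 1), reward R(u) = u + 5 (the offset inflates variance).
u = rng.normal(theta, 1.0, size=n)
score = u - theta                        # grad_theta log N(u; theta, 1)
R = u + 5.0

g_plain = score * R                      # per-sample likelihood-ratio gradient
g_base = score * (R - R.mean())          # same, with b = empirical average reward

print(g_plain.mean(), g_base.mean())     # both close to 1.0: still (essentially) unbiased
print(g_plain.var(), g_base.var())       # variance drops from ~38 to ~2
```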

More Temporal Structure and Baseline

Current estimate:

$$\begin{align*}
\hat{g} &= \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)};\theta)\big(R(\tau^{(i)})-b\big) \\
&= \frac{1}{m}\sum_{i=1}^m \bigg(\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta(u^{(i)}_t \mid s^{(i)}_t)\bigg)\bigg(\sum_{t=0}^{H-1}R(s^{(i)}_t,u^{(i)}_t)-b\bigg) \\
&= \frac{1}{m}\sum_{i=1}^m \Bigg(\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta(u^{(i)}_t \mid s^{(i)}_t)\bigg[\sum_{k=0}^{t-1}R(s^{(i)}_k,u^{(i)}_k) + \sum_{k=t}^{H-1}R(s^{(i)}_k,u^{(i)}_k) - b\bigg]\Bigg)
\end{align*}$$

We split the reward into reward from the past and reward from the future. Rewards from the past do not depend on the action taken at time $t$, and working through the math carefully, their contribution has expected value zero. Removing terms that don't contribute removes variance, and we get the practical equation for the policy gradient:

$$\frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta(u^{(i)}_t \mid s^{(i)}_t)\bigg(\sum_{k=t}^{H-1}R(s^{(i)}_k,u^{(i)}_k) - b(s_t^{(i)})\bigg)$$
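
The inner sum over future rewards is often called the reward-to-go. A minimal way to compute it for one trajectory, e.g. in NumPy:

```python
import numpy as np

def reward_to_go(rewards):
    """For each t, return sum_{k=t}^{H-1} r_k (a reversed cumulative sum)."""
    return np.cumsum(rewards[::-1])[::-1]

print(reward_to_go(np.array([1.0, 0.0, 2.0])))   # -> [3. 2. 2.]
```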

Baseline Choices

What is a good choice for $b$?

  • Constant baseline:
$$b=\mathbb{E}[R(\tau)] \approx \frac{1}{m}\sum_{i=1}^m R(\tau^{(i)})$$
  • Optimal constant baseline (minimum-variance baseline):
$$b= \frac{\sum_{i} \big(\nabla_\theta \log P(\tau^{(i)};\theta)\big)^2 R(\tau^{(i)})}{\sum_{i} \big(\nabla_\theta \log P(\tau^{(i)};\theta)\big)^2}$$

This is an average of the rewards weighted by the squared gradient of the log probability, so trajectories whose log probability is most sensitive to $\theta$ are weighted more heavily. It is not seen very often in practice.

  • Time-dependent baseline:
$$b_t = \frac{1}{m}\sum_{i=1}^m \sum_{k=t}^{H-1} R(s_k^{(i)},u_k^{(i)})$$

It captures the fact that near the end of a trajectory there may be fewer rewards left to collect than at the beginning.

  • State-dependent expected return:
$$b(s_t) = \mathbb{E}[r_t+r_{t+1}+r_{t+2}+\ldots+r_{H-1}] = V^\pi(s_t)$$

This increases the log probability of an action in proportion to how much its return is better than the expected return under the current policy.
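
As a rough sketch, here is how the constant and time-dependent baselines might be computed from a batch of logged rollout rewards (the random array stands in for real rollout data; a fixed horizon is assumed for simplicity):

```python
import numpy as np

# rewards[i, t]: reward at time t in rollout i (fixed horizon assumed for simplicity).
rewards = np.random.default_rng(1).random((8, 10))
rtg = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]   # reward-to-go per rollout and timestep

b_const = rewards.sum(axis=1).mean()   # constant baseline: average total trajectory reward
b_time = rtg.mean(axis=0)              # time-dependent baseline: average reward-to-go at each t
# A state-dependent baseline b(s_t) = V^pi(s_t) requires fitting a value function (next section).
```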

Value function estimation

Monte Carlo Estimation of $V^\pi$

$$\frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta(u^{(i)}_t \mid s^{(i)}_t)\bigg(\sum_{k=t}^{H-1}R(s^{(i)}_k,u^{(i)}_k) - \underbrace{V^\pi(s_t^{(i)})}_{\text{how to estimate?}}\bigg)$$

One way to estimate the value function is Monte Carlo regression, as follows:

  • Initialize $V^\pi_{\phi_0}$
    • Collect trajectories $\tau_1, \tau_2, \ldots, \tau_m$
    • Regress against the empirical return:
$$\phi_{i+1} \leftarrow \arg\min_\phi \frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H-1} \Big(V^\pi_\phi(s_t^{(i)}) - \sum_{k=t}^{H-1}R(s^{(i)}_k,u^{(i)}_k)\Big)^2$$

This is a good way to start because it's simple: it is just a supervised learning problem.
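
A minimal sketch of that regression in PyTorch, assuming `states` and `returns_to_go` tensors have already been gathered from rollouts (network size and optimizer settings are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative value network for a 4-dimensional state.
value_fn = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_fn.parameters(), lr=1e-3)

def fit_value_monte_carlo(states, returns_to_go, epochs=50):
    """Regress V_phi(s_t) onto the empirical return sum_{k>=t} R(s_k, u_k)."""
    for _ in range(epochs):
        optimizer.zero_grad()
        pred = value_fn(states).squeeze(-1)            # states: (N, 4) tensor from rollouts
        loss = ((pred - returns_to_go) ** 2).mean()    # plain supervised regression loss
        loss.backward()
        optimizer.step()
```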

Bootstrap Estimation of $V^\pi$

  • Bellman equation for $V^\pi$:
$$V^\pi(s) = \sum_u \pi(u \mid s) \sum_{s'} P(s' \mid s,u)\big[R(s,u,s')+\gamma V^\pi(s')\big]$$
  • Initialize $V^\pi_{\phi_0}$
    • Collect data $\{s, u, s', r\}$
    • Fitted value iteration:
$$\phi_{i+1} \leftarrow \min_\phi \sum_{(s,u,s',r)} \big\| r+ V^\pi_{\phi_i}(s') - V^\pi_{\phi}(s) \big\|_2^2 + \lambda \| \phi - \phi_i \|_2^2$$

The last term is a regularizer that penalizes the parameters for jumping too far from the previous values. Bootstrapping is typically more sample-efficient but less stable, so a common practice is to start with the Monte Carlo estimation approach.
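
A sketch of one bootstrap iteration in PyTorch, under the same assumptions as above; the frozen copy of the network plays the role of $V^\pi_{\phi_i}$, and the L2 penalty implements the $\lambda \|\phi - \phi_i\|_2^2$ term:

```python
import copy
import torch

def fitted_value_update(value_fn, states, rewards, next_states, lam=1e-3, lr=1e-3, steps=100):
    """One bootstrap iteration: regress V_phi(s) onto r + V_{phi_i}(s'),
    with an L2 penalty keeping phi close to the previous parameters phi_i."""
    prev = copy.deepcopy(value_fn)                         # frozen copy playing the role of V_{phi_i}
    opt = torch.optim.Adam(value_fn.parameters(), lr=lr)
    with torch.no_grad():
        target = rewards + prev(next_states).squeeze(-1)   # bootstrapped regression target
    for _ in range(steps):
        opt.zero_grad()
        td_loss = ((value_fn(states).squeeze(-1) - target) ** 2).sum()
        reg = sum(((p - q.detach()) ** 2).sum()
                  for p, q in zip(value_fn.parameters(), prev.parameters()))
        (td_loss + lam * reg).backward()
        opt.step()
    return value_fn
```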

Vanilla Policy Gradient


Advantage Estimation (A2C/A3C/GAE)

Recall Our Likelihood Ratio Policy Gradient Estimator

$$\frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H-1} \nabla_\theta\log\pi_\theta(u_t^{(i)} \mid s_t^{(i)})\bigg(\underbrace{\sum_{k=t}^{H-1} R(s_k^{(i)},u_k^{(i)}) - V^\pi(s_t^{(i)})}_{\text{Advantage}}\bigg)$$

The advantage is the difference between the return experienced and the expected return. But the return experienced is still a single-sample estimate; it is an estimate of $Q$ from a single rollout:

$$Q^\pi(s,u)=\mathbb{E}[r_0 + r_1 + r_2 + \ldots \mid s_0=s, u_0=u]$$

As a result, this per-sample estimate has high variance. We can reduce the variance in both of the following ways:

  1. Reduce variance by discounting.
  2. Reduce variance by function approximation (critic).

Variance Reduction by Discounting

Introduce a discount factor as a hyperparameter to improve the estimate of $Q$:

$$Q^{\pi,\gamma}(s,u)=\mathbb{E}[r_0+ \gamma r_1+ \gamma^2 r_2+ \ldots \mid s_0=s, u_0=u]$$

The discount factor weights the importance of immediate versus long-term rewards. We also have to discount the baseline accordingly, because an action has more influence on short-term rewards than on rewards far in the future.
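
For example, the discounted reward-to-go can be computed with a single backward pass over a trajectory (a small sketch, not from the notes):

```python
import numpy as np

def discounted_reward_to_go(rewards, gamma=0.99):
    """For each t, return sum_{k=t}^{H-1} gamma^(k-t) * r_k."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(discounted_reward_to_go(np.array([1.0, 1.0, 1.0]), gamma=0.5))   # -> [1.75 1.5  1. ]
```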

Variance Reduction by Function Approximation

When we use the value function to approximate part of the Q-value, we introduce some error (bias), but we also reduce the variance, because the value function is an estimate based on many past experiences. This is a bias-variance trade-off; the benefit is that lower variance helps training run faster.

$$\begin{align*}
Q^{\pi,\gamma}(s,u)&=\mathbb{E}[r_0+ \gamma r_1+ \gamma^2 r_2+ \ldots \mid s_0=s, u_0=u] \\
&=\mathbb{E}[r_0+ \gamma V^\pi(s_1) \mid s_0=s, u_0=u] \\
&=\mathbb{E}[r_0+ \gamma r_1+ \gamma^2 V^\pi(s_2) \mid s_0=s, u_0=u] \\
&=\mathbb{E}[r_0+ \gamma r_1+ \gamma^2 r_2+ \gamma^3 V^\pi(s_3) \mid s_0=s, u_0=u] \\
&= \ldots
\end{align*}$$
  • Async Advantage Actor Critic (A3C) [Mnih et al., 2016]
    • $\hat{Q} =$ one of the above choices (e.g., $k=5$ step lookahead).
  • Generalized Advantage Estimation (GAE) [Schulman et al., ICLR 2016]
    • $\hat{Q} =$ lambda-exponentially-weighted average (weights $(1-\lambda)\lambda^k$) of all of the above, which gives a better estimate than any one specific choice of lookahead $k$. Similar to TD($\lambda$) / eligibility traces [Sutton and Barto, 1990]. A sketch of this computation follows the list.

Policy Gradient with A3C or GAE

Policy Gradient + Generalized Advantage Estimation:

  • Initialize $\pi_{\theta_0}$, $V^\pi_{\phi_0}$
    • Collect roll-outs $\{s, u, s', r\}$ and $\hat{Q}_i(s,u)$
    • Update:
$$\begin{gather}
\phi_{i+1} \leftarrow \min_\phi \sum_{(s,u,s',r)} \big\| \hat{Q}_i(s,u) - V^\pi_{\phi}(s) \big\|_2^2 + \kappa \| \phi - \phi_i \|_2^2 \\
\theta_{i+1} \leftarrow \theta_i + \alpha \frac{1}{m} \sum_{k=1}^m \sum_{t=0}^{H-1} \nabla_\theta\log\pi_\theta(u_t^{(k)} \mid s_t^{(k)}) \Big( \hat{Q}_i(s_t^{(k)},u_t^{(k)}) - V_{\phi_i}^\pi(s_t^{(k)}) \Big)
\end{gather}$$

There are many variations of this algorithm; e.g., we could instead use a 1-step target for $V$ and the full roll-out return for $\pi_\theta$:

$$\begin{gather}
\phi_{i+1} \leftarrow \min_\phi \sum_{(s,u,s',r)} \big\| r+ V^\pi_{\phi_i}(s') - V^\pi_{\phi}(s) \big\|_2^2 + \lambda \| \phi - \phi_i \|_2^2 \\
\theta_{i+1} \leftarrow \theta_i + \alpha \frac{1}{m} \sum_{k=1}^m \sum_{t=0}^{H-1} \nabla_\theta\log\pi_\theta(u_t^{(k)} \mid s_t^{(k)}) \Big( \sum_{t'=t}^{H-1}r^{(k)}_{t'} - V_{\phi_i}^\pi(s_{t}^{(k)}) \Big)
\end{gather}$$
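
To close, here is a hedged sketch of the policy update step shared by these variants, assuming a discrete-action policy network and advantages already computed by one of the estimators above (whether $\hat{Q}_i - V_{\phi_i}$ or reward-to-go minus $V_{\phi_i}$); averaging over all timesteps rather than dividing only by $m$ just rescales the step, which is absorbed into the learning rate $\alpha$:

```python
import torch

def policy_gradient_step(policy, optimizer, states, actions, advantages):
    """One ascent step on the average of grad log pi_theta(u_t|s_t) * advantage_t,
    implemented by descending the negative surrogate loss."""
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)                 # log pi_theta(u_t | s_t) for each sample
    loss = -(log_probs * advantages.detach()).mean()   # advantages are treated as constants
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```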