
TRPO and PPO

Work in Progress

These are my rough notes.

Policy Optimization

Figure 1: Actor-environment interaction in policy gradient methods.

Under the hood, our policy is a neural network, and we adjust its parameters to find the set that collects the highest reward in the environment. Formally, we solve an optimization problem:

  • Consider a control policy parameterized by parameter vector $\theta$:
$$\max_\theta \mathbb{E}\Bigg[\sum_{t=0}^H R(s_t)\,\Big|\,\pi_\theta\Bigg]$$
  • Stochastic policy class (smooths out the problem and allows us to solve problems where the reward itself is not differentiable):
$$\pi_\theta(u\mid s): \text{probability of action } u \text{ in state } s$$
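As a concrete sketch of such a stochastic policy class (a hypothetical example, not tied to any particular codebase): a small PyTorch network maps a state to a categorical distribution over discrete actions, and its weights play the role of $\theta$.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class CategoricalPolicy(nn.Module):
    """pi_theta(u|s): maps a state to a distribution over discrete actions."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> Categorical:
        logits = self.net(obs)          # unnormalized log-probabilities
        return Categorical(logits=logits)

# Usage: sample an action and keep its log-probability for the gradient step.
policy = CategoricalPolicy(obs_dim=4, n_actions=2)
obs = torch.randn(1, 4)                 # stand-in for an environment observation
dist = policy(obs)
action = dist.sample()
log_prob = dist.log_prob(action)        # log pi_theta(u|s), differentiable in theta
```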

Vanilla Policy Gradient


We are going to start from the vanilla policy gradient algorithm. Trajectories with positive advantage (rewards higher than the baseline) will have their probabilities increased.
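A minimal sketch of the corresponding loss (assuming discrete actions, with log-probabilities and advantage estimates already computed; the helper name is made up): maximizing advantage-weighted log-probability pushes up the probability of actions with positive advantage.

```python
import torch

def vanilla_pg_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Policy gradient surrogate: -E[log pi_theta(u|s) * A].

    log_probs:  log pi_theta(u_t|s_t) for the actions actually taken (differentiable).
    advantages: advantage estimates A_t (treated as constants, e.g. reward - baseline).
    """
    return -(log_probs * advantages.detach()).mean()

# Usage with any first-order optimizer:
# loss = vanilla_pg_loss(log_probs, advantages)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```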

Surrogate Loss

Derivation from Importance Sampling

$$U(\theta) = \mathbb{E}_{\tau \sim \theta_{\text{old}}}\Bigg[\frac{P(\tau\mid\theta)}{P(\tau\mid\theta_{\text{old}})}\,R(\tau)\Bigg]$$

The utility of $\theta$ is the expected reward when using policy $\pi_\theta$. To compute the gradient, we can collect trajectories with the old policy and still see the directions in which to improve, no matter which $\theta$ we evaluate at. Evaluated at $\theta = \theta_{\text{old}}$, it turns out to be the standard policy gradient. If $\theta$ is far from $\theta_{\text{old}}$, the estimator has high variance and we need more data to estimate the gradient.

$$\begin{align*}
\nabla_\theta U(\theta) &= \mathbb{E}_{\tau \sim \theta_{\text{old}}}\Bigg[\frac{\nabla_\theta P(\tau\mid\theta)}{P(\tau\mid\theta_{\text{old}})}\,R(\tau)\Bigg] \\
\nabla_\theta U(\theta)\Big|_{\theta=\theta_{\text{old}}} &= \mathbb{E}_{\tau \sim \theta_{\text{old}}}\Bigg[\frac{\nabla_\theta P(\tau\mid\theta)\big|_{\theta=\theta_{\text{old}}}}{P(\tau\mid\theta_{\text{old}})}\,R(\tau)\Bigg] \\
&= \mathbb{E}_{\tau \sim \theta_{\text{old}}}\Big[\nabla_\theta \log P(\tau\mid\theta)\big|_{\theta=\theta_{\text{old}}}\,R(\tau)\Big]
\end{align*}$$

It is as if we are making a first-order approximation to the surrogate objective $U(\theta)$.
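To make the estimator concrete, here is a minimal sketch of the importance-sampled utility for fixed-horizon trajectories (hypothetical helper and tensor layout). The dynamics terms cancel in the ratio $P(\tau\mid\theta)/P(\tau\mid\theta_{\text{old}})$ (the same cancellation used in the KL derivation below), so the trajectory ratio is a product of per-step action-probability ratios; at $\theta=\theta_{\text{old}}$ every ratio is 1 and the gradient reduces to the standard policy gradient.

```python
import torch

def trajectory_surrogate(new_log_probs: torch.Tensor,
                         old_log_probs: torch.Tensor,
                         traj_returns: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of U(theta) = E_{tau~theta_old}[ P(tau|theta)/P(tau|theta_old) * R(tau) ].

    new_log_probs / old_log_probs: shape (num_traj, horizon), log pi(u_t|s_t) under the new/old policy.
    traj_returns: shape (num_traj,), total reward R(tau) of each trajectory.
    The dynamics terms cancel, so the trajectory ratio is a product of per-step action ratios.
    """
    log_ratio = (new_log_probs - old_log_probs.detach()).sum(dim=1)  # log of the trajectory ratio
    ratio = torch.exp(log_ratio)                                     # equals 1 when theta = theta_old
    return (ratio * traj_returns.detach()).mean()
```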

Step-sizing and Trust Region Policy Optimization (TRPO)

Step-Sizing

Step-sizing is how we choose how large an update step to take, so that we avoid taking too large a step and overshooting, since the gradient is only a first-order approximation. One common approach is a line search, where we iteratively try different step sizes until we find one that gives the best result (a minimal sketch appears after the bullets below).

  • Supervised learning
    • Step too far: next update will correct for it.
  • Reinforcement learning
    • Step too far: terrible policy. The terrible policy collects terrible data, and that data may carry no useful signal at all.
    • Next mini-batch: collected under this terrible policy.
    • Not clear how to recover short of going back and shrinking the step size.

Simple Step-sizing

  • Simple step-sizing: Line search in direction of gradient:
    • Simple, but expensive (evaluations along the line).
    • Naive: ignores where the first-order approximation is good/poor.
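A minimal sketch of such a backtracking line search (hypothetical; it assumes a `surrogate(params)` function that estimates expected reward, which in RL means running roll-outs, hence the expense):

```python
from typing import Callable
import torch

def backtracking_line_search(params: torch.Tensor,
                             direction: torch.Tensor,
                             surrogate: Callable[[torch.Tensor], float],
                             init_step: float = 1.0,
                             shrink: float = 0.5,
                             max_tries: int = 10) -> torch.Tensor:
    """Try progressively smaller steps along `direction` until the surrogate improves."""
    baseline = surrogate(params)
    step = init_step
    for _ in range(max_tries):
        candidate = params + step * direction
        if surrogate(candidate) > baseline:   # we are maximizing expected reward
            return candidate
        step *= shrink                        # step was too large: shrink and retry
    return params                             # no improving step found: keep old parameters
```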

Trust Region Policy Optimization (TRPO)

TRPO is a method for optimizing policies that is designed to be more stable and reliable than vanilla policy gradient methods. It does this by constraining the new policy to stay within a trust region around the current policy, measured by the KL divergence between the two. This helps ensure we don't make changes so drastic that they destabilize the optimization.

$$\begin{align*}
\text{Surrogate loss} &: \max_\pi L(\pi) = \mathbb{E}_{\pi_{\text{old}}}\Big[\frac{\pi(a\mid s)}{\pi_{\text{old}}(a\mid s)}\,A^{\pi_{\text{old}}}(s,a)\Big] \\
\text{Constraint} &: \mathbb{E}_{\pi_{\text{old}}}\big[\mathrm{KL}(\pi_{\text{old}}\,\|\,\pi)\big] \le \epsilon
\end{align*}$$


Evaluating the KL

  • Recall:
    • When we look at the distance between distributions over trajectories, the trajectory probability is a product of policy terms (action given state) and dynamics terms (next state given state and action):
$$P(\tau;\theta)=P(s_0)\prod^{H-1}_{t=0}\pi_\theta(u_t\mid s_t)\,P(s_{t+1}\mid s_t,u_t)$$
  • Hence:
    • In the KL between the trajectory distributions of two policies, the dynamics terms cancel out when we expand the ratio. What remains is an expectation under the current policy, which we can estimate by sampling roll-outs from it.
$$\begin{align*}
\mathrm{KL}\big(P(\tau;\theta)\,\|\,P(\tau;\theta+\delta\theta)\big) &= \sum_\tau P(\tau;\theta)\log\frac{P(\tau;\theta)}{P(\tau;\theta+\delta\theta)} \\
&= \sum_\tau P(\tau;\theta)\log\frac{P(s_0)\prod^{H-1}_{t=0}\pi_\theta(u_t\mid s_t)\,P(s_{t+1}\mid s_t,u_t)}{P(s_0)\prod^{H-1}_{t=0}\pi_{\theta+\delta\theta}(u_t\mid s_t)\,P(s_{t+1}\mid s_t,u_t)} \\
&= \sum_\tau P(\tau;\theta)\log\frac{\prod^{H-1}_{t=0}\pi_\theta(u_t\mid s_t)}{\prod^{H-1}_{t=0}\pi_{\theta+\delta\theta}(u_t\mid s_t)} \\
&\approx \frac{1}{m}\sum_{(s,u)\ \text{in roll-outs under}\ \theta}\log\frac{\pi_\theta(u\mid s)}{\pi_{\theta+\delta\theta}(u\mid s)}
\end{align*}$$
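A minimal sketch of this sample-based KL estimate (assuming the per-action log-probabilities from roll-outs under the current policy are stored as tensors; the helper name is made up):

```python
import torch

def sampled_kl(logp_old: torch.Tensor, logp_new: torch.Tensor) -> torch.Tensor:
    """Estimate KL(P(tau; theta) || P(tau; theta + dtheta)) from roll-outs under theta.

    logp_old: log pi_theta(u|s) for each (s, u) visited in the roll-outs.
    logp_new: log pi_{theta + dtheta}(u|s) for the same (s, u) pairs.
    The dynamics terms cancel, so only the action log-probabilities matter.
    """
    return (logp_old - logp_new).mean()
```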

Proximal Policy Optimization (PPO)

PPO is a family of policy gradient methods that address some of the limitations of TRPO, in particular eliminating the second-order optimization introduced by the KL constraint. The core idea behind PPO is a clipped surrogate objective that encourages small changes to the policy while still allowing larger changes when they are strongly beneficial. This helps avoid the instability that can arise from drastic changes to the policy parameters.

A better TRPO?

  • Not easy to enforce trust region constraint for complex policy architectures.
    • Networks that have stochasticity like dropout.
    • Parameter sharing between policy and value function.
  • Conjugate Gradient implementation is complex.
  • It would be good to harness strong first-order optimizers like Adam, RMSProp, …

Proximal Policy Optimization V1 - “Dual Descent TRPO”

What if we move the constraint into the objective? There is a value of $\beta$ for which the two problems are equivalent, i.e., they have the same solution. We then have an unconstrained optimization problem that we can run with Adam or another first-order optimizer and iterate on.

TRPO

$$\begin{aligned}
\max_\theta \quad & \hat{\mathbb{E}}_t\Big[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}\,\hat{A}_t\Big] \\
\textrm{subject to} \quad & \hat{\mathbb{E}}_t\big[\mathrm{KL}\big[\pi_{\theta_{\text{old}}}(\cdot\mid s_t),\,\pi_\theta(\cdot\mid s_t)\big]\big] \le \delta
\end{aligned}$$

PPO v1

$$\max_\theta\ \hat{\mathbb{E}}_t\Big[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}\,\hat{A}_t\Big] - \beta\Big(\hat{\mathbb{E}}_t\big[\mathrm{KL}\big[\pi_{\theta_{\text{old}}}(\cdot\mid s_t),\,\pi_\theta(\cdot\mid s_t)\big]\big] - \delta\Big)$$


We can update $\beta$ at the end of each iteration. If the KL is close to $\delta$, the penalty term contributes little to the objective. If the KL is much larger than $\delta$, we can crank up $\beta$ so the optimization pays more attention to the KL. If the KL is much smaller than $\delta$, we can decrease $\beta$ so that the next optimization round pays less attention to the KL than before.
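A minimal sketch of this adaptive penalty rule (the 1.5 and 2 thresholds follow the heuristic reported in the PPO paper; the helper name is made up):

```python
def update_beta(beta: float, kl: float, kl_target: float) -> float:
    """Adapt the KL penalty coefficient after each optimization round.

    If the measured KL overshoots the target, increase beta so the next round
    pays more attention to the KL term; if it undershoots, decrease beta.
    """
    if kl > 1.5 * kl_target:
        beta *= 2.0
    elif kl < kl_target / 1.5:
        beta /= 2.0
    return beta
```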

Proximal Policy Optimization V2 - “Clipped Surrogate Loss”

Clipped Surrogate Loss

The clipped surrogate loss is a modification of the surrogate objective used in the methods above. It is defined as follows:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}$ is the probability ratio between the new policy and the old policy for the action taken at time step $t$, $\hat{A}_t$ is the advantage estimate at time step $t$, and $\mathrm{clip}(x, a, b)$ clips $x$ to the range $[a, b]$. This objective encourages small changes to the policy, while still allowing for larger changes if they are strongly beneficial.
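A minimal sketch of the clipped objective as a loss to minimize with a first-order optimizer (the value-function and entropy terms of a full PPO implementation are omitted; the helper name is made up):

```python
import torch

def ppo_clip_loss(new_log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss L^CLIP, negated so it can be minimized with SGD/Adam."""
    advantages = advantages.detach()                                 # A_t treated as a constant
    ratio = torch.exp(new_log_probs - old_log_probs.detach())        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages  # clip(r_t, 1-eps, 1+eps) * A_t
    return -torch.min(unclipped, clipped).mean()
```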
