TRPO and PPO
These are my rough notes.
Policy Optimization
Figure 1: Actor-environment interaction in policy gradient methods.
Our policy under the hood is a neural network, and we adjust its parameters to find the set of parameters that collects high reward in the environment. Formally, we solve an optimization problem:
- Consider a control policy parameterized by parameter vector $\theta$: $\max_\theta \; \mathbb{E}\big[\sum_{t=0}^{H} R(s_t, u_t) \mid \pi_\theta\big]$
- Stochastic policy class (smooths out the problem and allows us to solve problems where the reward itself is not differentiable): $\pi_\theta(u \mid s)$, the probability of action $u$ in state $s$.
Vanilla Policy Gradient
We start from the vanilla policy gradient algorithm, which ascends the gradient estimate $\hat{g} = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\,\big(R(\tau^{(i)}) - b\big)$. Trajectories with positive advantage (reward higher than the baseline $b$) get their probabilities increased.
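A minimal sketch of this update for a tabular softmax policy (the table parameterization and names are illustrative, not from the notes): actions with positive advantage get their probability pushed up.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def vanilla_pg_grad(theta, states, actions, advantages):
    """Policy-gradient estimate for a tabular softmax policy.

    theta: (n_states, n_actions) table of logits; pi(a|s) = softmax(theta[s]).
    Returns the ascent direction: actions with positive advantage get
    their log-probability (and hence probability) increased.
    """
    grad = np.zeros_like(theta)
    for s, a, adv in zip(states, actions, advantages):
        p = softmax(theta[s])
        g = -p            # d/d(logits) of log pi(a|s) = one_hot(a) - pi(.|s)
        g[a] += 1.0
        grad[s] += adv * g
    return grad / len(states)
```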
Surrogate Loss
Derivation from Importance Sampling
The utility $U(\theta)$ is the expected reward collected when acting under policy $\pi_\theta$. By importance sampling, we can collect trajectories with the old policy and still see the directions we need to improve, regardless of which $\theta$ we evaluate at. Evaluated at $\theta = \theta_{\text{old}}$, the gradient of this estimate is exactly the standard policy gradient. Far from $\theta_{\text{old}}$, the importance weights have high variance and we need much more data to estimate the gradient.
It is as if we are making a first-order approximation to the surrogate loss $L(\theta) = \mathbb{E}_{\tau \sim \theta_{\text{old}}}\!\big[ \frac{P(\tau;\theta)}{P(\tau;\theta_{\text{old}})} R(\tau) \big]$.
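A sketch of the importance-sampled surrogate for the same kind of tabular softmax policy (names are illustrative). At $\theta = \theta_{\text{old}}$ every ratio equals 1, so the surrogate reduces to the mean advantage.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def surrogate_loss(theta, theta_old, states, actions, advantages):
    """L(theta) = mean_t [ pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) * A_t ],
    estimated from data collected under pi_theta_old."""
    total = 0.0
    for s, a, adv in zip(states, actions, advantages):
        ratio = softmax(theta[s])[a] / softmax(theta_old[s])[a]
        total += ratio * adv
    return total / len(states)
```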
Step-sizing and Trust Region Policy Optimization (TRPO)
Step-Sizing
Step-sizing adjusts how far we move along the update direction, so we avoid taking too large a step and overshooting: the gradient is only a first-order approximation, valid near the current parameters. One common approach is a line search, where we iteratively try different step sizes until we find one that gives the best results.
- Supervised learning
- Step too far: next update will correct for it.
- Reinforcement learning
- Step too far: terrible policy. The terrible policy will give terrible data, terrible data might not have any signal at all in it.
- Next mini-batch: collected under this terrible policy.
- Not clear how to recover short of going back and shrinking the step size.
Simple Step-sizing
- Simple step-sizing: line search in the direction of the gradient: try $\theta' = \theta + \alpha \hat{g}$ for a sequence of step sizes $\alpha$ and keep the best.
- Simple, but expensive (evaluations along the line).
- Naive: ignores where the first-order approximation is good/poor.
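The line search above can be sketched as a simple backtracking loop (a generic sketch, not a specific scheme from the notes): start from a large step along the gradient and shrink until the objective improves.

```python
def backtracking_line_search(f, x, direction, init_step=1.0, shrink=0.5, max_iters=10):
    """Return the first point along `direction` that improves f.
    Expensive: each candidate step requires a fresh evaluation of f
    (in RL, a fresh batch of rollouts under the candidate policy)."""
    f0 = f(x)
    step = init_step
    for _ in range(max_iters):
        candidate = x + step * direction
        if f(candidate) > f0:
            return candidate
        step *= shrink
    return x  # no improving step found; stay put
```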
Trust Region Policy Optimization (TRPO)
TRPO is a method for optimizing policies designed to be more stable and reliable than vanilla policy gradient methods. It does this by constraining how much the policy can change in each update, measured by the KL divergence between the old and new policies, to a trust region around the current policy. This ensures we don't make changes so drastic that they destabilize the optimization process.
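The resulting constrained problem can be written as (the standard TRPO formulation, with $\delta$ the trust-region radius):

```latex
\max_{\theta} \;
\hat{\mathbb{E}}_t\!\left[
  \frac{\pi_\theta(u_t \mid s_t)}{\pi_{\theta_{\text{old}}}(u_t \mid s_t)} \, \hat{A}_t
\right]
\quad \text{s.t.} \quad
\hat{\mathbb{E}}_t\!\left[
  \mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big)
\right] \le \delta
```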
Evaluating the KL
- Recall: $P(\tau;\theta) = P(s_0)\prod_t P(s_{t+1}\mid s_t,u_t)\,\pi_\theta(u_t\mid s_t)$
- When we look at the distance between the distributions over trajectories, each trajectory probability is a product of action-given-state terms and next-state-given-state-and-action terms, so the dynamics terms cancel in the ratio: $\log \frac{P(\tau;\theta_{\text{old}})}{P(\tau;\theta)} = \sum_t \log \frac{\pi_{\theta_{\text{old}}}(u_t\mid s_t)}{\pi_\theta(u_t\mid s_t)}$
- Hence: $\mathrm{KL}\big(P(\tau;\theta_{\text{old}})\,\|\,P(\tau;\theta)\big) = \mathbb{E}_{\tau\sim\theta_{\text{old}}}\Big[\sum_t \log \frac{\pi_{\theta_{\text{old}}}(u_t\mid s_t)}{\pi_\theta(u_t\mid s_t)}\Big]$
- The KL between the trajectory distributions of two policies, once the dynamics cancel, is just an expectation under the current policy, which means we can estimate it by sampling from that policy.
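Because the dynamics cancel, the trajectory-level KL reduces to an average of per-state action-distribution KLs over states sampled from the current policy. A minimal sketch for categorical action distributions (illustrative names):

```python
import numpy as np

def kl_categorical(p, q):
    """KL(p || q) between two categorical action distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def mean_policy_kl(old_action_probs, new_action_probs):
    """Estimate the trajectory-distribution KL: average the per-state
    KL over states sampled from the old policy (the dynamics terms
    have already cancelled, so only action distributions remain)."""
    return float(np.mean([kl_categorical(p, q)
                          for p, q in zip(old_action_probs, new_action_probs)]))
```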
Proximal Policy Optimization (PPO)
PPO is a family of policy gradient methods that aim to address limitations of previous methods such as TRPO, for example by eliminating the second-order optimization introduced by the trust-region constraint. The core idea behind PPO is a clipped surrogate objective that encourages small changes to the policy while still allowing larger changes when they are strongly beneficial. This helps avoid the instability that can arise from making drastic changes to the policy parameters.
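A minimal sketch of the clipped surrogate (using the common default $\epsilon = 0.2$; names are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Clipped surrogate: take the elementwise minimum of the unclipped
    and clipped terms, giving a pessimistic bound that removes the
    incentive to push the probability ratio outside [1 - eps, 1 + eps]."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))
```

With ratio 1 (no policy change) the objective is just the mean advantage; a ratio of 2 with positive advantage is clipped back to $1 + \epsilon$.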
A better TRPO?
- Not easy to enforce trust region constraint for complex policy architectures.
- Networks with stochasticity, such as dropout.
- Parameter sharing between policy and value function.
- Conjugate Gradient implementation is complex.
- Would be good to harness good first-order optimizers like Adam, RMSProp, …
Proximal Policy Optimization V1 - “Dual Descent TRPO”
What if we move the constraint into the objective as a penalty? There is a value of the penalty coefficient $\beta$ that makes the two problems equivalent, i.e. they have the same solution. We then have an unconstrained optimization problem that we can run with Adam or another first-order optimizer, and iterate over that.
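A sketch of the penalized objective together with an adaptive-$\beta$ rule (the doubling/halving scheme follows the PPO paper; the exact thresholds and names here are illustrative):

```python
import numpy as np

def kl_penalty_objective(ratios, advantages, mean_kl, beta):
    """Unconstrained objective: surrogate minus beta times the KL penalty.
    Can be maximized with any first-order optimizer (Adam, RMSProp, ...)."""
    return float(np.mean(ratios * advantages) - beta * mean_kl)

def adapt_beta(beta, mean_kl, kl_target):
    """If the measured KL overshoots the target, increase the penalty;
    if it undershoots, relax the penalty (doubling/halving rule)."""
    if mean_kl > 1.5 * kl_target:
        return beta * 2.0
    if mean_kl < kl_target / 1.5:
        return beta / 2.0
    return beta
```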