In policy gradient methods, the agent is really a policy πθ(u∣s) that chooses action u given the current input s. For us, it will be a neural network, with parameters θ.
Figure 1: Actor-environment interaction in policy gradient methods.
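As a concrete illustration, here is a minimal sketch of such a policy network in PyTorch. The class name, the layer sizes, and the discrete 2-action space are placeholder assumptions for the example, not anything fixed by the method:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """pi_theta(u | s): maps a state s to a distribution over discrete actions u."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        logits = self.net(s)          # unnormalized log-probabilities over actions
        return Categorical(logits=logits)

policy = PolicyNet(state_dim=4, n_actions=2)
s = torch.randn(4)                    # some state s (placeholder)
dist = policy(s)
u = dist.sample()                     # action u ~ pi_theta(. | s)
log_prob = dist.log_prob(u)           # log pi_theta(u | s), differentiable w.r.t. theta
```

The log-probability at the end is the quantity the policy gradient will differentiate.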
Control policy parameterized by parameter vector θ:
$$\max_\theta \; \mathbb{E}\left[\sum_{t=0}^{H} R(s_t) \;\middle|\; \pi_\theta\right]$$
We use a stochastic policy class π_θ(u|s), which smooths out the optimization problem.
We let τ denote a state-action sequence s_0, u_0, …, s_H, u_H, and we overload notation: R(τ) = ∑_{t=0}^{H} R(s_t, u_t). That is, the reward for trajectory τ is the sum of the rewards for each state-action pair.
We optimize the utility U(θ) as a function of the parameter vector θ:
$$U(\theta) = \mathbb{E}\left[\sum_{t=0}^{H} R(s_t, u_t) \;\middle|\; \pi_\theta\right] = \sum_{\tau} P(\tau;\theta)\, R(\tau)$$
That is, we sum over every possible trajectory the trajectory's reward weighted by its probability under the parameters θ.
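As a toy illustration (with made-up numbers), suppose there are only two possible trajectories, with P(τ_1;θ) = 0.7, R(τ_1) = 10 and P(τ_2;θ) = 0.3, R(τ_2) = 2. Then

$$U(\theta) = 0.7 \cdot 10 + 0.3 \cdot 2 = 7.6,$$

and changing θ shifts probability mass between the two trajectories, which changes U(θ).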
Our goal is to find θ:
$$\max_\theta U(\theta) = \max_\theta \sum_\tau P(\tau;\theta)\, R(\tau)$$
Different parameter settings favor different trajectories, so we have to find the setting that maximizes the expected reward.
Taking the gradient of U(θ) with respect to θ, and using the likelihood-ratio trick ∇_θ P = P ∇_θ log P, gives:

$$\nabla_\theta U(\theta) = \nabla_\theta \sum_\tau P(\tau;\theta)\, R(\tau) = \sum_\tau \nabla_\theta P(\tau;\theta)\, R(\tau) = \sum_\tau P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)$$
We have an expectation so we can compute a sample-based approximation. We don't have to enumerate all possible trajectories, which would be impossible in realistic problems.
Approximate with the empirical estimate for m sample paths under policy πθ:
$$\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta \log P\big(\tau^{(i)};\theta\big)\, R\big(\tau^{(i)}\big)$$
So, to get our gradient estimate, we're going to use our current setting of the parameters, do a bunch of rollouts, and then for each rollout we compute the grad log probability of the trajectory under the current parameter setting times the reward collected along that trajectory.
The intuition is that by controlling the parameter setting θ we can increase the probability of high-reward trajectories and decrease the probability of low-reward trajectories.
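A small self-contained sanity check of this estimator, for the simplest possible case: a one-step problem with a softmax policy over three actions (the logits and rewards are made-up numbers). The sample-based estimate converges to the exact gradient of the expected reward:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.5, -0.3, 0.1])      # logits of a softmax policy over 3 actions
r = np.array([1.0, 0.0, 0.5])           # reward of each action (one-step problem)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)

# Exact gradient of E[R] = sum_u pi(u) r(u) with respect to the logits theta.
exact_grad = pi * (r - pi @ r)

# Score-function (likelihood-ratio) estimate:
#   g_hat = (1/m) sum_i grad_theta log pi(u_i) * r(u_i),
# where grad_theta log pi(u) = one_hot(u) - pi for a softmax policy.
m = 200_000
actions = rng.choice(len(pi), size=m, p=pi)
grad_log_pi = np.eye(len(pi))[actions] - pi   # shape (m, 3)
g_hat = (grad_log_pi * r[actions, None]).mean(axis=0)

print("exact:   ", exact_grad)
print("estimate:", g_hat)                      # close to exact for large m
```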
Under the hood, the probability of a trajectory under the current parameter vector θ is a product of probabilities: the probability of the next state given the current state and action (the dynamics), and the probability of the action given the state (the policy). Taking the grad log of this product, the dynamics terms drop out:
$$\begin{aligned}
\nabla_\theta \log P\big(\tau^{(i)};\theta\big) &= \nabla_\theta \log \left[\prod_{t=0}^{H} \underbrace{P\big(s_{t+1}^{(i)} \mid s_t^{(i)}, u_t^{(i)}\big)}_{\text{dynamics model}} \cdot \underbrace{\pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big)}_{\text{policy}}\right] \\
&= \nabla_\theta \left[\sum_{t=0}^{H} \log P\big(s_{t+1}^{(i)} \mid s_t^{(i)}, u_t^{(i)}\big) + \sum_{t=0}^{H} \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big)\right] \\
&= \nabla_\theta \sum_{t=0}^{H} \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big) \\
&= \sum_{t=0}^{H} \underbrace{\nabla_\theta \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big)}_{\text{no dynamics model required!}}
\end{aligned}$$
Because the first sum (the dynamics terms) doesn't depend on θ, the gradient has no dynamics-model contribution. The remaining part can be computed without access to a dynamics model, directly from the neural network that represents our policy:
$$\nabla_\theta \log P\big(\tau^{(i)};\theta\big) = \sum_{t=0}^{H} \underbrace{\nabla_\theta \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big)}_{\text{no dynamics model required!}}$$
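In code this means we only ever need the sum of log π_θ(u_t|s_t) along a rollout; automatic differentiation then gives its gradient with respect to θ, with no dynamics model in sight. A minimal sketch, using a hypothetical linear softmax policy and random placeholder states and actions:

```python
import torch
from torch.distributions import Categorical

# Hypothetical linear softmax policy over 2 actions for 4-dimensional states.
theta = torch.zeros(4, 2, requires_grad=True)

# States and actions recorded along one rollout (random placeholders here).
states = torch.randn(10, 4)                  # s_0 ... s_H
actions = torch.randint(0, 2, (10,))         # u_0 ... u_H

# sum_t log pi_theta(u_t | s_t); the dynamics terms never appear.
log_prob_traj = Categorical(logits=states @ theta).log_prob(actions).sum()
log_prob_traj.backward()
print(theta.grad)                            # = grad_theta log P(tau; theta)
```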
We can roll out the current policy, collect the rewards along each trajectory, and record the states and actions along these trajectories. For each state-action pair experienced, we compute the grad log probability of the action given the state, accumulate these into the grad log probability of the trajectory, multiply by the reward collected along that trajectory, and average over rollouts to get the gradient estimate.
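A sketch of that procedure as one REINFORCE update in PyTorch. The env object is a stand-in (assumed to expose reset() returning a state and step(u) returning (next_state, reward, done)), and the network sizes, m, and horizon are placeholder choices:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical policy network pi_theta(u | s); 4 state dims and 2 actions are placeholders.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

def policy_gradient_step(env, policy, optimizer, m=16, horizon=100):
    """One REINFORCE update: g_hat = (1/m) sum_i [sum_t grad log pi(u_t|s_t)] * R(tau_i).

    `env` is assumed to behave like: reset() -> state, step(u) -> (next_state, reward, done).
    """
    surrogate = 0.0
    for _ in range(m):
        s = env.reset()
        log_probs, traj_reward = [], 0.0
        for _ in range(horizon):
            dist = Categorical(logits=policy(torch.as_tensor(s, dtype=torch.float32)))
            u = dist.sample()
            log_probs.append(dist.log_prob(u))
            s, r, done = env.step(u.item())
            traj_reward += r
            if done:
                break
        # [sum_t log pi_theta(u_t | s_t)] * R(tau); differentiating this gives the estimator.
        surrogate = surrogate + torch.stack(log_probs).sum() * traj_reward
    loss = -surrogate / m            # minimizing this loss performs gradient ascent on U(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The surrogate is built so that differentiating it reproduces ĝ; minimizing its negative with a standard optimizer therefore ascends U(θ).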
An unbiased estimate of the gradient:
$$\mathbb{E}[\hat{g}] = \nabla_\theta U(\theta)$$
As formulated thus far, the estimate is unbiased but very noisy: it's sample-based, and if you don't have enough samples it's not going to be very precise. Two fixes lead to real-world practicality: subtracting a baseline b and exploiting temporal structure. First, note that subtracting a baseline does not bias the gradient, because the extra term vanishes in expectation:
$$\begin{aligned}
\mathbb{E}\big[\nabla_\theta \log P(\tau;\theta)\, b\big] &= \sum_\tau P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, b \\
&= \sum_\tau P(\tau;\theta)\, \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\, b \\
&= \sum_\tau \nabla_\theta P(\tau;\theta)\, b \\
&= \nabla_\theta \Big(\sum_\tau P(\tau;\theta)\, b\Big) = b\, \nabla_\theta \Big(\sum_\tau P(\tau;\theta)\Big) \quad \text{(as long as $b$ doesn't depend on the action inside $\log \pi_\theta(u \mid s)$)} \\
&= b \times 0 = 0 \quad \text{(since } \textstyle\sum_\tau P(\tau;\theta) = 1)
\end{aligned}$$
In expectation, the extra term is zero. So why bother adding the term b? With finite samples, subtracting a baseline reduces the variance of the estimate we're accumulating.
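Concretely, the baseline is just subtracted from the trajectory reward inside the estimator; a common choice is the average return over the m rollouts:

$$\hat{g} = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta \log P\big(\tau^{(i)};\theta\big)\,\Big(R\big(\tau^{(i)}\big) - b\Big), \qquad b = \frac{1}{m}\sum_{i=1}^{m} R\big(\tau^{(i)}\big).$$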
Splitting the reward into reward from the past and reward from the future, the rewards from the past are not relevant to the action at time t, and doing the math carefully we see that the expected value of those terms is zero. Hence, we remove terms that don't contribute, which removes variance, and we get the practical equation for policy gradient (with a baseline b, whose possible choices are discussed below):

$$\hat{g} = \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big)\left(\sum_{k=t}^{H-1} R\big(s_k^{(i)}, u_k^{(i)}\big) - b\right)$$
One choice of baseline is a weighted average of the rewards, weighted by the squared grad log probability; this weights the high-probability trajectories more heavily. It is not seen that often in practice.
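This weighted-average baseline is, presumably, the standard minimum-variance ("optimal") constant baseline, which per gradient component can be written as:

$$b = \frac{\mathbb{E}\big[\big(\nabla_\theta \log P(\tau;\theta)\big)^2\, R(\tau)\big]}{\mathbb{E}\big[\big(\nabla_\theta \log P(\tau;\theta)\big)^2\big]}$$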
Time-dependent baseline:
$$b_t = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=t}^{H-1} R\big(s_k^{(i)}, u_k^{(i)}\big)$$
It captures the fact that near the end of the trajectory there may be fewer rewards left to collect than at the beginning.
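A small sketch of computing this time-dependent baseline from a batch of rollouts (the rewards array is placeholder data):

```python
import numpy as np

rewards = np.random.rand(16, 100)    # rewards[i, t] = R(s_t, u_t) on rollout i (placeholder data)

# Reward-to-go on each rollout: sum_{k=t}^{H-1} R(s_k, u_k), via a reversed cumulative sum.
reward_to_go = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]

# Time-dependent baseline: average reward-to-go over the m rollouts, one value per time step t.
b_t = reward_to_go.mean(axis=0)      # shape (H,)
```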
State-dependent expected return:
$$b(s_t) = \mathbb{E}\big[r_t + r_{t+1} + r_{t+2} + \cdots + r_{H-1}\big] = V^\pi(s_t)$$
This increases the log probability of an action in proportion to how much its return is better than the expected return under the current policy.
To fit V^π we can use Monte Carlo estimation (regress V_φ(s_t) onto the empirical returns observed from s_t) or a bootstrap estimate (regress onto r_t + V(s_{t+1}), with a regularization term that penalizes the parameters from jumping too far from the previous ones). The bootstrap version is typically more sample-efficient, but less stable, so a common practice is to start with the Monte Carlo estimation approach.
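A minimal sketch of the Monte Carlo estimation approach for the baseline: regress a value network V_φ(s) onto the empirical returns-to-go observed from each visited state. The network sizes and the placeholder data are assumptions; the bootstrap variant would instead regress onto r_t + V(s_{t+1}) with the regularization term mentioned above:

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

# Placeholder data: states visited during rollouts and the empirical
# return-to-go observed from each of them (the Monte Carlo targets).
states = torch.randn(512, 4)
returns = torch.randn(512, 1)

for _ in range(50):                                  # a few regression epochs
    loss = ((value_net(states) - returns) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```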
The advantage is the difference between the reward experienced and the expected reward. But the reward experienced is still a single-sample estimate: it is an estimate of Q from a single rollout:
$$Q^\pi(s,u) = \mathbb{E}\big[r_0 + r_1 + r_2 + \cdots \mid s_0 = s,\, u_0 = u\big]$$
In this way, each sample-based estimate has high variance. We can reduce the variance in both of the following ways:
Reduce variance by discounting.
Reduce variance by function approximation (critic).
Introduce a discount factor γ as a hyperparameter to improve the estimate of Q:
$$Q^{\pi,\gamma}(s,u) = \mathbb{E}\big[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s,\, u_0 = u\big]$$
The discount factor weights the importance of immediate versus long-term rewards. We also have to discount the baseline accordingly, because the effect of an action is felt mostly in the short term.
When we use the value function to approximate part of the Q-value, we introduce some error, some bias, but we reduce the variance, because the value estimate is based on many past experiences. This is a bias-variance trade-off, and the advantage is that with less variance, training runs faster.
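To make the trade-off concrete, here is a sketch of both advantage estimates for a single rollout: the single-sample discounted return-to-go (unbiased but high variance) versus the bootstrapped one-step estimate r_t + γV(s_{t+1}) (lower variance, some bias from the value approximation). The rewards and value estimates are placeholder data:

```python
import numpy as np

gamma = 0.99
rewards = np.random.rand(100)        # r_0 ... r_{H-1} from one rollout (placeholder)
values = np.random.rand(101)         # V(s_0) ... V(s_H) from the critic (placeholder)

# Monte Carlo estimate: discounted return-to-go, a single-sample estimate of Q(s_t, u_t).
returns = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running
mc_advantage = returns - values[:-1]

# Bootstrapped (critic) estimate: r_t + gamma * V(s_{t+1}) approximates Q(s_t, u_t).
td_advantage = rewards + gamma * values[1:] - values[:-1]
```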