
Model-based RL

Work in Progress

These are my rough notes.

Vanilla algorithm for Model-based RL

  • For $\text{iter} = 1, 2, \cdots$:
    • Collect data under current policy.
    • Learn dynamics model from past data.
    • Improve the policy by using the dynamics model (either by backprop-through-time through the learned model, or by using the learned model as a simulator to run RL).

We need to iterate this loop to gather more data under different policies and remove the biases introduced by previous policies.
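As a toy sketch of this loop (my own minimal example, not from any paper): a hypothetical 1-D environment with dynamics s′ = s + a, a least-squares linear dynamics model, and random search over a linear policy gain standing in for the policy-improvement step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D environment: s' = s + a + noise, reward = -s^2.
def env_step(s, a):
    return s + a + 0.01 * rng.normal(), -s**2

def collect_data(k, horizon=20):
    """Roll out the current policy a = -k*s (plus exploration noise) in the real env."""
    s, data = rng.normal(), []
    for _ in range(horizon):
        a = -k * s + 0.1 * rng.normal()
        s_next, _ = env_step(s, a)
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_model(data):
    """Learn the dynamics: least-squares fit of s_next from (s, a)."""
    X = np.array([[s, a] for s, a, _ in data])
    y = np.array([s_next for _, _, s_next in data])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda s, a: w[0] * s + w[1] * a

def return_in_model(k, model, horizon=20):
    """Evaluate the policy entirely inside the learned model (never the real env)."""
    s, ret = 1.0, 0.0
    for _ in range(horizon):
        a = -k * s
        ret += -s**2
        s = model(s, a)
    return ret

# The vanilla loop: collect data -> learn model -> improve policy against the model.
k, data = 0.0, []
for it in range(5):
    data += collect_data(k)                     # 1. collect data under current policy
    model = fit_model(data)                     # 2. learn dynamics model from past data
    candidates = k + 0.2 * rng.normal(size=32)  # 3. improve policy using the model
    k = max(candidates, key=lambda c: return_in_model(c, model))
    print(f"iter {it}: policy gain k = {k:.2f}")
```

The point is only the structure: data collected under the current policy feeds the model, and the policy is only ever improved against the learned model.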

Why Model-Based RL

  • Anticipated data-efficiency
    • Getting a model out of the data might allow for more significant policy updates than just a policy gradient step.
  • Learning a model
    • The model is re-usable for other tasks (assuming it is general enough).

MB-RL has much better sample efficiency, so why is it not used all the time?

  • If we already have access to a simulator (e.g. Atari, robotics, …), we don't need to run the policy in a learned one.
  • Training instability (we'll take a look at one approach to address that (ME-TRPO)).
  • Not achieving the same asymptotic performance as model-free methods (to address that: MB-MPO).

Robust Model-based RL: Model-Ensemble TRPO (ME-TRPO)

ME-TRPO is a method for addressing the issue of overfitting in model-based RL. It uses an ensemble of dynamics models to handle regions where insufficient data is available to train the model. By leveraging the regions where there is more disagreement among the models, ME-TRPO can collect more data to improve the ensemble.

This was done with TRPO because it was the SOTA at the time of the work; now it could be model-ensemble PPO instead.

Overfitting in Model-based RL

  • Standard overfitting (in supervised learning):
    • The neural network performs well on training data, but poorly on test data.
      • E.g. on prediction of $s_\text{next}$ from $(s, a)$.
  • The new overfitting challenge in Model-based RL:
    • Policy optimization tends to exploit regions where insufficient data is available to train the model, leading to catastrophic failures, known as “model-bias” (Deisenroth & Rasmussen, 2011; Schneider, 1997; Atkeson & Santamaria, 1997); a toy illustration follows this list.
    • Proposed fix: Model-Ensemble Trust Region Policy Optimization (ME-TRPO).
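A toy illustration of this failure mode (hypothetical saturating dynamics and a linear model, not from the papers cited above): the model is fit only on small actions, so the optimizer happily picks a large action whose effect the model has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "model-bias" example: real dynamics saturate (s' = s + tanh(a)),
# but a linear model is fit only from small actions, so it extrapolates wildly.
def real_step(s, a):
    return s + np.tanh(a)

# Training data: small exploratory actions around s = 2.
S = 2.0 + 0.1 * rng.normal(size=200)
A = 0.3 * rng.normal(size=200)
X = np.stack([S, A], axis=1)
w, *_ = np.linalg.lstsq(X, real_step(S, A), rcond=None)  # learned model: s' ≈ w0*s + w1*a

# "Policy optimization" against the model: pick the action that lands closest to the goal s = 0.
actions = np.linspace(-10, 10, 401)
pred = w[0] * 2.0 + w[1] * actions                  # model's predicted next state from s = 2
a_star = actions[np.argmin(pred ** 2)]              # looks great in the model...

print(f"chosen action: {a_star:.2f}")
print(f"model predicts s' = {w[0]*2.0 + w[1]*a_star:.2f}, real s' = {real_step(2.0, a_star):.2f}")
```

The chosen action lies far outside the training distribution, the model predicts it reaches the goal, and the real environment disagrees.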


Whenever there is enough data to support an accurate model, the models in the ensemble will typically agree. Where there is not enough data, they will typically disagree: the models are initialized differently, and without enough data to pin them down they make different predictions. We can therefore target the regions where disagreement is high and collect more data there to improve the ensemble.
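A minimal sketch of that disagreement signal, using bootstrapped linear models as hypothetical stand-ins for ME-TRPO's neural-network ensemble:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble: linear dynamics models fit on bootstrapped resamples of the data.
def fit_ensemble(transitions, n_models=5):
    """transitions: list of (s, a, s_next) tuples."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(transitions), size=len(transitions))  # bootstrap resample
        X = np.array([[transitions[i][0], transitions[i][1]] for i in idx])
        y = np.array([transitions[i][2] for i in idx])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        models.append(w)
    return models

def disagreement(models, s, a):
    """Std-dev of next-state predictions across the ensemble:
    small where data was plentiful, large where the models are guessing."""
    preds = [w[0] * s + w[1] * a for w in models]
    return float(np.std(preds))

# Data concentrated around s ≈ 0: models agree near 0, disagree far away.
data = [(s, a, s + a + 0.05 * rng.normal())
        for s, a in zip(rng.normal(0, 0.5, 200), rng.normal(0, 0.5, 200))]
ensemble = fit_ensemble(data)
print(disagreement(ensemble, 0.1, 0.0), disagreement(ensemble, 5.0, 0.0))
```

The prediction spread is small near the training data and grows far from it, which is exactly the signal for where more data is needed.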


Adaptive Model-based RL: Model-based Meta-Policy Optimization (MB-MPO)

MB-MPO is an extension of model-based RL that uses meta-learning to improve sample efficiency and asymptotic performance. The agent learns an ensemble of dynamics models and uses them to simulate trajectories. These trajectories are then used to train a meta-policy that can quickly adapt to any of the learned models.

The main advantage of MB-MPO is its ability to adapt quickly, with minimal real data, to how the real world actually behaves.
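Roughly, the meta-objective has the MAML form (notation mine: $J_k(\theta)$ is the expected return of policy $\pi_\theta$ rolled out in the $k$-th learned model, $\alpha$ the inner step size, $K$ the ensemble size):

$$\max_\theta \; \frac{1}{K} \sum_{k=1}^{K} J_k\big(\theta + \alpha \nabla_\theta J_k(\theta)\big)$$

i.e. optimize the pre-adaptation parameters $\theta$ so that a single gradient step inside any of the learned models already yields a good policy.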

Model-based RL Asymptotic Performance

  • Because the learned (ensemble of) model is imperfect:
    • The resulting policy is good in the simulation(s), but not optimal in the real world.
  • Attempted Fix 1: learn a better dynamics model.
    • Such efforts have so far proven insufficient.
  • Attempted Fix 2: model-based RL via meta-policy optimization (MB-MPO); a toy sketch follows this list.
    • Key Idea:
      • Learn an ensemble of models representative of how the real world generally works.
      • Learn an adaptive policy that can quickly adapt to any of the learned models.
      • Such an adaptive policy can then quickly adapt to how the real world works.

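A toy sketch of that adaptive-policy idea (my own scalar example; finite-difference gradients stand in for the policy gradients MB-MPO actually uses): the "ensemble" is a handful of linear models that disagree about how strongly actions move the state, and the update follows a first-order (FOMAML-style) approximation of the meta-objective above.

```python
import numpy as np

# Hypothetical "ensemble" of learned models s' = s + a2 * a that disagree about the
# action gain a2 (stand-ins for learned neural-net dynamics). Policy: a = -theta * s.
models = [0.8, 0.9, 1.0, 1.1, 1.2]

def J(theta, a2, horizon=10):
    """Return of the policy under one learned model, with reward -s^2."""
    s, ret = 1.0, 0.0
    for _ in range(horizon):
        ret += -s**2
        s = s + a2 * (-theta * s)
    return ret

def grad(theta, a2, eps=1e-4):
    """Finite-difference gradient of the return w.r.t. the policy parameter."""
    g = (J(theta + eps, a2) - J(theta - eps, a2)) / (2 * eps)
    return float(np.clip(g, -5.0, 5.0))  # clip: keeps this toy numerically tame

theta, alpha, beta = 0.2, 0.02, 0.1      # pre-adaptation policy, inner/outer step sizes
for _ in range(50):
    outer = 0.0
    for a2 in models:
        adapted = theta + alpha * grad(theta, a2)  # inner step: adapt to model a2
        outer += grad(adapted, a2)                 # first-order (FOMAML-style) outer gradient
    theta += beta * outer / len(models)            # meta-update of pre-adaptation policy
print(f"meta-learned pre-adaptation gain theta = {theta:.2f}")
```

The meta-learned gain is not optimal for any single model, but it is one cheap gradient step away from being good on each of them, which is the property we want when facing the real dynamics.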