
Model-based RL

Work in Progress

These are my rough notes.

Vanilla algorithm for Model-based RL

  • For $\text{iter} = 1, 2, \cdots$:
    • Collect data under current policy.
    • Learn dynamics model from past data.
    • Improve the policy by using the dynamics model (either by backprop-through-time through the learned model, or by using the learned model as a simulator to run RL).

We need to iterate this loop to gather more data under different policies and remove the biases introduced by previous policies.
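As a toy sketch of this loop (my own minimal example, not from any paper): a hypothetical 1-D environment with dynamics s′ = s + a, a least-squares linear dynamics model, and random search over a linear policy gain standing in for the policy-improvement step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D environment: s' = s + a + noise, reward = -s^2.
def env_step(s, a):
    return s + a + 0.01 * rng.normal(), -s**2

def collect_data(k, horizon=20):
    """Roll out the current policy a = -k*s (plus exploration noise) in the real env."""
    s, data = rng.normal(), []
    for _ in range(horizon):
        a = -k * s + 0.1 * rng.normal()
        s_next, _ = env_step(s, a)
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_model(data):
    """Learn the dynamics: least-squares fit of s_next from (s, a)."""
    X = np.array([[s, a] for s, a, _ in data])
    y = np.array([s_next for _, _, s_next in data])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda s, a: w[0] * s + w[1] * a

def return_in_model(k, model, horizon=20):
    """Evaluate the policy entirely inside the learned model (never the real env)."""
    s, ret = 1.0, 0.0
    for _ in range(horizon):
        a = -k * s
        ret += -s**2
        s = model(s, a)
    return ret

# The vanilla loop: collect data -> learn model -> improve policy against the model.
k, data = 0.0, []
for it in range(5):
    data += collect_data(k)                     # 1. collect data under current policy
    model = fit_model(data)                     # 2. learn dynamics model from past data
    candidates = k + 0.2 * rng.normal(size=32)  # 3. improve policy using the model
    k = max(candidates, key=lambda c: return_in_model(c, model))
    print(f"iter {it}: policy gain k = {k:.2f}")
```

The point is only the structure: data collected under the current policy feeds the model, and the policy is only ever improved against the learned model.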

Why Model-Based RL

  • Anticipated data-efficiency
    • Getting a model out of the data might allow for more significant policy updates than just a policy gradient step.
  • Learning a model
    • The model is re-usable for other tasks (assuming it is general enough).

MB-RL has much better sample efficiency, so why is it not used all the time?

  • If we already have access to a simulator (e.g. Atari, robotics, …), we don't need to run the policy in a learned one.
  • Training instability (we'll take a look at one approach to address that (ME-TRPO)).
  • Not achieving the same asymptotic performance as model-free methods (to address that: MB-MPO).

Robust Model-based RL: Model-Ensemble TRPO (ME-TRPO)

ME-TRPO is a method for addressing the issue of overfitting in model-based RL. It uses an ensemble of dynamics models to handle regions where insufficient data is available to train the model. By leveraging the regions where there is more disagreement among the models, ME-TRPO can collect more data to improve the ensemble.

This was done with TRPO because it was the SOTA at the time of the work; now it could be model-ensemble PPO instead.

Overfitting in Model-based RL

  • Standard overfitting (in supervised learning):
    • The neural network performs well on training data, but poorly on test data.
      • E.g. on prediction of $s_\text{next}$ from $(s, a)$.
  • The new overfitting challenge in Model-based RL:
    • Policy optimization tends to exploit regions where insufficient data is available to train the model, leading to catastrophic failures, known as “model-bias” (Deisenroth & Rasmussen, 2011; Schneider, 1997; Atkeson & Santamaria, 1997); a toy illustration follows this list.
    • Proposed fix: Model-Ensemble Trust Region Policy Optimization (ME-TRPO).
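A toy illustration of this failure mode (hypothetical saturating dynamics and a linear model, not from the papers cited above): the model is fit only on small actions, so the optimizer happily picks a large action whose effect the model has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "model-bias" example: real dynamics saturate (s' = s + tanh(a)),
# but a linear model is fit only from small actions, so it extrapolates wildly.
def real_step(s, a):
    return s + np.tanh(a)

# Training data: small exploratory actions around s = 2.
S = 2.0 + 0.1 * rng.normal(size=200)
A = 0.3 * rng.normal(size=200)
X = np.stack([S, A], axis=1)
w, *_ = np.linalg.lstsq(X, real_step(S, A), rcond=None)  # learned model: s' ≈ w0*s + w1*a

# "Policy optimization" against the model: pick the action that lands closest to the goal s = 0.
actions = np.linspace(-10, 10, 401)
pred = w[0] * 2.0 + w[1] * actions                  # model's predicted next state from s = 2
a_star = actions[np.argmin(pred ** 2)]              # looks great in the model...

print(f"chosen action: {a_star:.2f}")
print(f"model predicts s' = {w[0]*2.0 + w[1]*a_star:.2f}, real s' = {real_step(2.0, a_star):.2f}")
```

The chosen action lies far outside the training distribution, the model predicts it reaches the goal, and the real environment disagrees.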


Whenever there is enough data to support an accurate model, the models in the ensemble will typically agree. Where there is not enough data, they will typically disagree: the models are initialized differently, and without enough data to pin them down they make different predictions. We can therefore target the regions where disagreement is high and collect more data there to improve the ensemble.
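A minimal sketch of that disagreement signal, using bootstrapped linear models as hypothetical stand-ins for ME-TRPO's neural-network ensemble:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble: linear dynamics models fit on bootstrapped resamples of the data.
def fit_ensemble(transitions, n_models=5):
    """transitions: list of (s, a, s_next) tuples."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(transitions), size=len(transitions))  # bootstrap resample
        X = np.array([[transitions[i][0], transitions[i][1]] for i in idx])
        y = np.array([transitions[i][2] for i in idx])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        models.append(w)
    return models

def disagreement(models, s, a):
    """Std-dev of next-state predictions across the ensemble:
    small where data was plentiful, large where the models are guessing."""
    preds = [w[0] * s + w[1] * a for w in models]
    return float(np.std(preds))

# Data concentrated around s ≈ 0: models agree near 0, disagree far away.
data = [(s, a, s + a + 0.05 * rng.normal())
        for s, a in zip(rng.normal(0, 0.5, 200), rng.normal(0, 0.5, 200))]
ensemble = fit_ensemble(data)
print(disagreement(ensemble, 0.1, 0.0), disagreement(ensemble, 5.0, 0.0))
```

The prediction spread is small near the training data and grows far from it, which is exactly the signal for where more data is needed.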


Adaptive Model-based RL: Model-based Meta-Policy Optimization (MB-MPO)

MB-MPO is an extension of model-based RL that uses meta-learning to improve sample efficiency and asymptotic performance. The agent learns an ensemble of dynamics models and uses them to simulate trajectories. These trajectories are then used to train a meta-policy that can quickly adapt to any of the learned models.

The main advantage of MB-MPO is its ability to adapt quickly, with minimal real data, to how the real world actually behaves.
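Roughly, the meta-objective has the MAML form (notation mine: $J_k(\theta)$ is the expected return of policy $\pi_\theta$ rolled out in the $k$-th learned model, $\alpha$ the inner step size, $K$ the ensemble size):

$$\max_\theta \; \frac{1}{K} \sum_{k=1}^{K} J_k\big(\theta + \alpha \nabla_\theta J_k(\theta)\big)$$

i.e. optimize the pre-adaptation parameters $\theta$ so that a single gradient step inside any of the learned models already yields a good policy.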

Model-based RL Asymptotic Performance

  • Because the learned (ensemble of) model is imperfect:
    • The resulting policy is good in the simulation(s), but not optimal in the real world.
  • Attempted Fix 1: learn a better dynamics model.
    • Such efforts have so far proven insufficient.
  • Attempted Fix 2: model-based RL via meta-policy optimization (MB-MPO); a toy sketch follows this list.
    • Key Idea:
      • Learn an ensemble of models representative of how the real world generally works.
      • Learn an adaptive policy that can quickly adapt to any of the learned models.
      • Such an adaptive policy can then quickly adapt to how the real world works.

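A toy sketch of that adaptive-policy idea (my own scalar example; finite-difference gradients stand in for the policy gradients MB-MPO actually uses): the "ensemble" is a handful of linear models that disagree about how strongly actions move the state, and the update follows a first-order (FOMAML-style) approximation of the meta-objective above.

```python
import numpy as np

# Hypothetical "ensemble" of learned models s' = s + a2 * a that disagree about the
# action gain a2 (stand-ins for learned neural-net dynamics). Policy: a = -theta * s.
models = [0.8, 0.9, 1.0, 1.1, 1.2]

def J(theta, a2, horizon=10):
    """Return of the policy under one learned model, with reward -s^2."""
    s, ret = 1.0, 0.0
    for _ in range(horizon):
        ret += -s**2
        s = s + a2 * (-theta * s)
    return ret

def grad(theta, a2, eps=1e-4):
    """Finite-difference gradient of the return w.r.t. the policy parameter."""
    g = (J(theta + eps, a2) - J(theta - eps, a2)) / (2 * eps)
    return float(np.clip(g, -5.0, 5.0))  # clip: keeps this toy numerically tame

theta, alpha, beta = 0.2, 0.02, 0.1      # pre-adaptation policy, inner/outer step sizes
for _ in range(50):
    outer = 0.0
    for a2 in models:
        adapted = theta + alpha * grad(theta, a2)  # inner step: adapt to model a2
        outer += grad(adapted, a2)                 # first-order (FOMAML-style) outer gradient
    theta += beta * outer / len(models)            # meta-update of pre-adaptation policy
print(f"meta-learned pre-adaptation gain theta = {theta:.2f}")
```

The meta-learned gain is not optimal for any single model, but it is one cheap gradient step away from being good on each of them, which is the property we want when facing the real dynamics.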