Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning
I replicated the results presented in this paper. My replication codebase can be found in this GitHub repository (in Python).
The original codebase for the paper can be found in this GitHub repository (in C++).
This content is a summary and includes my personal takeaways from the paper "Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning" by Talvitie et al. It reflects the points I found most relevant, along with my own conclusions on the topics discussed.
For a more comprehensive understanding of the topics and to align with the authors' perspectives, please READ THE PAPER.
Introduction
Model-Based Reinforcement Learning (MBRL) aims to learn a model of the environment dynamics to reduce sample complexity and potentially improve policy performance. Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, R, \gamma)$
be a Markov Decision Process (MDP), where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s' \mid s, a)$ is the transition distribution, $R(s, a)$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. The goal is to find a policy $\pi$ that maximizes the expected discounted return
$$J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right].$$
In MBRL, the agent simultaneously learns a model that approximates and uses it to simulate state-action transitions. Naively treating simulated rollouts as ground truth can mislead the value function and degrade the policy. This problem is particularly acute when the model has insufficient capacity, when the hypothesis space is inadequately explored, or when training samples are limited.
The paper defines:
- Aleatoric uncertainty as the inherent randomness in the environment. This uncertainty is irreducible: no amount of additional data or model refinement can eliminate it.
- Epistemic uncertainty as the uncertainty over the model parameters due to limited data. Unlike aleatoric uncertainty, epistemic uncertainty is reducible through additional data.
- Model inadequacy as the error arising from a model's structural limitations or oversimplifications. This type of uncertainty cannot be fully resolved merely by acquiring more data; it requires a fundamental revision of the model.
Selective planning aims to mitigate catastrophic planning failures by weighting model-based value targets according to their estimated reliability.
The paper uses Model-based Value Expansion (MVE), which computes multi-step targets:
$$T^{0} = r_t + \gamma \max_{a} Q(s_{t+1}, a),$$
and for $h \geq 1$,
$$T^{h} = r_t + \sum_{i=1}^{h} \gamma^{i} \hat{r}_{t+i} + \gamma^{h+1} \max_{a} Q(\hat{s}_{t+h+1}, a),$$
where the simulated rewards $\hat{r}_{t+i}$ and states $\hat{s}_{t+i}$ are generated by rolling the model forward from $s_{t+1}$. By weighting these targets based on uncertainty, selective planning can exploit accurate predictions while ignoring harmful ones.
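To make the rollout mechanics concrete, here is a minimal sketch of how the multi-step MVE targets could be computed for a single real transition. The `q` and `model` callables are assumed interfaces for illustration (a greedy action-value function and a deterministic one-step model), not the paper's exact implementation.

```python
import numpy as np

def mve_targets(q, model, s, a, r, s_next, gamma=0.99, horizon=3):
    """Compute the 0..H step MVE targets for one real transition (s, a, r, s').

    `q(state)` returns a vector of action values; `model(state, action)`
    returns a predicted (next_state, reward) pair. Both are assumed
    interfaces for this sketch.
    """
    targets = []
    # 0-step (model-free) target: r_t + gamma * max_a Q(s_{t+1}, a)
    targets.append(r + gamma * np.max(q(s_next)))

    # Roll the model forward from the real next state, acting greedily.
    sim_state = s_next
    sim_rewards = []
    for h in range(1, horizon + 1):
        sim_action = int(np.argmax(q(sim_state)))
        sim_state, sim_reward = model(sim_state, sim_action)
        sim_rewards.append(sim_reward)
        # h-step target: real reward + discounted simulated rewards
        # + bootstrapped value at the final simulated state.
        target = r
        target += sum(gamma ** (i + 1) * ri for i, ri in enumerate(sim_rewards))
        target += gamma ** (h + 1) * np.max(q(sim_state))
        targets.append(target)
    return np.array(targets)
```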
The paper presents a novel uncertainty measure for MVE: Bounding-Box Inference (BBI). BBI infers upper and lower bounds on the multi-step targets and weights them according to the gap between these bounds.
Problem Setting and Mathematical Formulation
We consider an MDP with unknown dynamics $p$ and reward $R$. At each time $t$, the agent observes a state $s_t$, selects an action $a_t$, and receives a reward $r_t$. The environment transitions to a new state $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.
The agent maintains a state-action value function $Q(s, a)$ to approximate
$$Q^{*}(s, a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a\right],$$
where the expectation is over trajectories that follow an optimal policy after the first action.
The agent also learns a model $\hat{p}$, $\hat{R}$ approximating $p$ and $R$.
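As a concrete illustration of the learned model, the sketch below maintains empirical transition counts and mean rewards. This tabular, count-based formulation is my own simplification for illustration and is not the model class used in the paper.

```python
from collections import defaultdict

import numpy as np


class TabularModel:
    """Minimal sketch of a learned model approximating p and R (illustrative)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                 # (s, a) -> summed reward
        self.visits = defaultdict(int)                       # (s, a) -> visit count

    def update(self, s, a, r, s_next):
        """Record one observed transition (s, a, r, s')."""
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def sample(self, s, a):
        """Sample a next state from the empirical distribution and return it
        with the mean reward estimate. Assumes (s, a) was visited at least once."""
        successors = self.counts[(s, a)]
        states = list(successors)
        probs = np.array([successors[sp] for sp in states], dtype=float)
        probs /= probs.sum()
        s_next = states[np.random.choice(len(states), p=probs)]
        r_hat = self.reward_sum[(s, a)] / self.visits[(s, a)]
        return s_next, r_hat
```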
Selective MVE Update
Let $w_h$ be the weight assigned to the $h$-step target $T^{h}$. The selective MVE update is given by:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( \sum_{h=0}^{H} w_h\, T^{h} - Q(s_t, a_t) \right), \qquad \sum_{h=0}^{H} w_h = 1.$$
Let $u_h$ be the uncertainty measure associated with horizon $h$. We form weights via a softmin:
$$w_h = \frac{\exp(-u_h / \tau)}{\sum_{h'=0}^{H} \exp(-u_{h'} / \tau)},$$
where $\tau > 0$ is a temperature parameter. As $\tau \to 0$, all weight concentrates on the least-uncertain target (the model-free 0-step target), so the update approaches Q-learning; as $\tau \to \infty$, it approaches uniform weighting.
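A minimal sketch of this weighting scheme is shown below; the tabular `q_values` array and the per-horizon `targets` and `uncertainties` vectors are assumed inputs for illustration.

```python
import numpy as np


def softmin_weights(uncertainties, tau=1.0):
    """Turn per-horizon uncertainties u_h into weights w_h via a softmin."""
    u = np.asarray(uncertainties, dtype=float)
    logits = -(u - u.min()) / tau   # shift by the minimum for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def selective_mve_update(q_values, s, a, targets, uncertainties, alpha=0.1, tau=1.0):
    """Apply one selective MVE update to a tabular Q (illustrative interface)."""
    w = softmin_weights(uncertainties, tau)
    weighted_target = float(np.dot(w, targets))
    q_values[s, a] += alpha * (weighted_target - q_values[s, a])
    return q_values


# Example: the 0-step (model-free) target has zero uncertainty, so a small tau
# pushes almost all weight onto it, recovering the Q-learning update.
print(softmin_weights([0.0, 0.3, 1.5], tau=0.1))  # ~[0.95, 0.047, ...]
```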
The challenge is to define the uncertainty $u_h$ over the simulated targets.
Uncertainty Measures
Several uncertainty measures are compared in the paper:
- One-Step Predicted Variance (1SPV):
1SPV estimates uncertainty by aggregating the predicted variances of one-step model predictions over the rollout. Specifically, the uncertainty at horizon $h$ is given by:
$$u_h = \sum_{i=1}^{h} \left( \sum_{d} \hat{\sigma}^{2}_{d}(\hat{s}_{t+i}, \hat{a}_{t+i}) + \hat{\sigma}^{2}_{r}(\hat{s}_{t+i}, \hat{a}_{t+i}) \right),$$
where $\hat{\sigma}^{2}_{d}(s, a)$ represents the predicted variance of the next-state dimension $d$, and $\hat{\sigma}^{2}_{r}(s, a)$ represents the predicted variance of the reward at state $s$ and action $a$ (with $\hat{s}_{t+1} = s_{t+1}$). Although computationally efficient, this method can be overly conservative when state uncertainties do not translate directly to uncertainty in TD targets.
- Monte Carlo Target Variance (MCTV):
This approach estimates uncertainty by sampling $n$ rollouts from the model for the same initial transition. Given sampled targets $T^{h}_{(1)}, \dots, T^{h}_{(n)}$ at horizon $h$, the mean target is defined as:
$$\bar{T}^{h} = \frac{1}{n} \sum_{j=1}^{n} T^{h}_{(j)},$$
and the uncertainty is estimated as the sample variance:
$$u_h = \frac{1}{n} \sum_{j=1}^{n} \left( T^{h}_{(j)} - \bar{T}^{h} \right)^{2}.$$
This method directly estimates uncertainty over TD targets, though its accuracy depends on the number of samples and the model’s predicted distribution.
- Monte Carlo Target Range (MCTR):
Instead of using variance, MCTR defines uncertainty as the range of the sampled targets:
$$u_h = \max_{j} T^{h}_{(j)} - \min_{j} T^{h}_{(j)}.$$
This measure directly captures the spread of possible outcomes and is less sensitive to assumptions about the distribution’s shape, though it still depends on the number of samples.
- Bounding-Box Inference (BBI):
BBI uses a bounding box over states and actions to infer upper and lower bounds on the TD targets. Consider a bounding box $[\underline{s}, \overline{s}]$ over states and a set $\bar{\mathcal{A}} \subseteq \mathcal{A}$ over actions. Define: