Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning
I replicated the results presented in this paper. The code for my replication can be found in this GitHub repository (in Python).
The original codebase for the paper can be found in this GitHub repository (in C++).
This content is a summary and includes my personal takeaways from the paper "Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning" by Talvitie et al. It reflects the points I found most relevant, along with my own conclusions on the topics discussed.
For a more comprehensive understanding of the topics and to align with the authors' perspectives, please READ THE PAPER.
Introduction
Model-Based Reinforcement Learning (MBRL) aims to learn a model of the environment dynamics to reduce sample complexity and potentially improve policy performance. Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$ be a Markov Decision Process (MDP), where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s' \mid s, a)$ is the transition distribution, $r(s, a)$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. The goal is to find a policy $\pi$ that maximizes the expected discounted return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$.
In MBRL, the agent simultaneously learns a model $\hat{p}, \hat{r}$ that approximates $p$ and $r$, and uses it to simulate state-action transitions. Naively treating simulated rollouts as ground truth can mislead the value function and degrade the policy. This problem is particularly acute when there is insufficient model capacity, inadequate exploration of the hypothesis space, or limited training samples. The paper defines:
- Aleatoric uncertainty as tied to the nature of the stochastic process, specifically the randomness or noise in the environment. This type of uncertainty is irreducible, meaning that no amount of additional data or model refinement can eliminate it, as it is intrinsic to the system being modeled.
- Epistemic uncertainty as tied to the lack of knowledge or uncertainty about the environment or model parameters. Unlike aleatoric uncertainty, it is reducible, meaning it can be decreased by gathering more data and training the model on it.
- Model inadequacy as referring to a model's limitations in accurately capturing the true dynamics or complexity of the environment. This uncertainty occurs when the model is structurally flawed, overly simplistic, or based on incorrect assumptions, leading to systematic prediction errors. Unlike epistemic uncertainty, model inadequacy cannot be fully resolved by acquiring more data; it requires fundamentally revising or improving the model's structure to better reflect the underlying system.
Selective planning aims to mitigate this by weighing model-based value targets according to their estimated reliability. Let $\hat{q}$ be an estimate of the state-action value function. Given a real transition $(s, a, r, s')$, Model-based Value Expansion (MVE) computes multi-step targets:

$$T_0 = r + \gamma \max_{a'} \hat{q}(s', a'),$$

and for $h \geq 1$,

$$T_h = r + \sum_{i=1}^{h} \gamma^i \hat{r}_i + \gamma^{h+1} \max_{a'} \hat{q}(\hat{s}_{h+1}, a'),$$

where $\hat{s}_1 = s'$ and the subsequent $\hat{r}_i, \hat{s}_{i+1}$ are generated by rolling out the learned model $\hat{p}, \hat{r}$. By weighting these targets based on uncertainty, selective planning can exploit accurate predictions while ignoring harmful ones.
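To make the target construction concrete, here is a minimal sketch in Python (the replication's language). The helpers model_step and q_values are hypothetical placeholders for a learned one-step model and a value estimate, and the rollout uses greedy actions; these are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def mve_targets(r, s_next, model_step, q_values, gamma, H):
    """Compute MVE targets T_0..T_H for a real transition (s, a, r, s_next).

    model_step(s, a) -> (r_hat, s_hat): hypothetical learned one-step model.
    q_values(s) -> array of action values q_hat(s, .): hypothetical value estimate.
    Only r and s_next from the real transition are needed for the targets.
    """
    targets = []
    reward_sum = r                   # r + sum_{i<=h} gamma^i * r_hat_i
    s_hat, discount = s_next, gamma  # discount holds the gamma^{h+1} bootstrap factor
    for h in range(H + 1):
        # T_h = accumulated (discounted) rewards + gamma^{h+1} * max_a q_hat(s_hat_{h+1}, a)
        targets.append(reward_sum + discount * np.max(q_values(s_hat)))
        # Roll the model one more step (greedy action) to prepare T_{h+1}.
        a_hat = int(np.argmax(q_values(s_hat)))
        r_hat, s_hat = model_step(s_hat, a_hat)
        reward_sum += discount * r_hat
        discount *= gamma
    return np.array(targets)
```

For $h = 0$ this reduces to the ordinary one-step target built entirely from real experience; every additional step leans further on the model.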
The paper presents a novel uncertainty measure for MVE, Bounding-Box Inference (BBI). BBI infers upper and lower bounds on the multi-step targets, and down-weights each target according to the gap between its bounds.
Problem Setting and Mathematical Formulation
We consider an MDP with unknown dynamics $p$ and reward $r$. At each time $t$, the agent observes a state $s_t \in \mathcal{S}$, selects an action $a_t \in \mathcal{A}$, and receives a reward $r_t = r(s_t, a_t)$. The environment transitions to a new state $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.
The agent maintains a state-action value function $\hat{q}$ aiming to approximate $q^*$, where $q^*(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_0 = a\right]$ under optimal behavior thereafter. The agent also learns a model approximating $p$ and $r$.
Let $w_h$ be the weight assigned to the $h$-step target, with $\sum_{h=0}^{H} w_h = 1$. The MVE update moves the value estimate toward the weighted combination of targets:

$$\hat{q}(s, a) \leftarrow \hat{q}(s, a) + \alpha \left( \sum_{h=0}^{H} w_h T_h - \hat{q}(s, a) \right).$$
The challenge is to define the uncertainty $u_h$ over the simulated targets $T_h$, and to turn it into the weights $w_h$.
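The following sketch shows one way to turn per-horizon uncertainties into a selective MVE update. The inverse-uncertainty weighting used here is an illustrative assumption, not necessarily the paper's exact weighting rule.

```python
import numpy as np

def weighted_mve_update(q_sa, targets, uncertainties, lr=0.1, eps=1e-8):
    """Selective MVE update: combine per-horizon targets with uncertainty-based weights.

    targets[h]       -- the h-step MVE target T_h
    uncertainties[h] -- the estimated uncertainty u_h of T_h
    The inverse-uncertainty weighting below is one illustrative choice.
    """
    w = 1.0 / (np.asarray(uncertainties) + eps)
    w /= w.sum()                          # normalize so the weights sum to 1
    combined_target = np.dot(w, targets)  # sum_h w_h * T_h
    return q_sa + lr * (combined_target - q_sa)
```

Targets with large uncertainty receive weights close to zero, so the update falls back on the shorter, more reliable horizons.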
Uncertainty Measures
Several uncertainty measures are compared in the paper:
- One-Step Predicted Variance (1SPV):
One-Step Predicted Variance estimates uncertainty by aggregating the predicted variances of one-step model predictions over the course of an $h$-step rollout. It assumes that the total uncertainty at each horizon can be represented as a linear sum of independent variances from individual state and reward dimensions. Specifically, the uncertainty at horizon $h$ is given by:

$$u_h = \sum_{i=1}^{h} \left( \hat{\sigma}^2_r(\hat{s}_i, \hat{a}_i) + \sum_{d} \hat{\sigma}^2_d(\hat{s}_i, \hat{a}_i) \right),$$

where $\hat{\sigma}^2_d(\hat{s}_i, \hat{a}_i)$ represents the predicted variance of the next-state dimension $d$, and $\hat{\sigma}^2_r(\hat{s}_i, \hat{a}_i)$ represents the predicted variance of the reward at state $\hat{s}_i$ and action $\hat{a}_i$ (see the code sketch after this list).
While simple and computationally efficient, this method has key limitations. The assumption that state uncertainties directly approximate TD target uncertainties can lead to overly conservative estimates, especially when state dimensions with high variance are irrelevant to the value function.
- Monte Carlo Target Variance (MCTV):
Monte Carlo Target Variance takes a direct approach to estimating uncertainty by sampling multiple rollouts from the model for the same initial transition. Given $n$ sampled TD targets $T_h^{(1)}, \ldots, T_h^{(n)}$ at horizon $h$, the mean target is defined as:

$$\bar{T}_h = \frac{1}{n} \sum_{j=1}^{n} T_h^{(j)},$$

and the uncertainty is estimated as the sample variance:

$$u_h = \frac{1}{n} \sum_{j=1}^{n} \left( T_h^{(j)} - \bar{T}_h \right)^2$$

(see the code sketch after this list).
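To ground the two measures above, here is a minimal Python sketch. The array shapes and the use of the population variance are illustrative assumptions, not taken from either codebase.

```python
import numpy as np

def one_step_predicted_variance(state_vars, reward_vars):
    """1SPV: accumulate predicted one-step variances along a rollout.

    state_vars[i, d] -- predicted variance of next-state dimension d at rollout step i
    reward_vars[i]   -- predicted variance of the reward at rollout step i
    Returns u[h] for h = 0..H, where u[0] = 0 because the 0-step target uses no model predictions.
    """
    per_step = reward_vars + state_vars.sum(axis=1)       # variance contributed by each step
    return np.concatenate(([0.0], np.cumsum(per_step)))   # u_h = sum over the first h steps

def monte_carlo_target_variance(sampled_targets):
    """MCTV: sample variance of TD targets across repeated model rollouts.

    sampled_targets[j, h] -- the h-step target from the j-th sampled rollout
    Returns u[h], the variance across rollouts at each horizon h.
    """
    return sampled_targets.var(axis=0)
```

With state_vars of shape (H, D) and sampled_targets of shape (n, H+1), both functions return one uncertainty value per horizon, which can then be converted into weights as in the update sketch above.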