
Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning

Talvitie, Erin J., et al. “Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning.” Reinforcement Learning Conference, 2024.

Info

I replicated the results presented in this paper. My codebase can be found in this GitHub repository (in Python).

The original codebase for the paper can be found in this GitHub repository (in C++).

warning

This content is a summary and includes my personal takeaways from the paper "Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning" by Talvitie et al. It reflects the points I found most relevant, along with my own conclusions on the topics discussed.

For a more comprehensive understanding of the topics and to align with the authors' perspectives, please READ THE PAPER.

Introduction

Model-Based Reinforcement Learning (MBRL) aims to learn a model of the environment dynamics to reduce sample complexity and potentially improve policy performance. Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$ be a Markov Decision Process (MDP), where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s'|s,a)$ is the transition distribution, $r(s,a)$ is the reward function, and $\gamma \in [0,1]$ is the discount factor. The goal is to find a policy $\pi(a|s)$ that maximizes the expected discounted return $G_t = \sum_{i=0}^{\infty} \gamma^i R_{t+i+1}$.
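
As a quick worked example (my own, not from the paper), the return $G_t$ can be computed from a finite reward sequence; `discounted_return`, `rewards`, and `gamma` below are illustrative names.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_i gamma^i * R_{t+i+1} for a finite list of rewards."""
    g = 0.0
    # Work backwards through the rewards: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards [1, 0, 2] with gamma = 0.9 give 1 + 0.9*0 + 0.81*2 = 2.62.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```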

In MBRL, the agent simultaneously learns a model $(\hat{p}, \hat{r})$ that approximates $(p, r)$, and uses it to simulate state-action transitions. Naively treating simulated rollouts as ground truth can mislead the value function and degrade the policy. This problem is particularly acute when there is insufficient model capacity, inadequate exploration of the hypothesis space, or limited training samples. The paper defines:

  • Aleatoric uncertainty as tied to the nature of the stochastic process, specifically the randomness or noise in the environment. This type of uncertainty is irreducible, meaning that no amount of additional data or model refinement can eliminate it, as it is intrinsic to the system being modeled.

  • Epistemic uncertainty as tied to the lack of knowledge or uncertainty about the environment or model parameters. Unlike aleatoric uncertainty, it is reducible, meaning it can be decreased by gathering and training the model on additional data.

  • Model inadequacy as referring to a model's limitations in accurately capturing the true dynamics or complexity of the environment. This uncertainty occurs when the model is structurally flawed, overly simplistic, or based on incorrect assumptions, leading to systematic prediction errors. Unlike epistemic uncertainty, model inadequacy cannot be fully resolved by acquiring more data; it requires fundamentally revising or improving the model's structure to better reflect the underlying system.

Selective planning aims to mitigate this by weighting model-based value targets according to their estimated reliability. Let $\hat{q}(s,a)$ be an estimate of the state-action value function. Model-based Value Expansion (MVE) computes multi-step targets:

$$\hat{\rho}_1 = R_{t+1} + \gamma \max_{a'}\hat{q}(S_{t+1}, a'),$$

and for $h>1$,

$$\hat{\rho}_h = R_{t+1} + \sum_{i=2}^{h}\gamma^{i-1}\hat{r}_{t+i} + \gamma^h \max_{a'} \hat{q}(\hat{S}_{t+h}, a'),$$

where $(\hat{S}_{t+i}, \hat{r}_{t+i})$ are generated by the model $(\hat{p}, \hat{r})$. By weighting these targets based on uncertainty, selective planning can exploit accurate predictions while ignoring harmful ones.
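
To make the rollout concrete, here is a minimal sketch (mine, not the paper's implementation) of how the $h$-step MVE targets could be computed from a single real transition; `model`, `q_hat`, and `actions` are assumed interfaces, and the rollout follows the greedy action with respect to `q_hat`.

```python
def mve_targets(r_next, s_next, model, q_hat, actions, gamma, H):
    """Return the targets [rho_hat_1, ..., rho_hat_H] for one real transition.

    Assumed interfaces (not from the paper's codebase):
      model(s, a) -> (s_hat, r_hat)   deterministic one-step prediction
      q_hat(s, a) -> float            current state-action value estimate
    """
    targets = []
    ret = r_next                      # R_{t+1} plus discounted model rewards so far
    s_sim, discount = s_next, gamma   # current simulated state and gamma^h
    for _ in range(H):
        targets.append(ret + discount * max(q_hat(s_sim, b) for b in actions))
        # Extend the rollout one more step with the model, acting greedily.
        a_sim = max(actions, key=lambda b: q_hat(s_sim, b))
        s_sim, r_hat = model(s_sim, a_sim)
        ret += discount * r_hat
        discount *= gamma
    return targets
```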

The paper presents a novel uncertainty measure for MVE, Bounding-Box Inference (BBI). BBI infers upper and lower bounds on the multi-step targets and down-weights targets according to the gap between the bounds.

Problem Setting and Mathematical Formulation

We consider an MDP with unknown dynamics $p$ and reward $r$. At each time $t$, the agent observes a state $S_t \in \mathcal{S}$, selects an action $A_t \in \mathcal{A}$, and receives a reward $R_{t+1}=r(S_t,A_t)$. The environment transitions to a new state $S_{t+1} \sim p(\cdot|S_t,A_t)$.

The agent maintains a state-action value function $\hat{q}(s,a)$ aiming to approximate $q_{*}(s,a) = \max_\pi q_{\pi}(s,a)$, where $q_{\pi}(s,a) = \mathbb{E}\left[\sum_{i=0}^{\infty}\gamma^i R_{t+i+1} \mid S_t=s, A_t=a, \pi\right]$. The agent also learns a model $(\hat{p}, \hat{r})$ approximating $p$ and $r$.
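
For a small discrete task, one naive way to maintain $(\hat{p}, \hat{r})$ is with empirical counts over observed transitions. The sketch below only illustrates that idea (it is my own, not one of the model classes used in the paper's experiments and listed later in this post).

```python
from collections import defaultdict

class TabularModel:
    """Empirical-count model: estimates p_hat(s'|s,a) and r_hat(s,a) from data."""

    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                      # (s, a) -> summed reward
        self.visits = defaultdict(int)                            # (s, a) -> visit count

    def update(self, s, a, r, s_next):
        """Record one observed transition (s, a, r, s')."""
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def p_hat(self, s, a):
        """Empirical next-state distribution for (s, a), as a dict of probabilities."""
        n = self.visits[(s, a)]
        return {sn: c / n for sn, c in self.next_counts[(s, a)].items()}

    def r_hat(self, s, a):
        """Empirical mean reward for (s, a)."""
        return self.reward_sum[(s, a)] / self.visits[(s, a)]
```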

Let $w_h \geq 0$ be the weight assigned to the $h$-step target. The MVE update:

$$\hat{q}(S_t,A_t) \leftarrow \hat{q}(S_t,A_t) + \alpha\left(\sum_{h=1}^H\frac{w_h \hat{\rho}_h}{\sum_{h'=1}^H w_{h'}} - \hat{q}(S_t,A_t)\right).$$

The challenge is to define the uncertainty over the simulated targets, $\hat{\rho}_h$.
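
In code, the update above is just a TD step toward the weight-normalized mixture of the $h$-step targets. A minimal tabular sketch, assuming `q_hat` is a dictionary keyed by (state, action) and `targets`/`weights` are lists of length $H$:

```python
def selective_mve_update(q_hat, s, a, targets, weights, alpha):
    """Move q_hat(s, a) toward the weighted average of the h-step targets."""
    mixed_target = sum(w * rho for w, rho in zip(weights, targets)) / sum(weights)
    q_hat[(s, a)] += alpha * (mixed_target - q_hat[(s, a)])
```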

Uncertainty Measures

Several uncertainty measures are compared in the paper:

  1. One-Step Predicted Variance (1SPV):
    One-Step Predicted Variance estimates uncertainty by aggregating the predicted variances of one-step model predictions over the course of an $h$-step rollout. It assumes that the total uncertainty at each horizon $h$ can be represented as a linear sum of independent variances from individual state and reward dimensions. Specifically, the uncertainty at horizon $h$ is given by:

    $$u_h = \sum_{j=0}^{h-1} \left( \sum_d \hat{\sigma}_d^2(s_{t+j}, a_{t+j}) + \hat{\sigma}_r^2(s_{t+j}, a_{t+j}) \right)^2,$$

    where $\hat{\sigma}_d^2(s, a)$ represents the predicted variance of the next-state dimension $d$, and $\hat{\sigma}_r^2(s, a)$ represents the predicted variance of the reward at state $s$ and action $a$.

    While simple and computationally efficient, this method has key limitations. The assumption that state uncertainties directly approximate TD target uncertainties can lead to overly conservative estimates, especially when state dimensions with high variance are irrelevant to the value function.

  2. Monte Carlo Target Variance (MCTV):

    Monte Carlo Target Variance takes a direct approach to estimating uncertainty by sampling multiple rollouts from the model for the same initial transition. Given $k$ sampled TD targets $\{\hat{\rho}^j_h\}_{j=1}^k$ at horizon $h$, the mean target is defined as:

    $$\hat{\rho}_h = \frac{1}{k} \sum_{j=1}^k \hat{\rho}^j_h,$$

    and the uncertainty is estimated as the sample variance:

    $$u_h = \frac{1}{k-1} \sum_{j=1}^k (\hat{\rho}^j_h - \hat{\rho}_h)^2.$$

    This method provides a principled estimate of uncertainty directly over the TD targets. However, it is sensitive to the underlying model’s predicted probability distribution. If the model misrepresents the dynamics or assigns inappropriate probabilities to certain transitions, the sample variance may fail to reflect true model inaccuracies. Moreover, the accuracy of this method depends on the number of Monte Carlo samples ($k$), which introduces a trade-off between computational cost and the reliability of the estimate.

  3. Monte Carlo Target Range (MCTR):

    Monte Carlo Target Range refines the idea of using samples from the model but replaces variance with range as the measure of uncertainty. Given the same $k$ sampled TD targets $\{\hat{\rho}^j_h\}_{j=1}^k$ at horizon $h$, the uncertainty is defined as:

    $$u_h = \max_j \hat{\rho}^j_h - \min_j \hat{\rho}^j_h.$$

    This range-based approach directly captures the spread of possible outcomes without relying on assumptions about the distribution’s shape. However, like MCTV, MCTR still depends on the model’s predicted distribution. If the model assigns negligible probabilities to critical transitions, the range may underestimate the true uncertainty. Additionally, the estimate can be sensitive to the number of samples ($k$) used, as a small $k$ may fail to capture the full range of possible TD targets.

  4. Bounding-Box Inference (BBI):

    A novel uncertainty measure that relies on bounding-box approximations of possible outcomes. Consider a bounding box over states, $\overline{\underline{s}} = [\underline{s}, \overline{s}]$, and over actions, $\overline{\underline{a}}$. Define:

    $$\underline{q}(\overline{\underline{s}}_t,\overline{\underline{a}}_t) = \inf_{s\in\overline{\underline{s}}_t,\, a\in\overline{\underline{a}}_t}\hat{q}(s,a), \quad \overline{q}(\overline{\underline{s}}_t,\overline{\underline{a}}_t) = \sup_{s\in\overline{\underline{s}}_t,\, a\in\overline{\underline{a}}_t}\hat{q}(s,a).$$

    Similarly, given the current state and action $(S_t,A_t)$ and the actual next transition $(S_{t+1}, R_{t+1})$, we define a bounding-box model rollout:

    $$\overline{\underline{r}}_{t+i},\ \overline{\underline{s}}_{t+i},\ \overline{\underline{a}}_{t+i}, \quad \text{for } i=2,\dots,h.$$

    The bounding-box TD target bounds at horizon $h$ are:

    $$\underline{\rho}_h = R_{t+1} + \sum_{i=2}^h \gamma^{i-1}\underline{r}_{t+i} + \gamma^h \underline{v}(\overline{\underline{s}}_{t+h}), \qquad \overline{\rho}_h = R_{t+1} + \sum_{i=2}^h \gamma^{i-1}\overline{r}_{t+i} + \gamma^h \overline{v}(\overline{\underline{s}}_{t+h}),$$

    where $\underline{v}(\overline{\underline{s}}) = \max_a \underline{q}(\overline{\underline{s}},a)$ and similarly for $\overline{v}(\overline{\underline{s}})$.

    The BBI uncertainty is $u_h = \overline{\rho}_h - \underline{\rho}_h$. This approach is distribution-insensitive and relies on conservative bounding, ensuring that no harmful scenario is overlooked. Though potentially conservative, BBI avoids catastrophic failures due to model misestimation. A small illustrative sketch contrasting these measures follows this list.
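
To contrast the measures, here is a small sketch of my own, under simplifying assumptions: MCTV and MCTR are computed from $k$ sampled TD targets, while the BBI-style bound is shown only for a toy case where the horizon-$h$ "box" is an explicit finite set of candidate states plus per-step reward intervals. The actual BBI procedure in the paper propagates interval bounds through the model and value function rather than enumerating states.

```python
import statistics

def mctv(sampled_targets):
    """Monte Carlo Target Variance: sample variance of the k sampled TD targets."""
    return statistics.variance(sampled_targets)

def mctr(sampled_targets):
    """Monte Carlo Target Range: spread between the largest and smallest sampled target."""
    return max(sampled_targets) - min(sampled_targets)

def bbi_bounds_toy(r_obs, reward_intervals, state_box, q_hat, actions, gamma, h):
    """BBI-style target bounds for a toy, enumerable case.

    reward_intervals[i] = (low, high) bounds the model reward at step i = 2..h;
    state_box is an explicit set of candidate states at horizon h.
    Returns (rho_lower, rho_upper); the BBI uncertainty u_h is their gap.
    """
    lo = hi = r_obs
    for i, (r_lo, r_hi) in enumerate(reward_intervals, start=2):
        lo += gamma ** (i - 1) * r_lo
        hi += gamma ** (i - 1) * r_hi
    # v_lower(box) = max_a min_{s in box} q_hat(s, a);
    # v_upper(box) = max_a max_{s in box} q_hat(s, a).
    v_lo = max(min(q_hat(s, b) for s in state_box) for b in actions)
    v_hi = max(max(q_hat(s, b) for s in state_box) for b in actions)
    return lo + gamma ** h * v_lo, hi + gamma ** h * v_hi
```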

Selective MVE Update

Let $u_h$ be the uncertainty measure associated with horizon $h$. We form weights via a softmin:

$$w_h = \frac{\exp(-u_h/\tau)}{\sum_{j=1}^H \exp(-u_j/\tau)},$$

where $\tau>0$ is a temperature parameter. As $\tau \rightarrow 0$, all weight concentrates on the least-uncertain target (typically the one-step target, recovering Q-learning); as $\tau \rightarrow \infty$, the weights become uniform.
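
A minimal sketch of this weighting, assuming `uncertainties` is the list $[u_1, \dots, u_H]$; subtracting the minimum before exponentiating is a standard numerical-stability trick and leaves the weights unchanged.

```python
import math

def softmin_weights(uncertainties, tau):
    """Turn per-horizon uncertainties u_h into weights w_h via a softmin with temperature tau."""
    u_min = min(uncertainties)
    exps = [math.exp(-(u - u_min) / tau) for u in uncertainties]
    z = sum(exps)
    return [e / z for e in exps]

# Example: a very uncertain 3-step target receives almost no weight.
print(softmin_weights([0.0, 0.1, 5.0], tau=0.1))  # ~[0.73, 0.27, 0.00]
```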

Empirical Evaluation

WIP

GoRight Environment

Experimental Setup

Baselines and Models

  • Q-learning:
  • Perfect Model:
  • Expectation Model:
  • Sampling Model:
  • Linear Model:
  • Regression Tree:
  • Neural Network:

Results

Discussion and Conclusions