GoRight Environment

Erin J Talvitie, Zilei Shao, Huiying Li, Jinghan Hu, Jacob Boerma, Rory Zhao, and Xintong Wang. "Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning." Reinforcement Learning Journal, vol. 5, 2024, pp. 2440–2460.

Environment Dynamics

Let the state at time $t$ be

$$s_t = \bigl(x_t,\, y_t,\, \mathbf{p}_t\bigr)$$

where

  1. Position $x_t \in \{0,\dots,10\}$, observed as $\tilde{x}_t = x_t + \delta_x$ with $\delta_x \in [-0.25, 0.25]$.
     Actions $a_t \in \{\text{left}, \text{right}\}$ shift $x_{t+1} = x_t \pm 1$.
  2. Status indicator $y_t \in \{0, 5, 10\}$, observed as $\tilde{y}_t = y_t + \delta_y$ with $\delta_y \in [-1.25, 1.25]$.
     Follows a deterministic 2-step Markov cycle (see Talvitie et al., 2024).
  3. Prize indicators $\mathbf{p}_t \in \{0, 1\}^N$ (with $N = 2$ in GoRight), observed as $\tilde{p}^i_t = p^i_t + \delta_p$ with $\delta_p \in [-0.25, 0.25]$.
     All zeros except at $x_t = 10$, where:
     • If $y_t = 10$ upon entering $x_t = 10$, then $\mathbf{p}_t = [1,\dots,1]$.
     • Otherwise the bits cycle in sequence.
  4. Reward:
     $$r_t = \begin{cases} -1, & a_t=\text{right}\ (\text{unless } x_t=10 \text{ and } \mathbf{p}_t=\mathbf{1}),\\ +3, & a_t=\text{right}\ \wedge\ x_t=10\ \wedge\ \mathbf{p}_t=\mathbf{1},\\ 0, & a_t=\text{left}. \end{cases}$$
     (See the code sketch below for these dynamics.)
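
The following is a minimal, illustrative Python sketch of these dynamics, not a faithful reimplementation: the class and method names (`GoRightEnv`, `reset`, `step`) are made up for this page, and the exact status-indicator transition and prize-light sequence are simplified placeholders; see Talvitie et al. (2024) for the precise definitions.

```python
class GoRightEnv:
    """Illustrative sketch of the GoRight dynamics described above.

    The status-indicator transition and the prize-light cycling are
    simplified placeholders; Talvitie et al. (2024) give the exact rules.
    """

    N_PRIZE = 2    # number of prize indicator lights
    MAX_POS = 10   # rightmost (goal) cell

    def __init__(self):
        self.reset()

    def reset(self):
        self.x = 0                   # position
        self.y_prev, self.y = 0, 0   # status indicator (2-step dependence)
        self.prizes = [0] * self.N_PRIZE
        return self._obs()

    def step(self, action):
        """action is "left" or "right"; returns (observation, reward)."""
        # Reward is defined on the state in which the action is taken.
        if action == "left":
            reward = 0.0
        elif self.x == self.MAX_POS and all(self.prizes):
            reward = 3.0
        else:
            reward = -1.0

        # Position update (clipped to the corridor).
        prev_x = self.x
        if action == "right":
            self.x = min(self.x + 1, self.MAX_POS)
        else:
            self.x = max(self.x - 1, 0)

        # Status indicator: deterministic 2-step cycle over {0, 5, 10}
        # (placeholder rule; the paper specifies the exact transition).
        self.y_prev, self.y = self.y, {0: 5, 5: 10, 10: 0}[self.y_prev]

        # Prize indicators: all zeros except in the goal cell.
        if self.x == self.MAX_POS:
            if prev_x < self.MAX_POS and self.y == 10:
                self.prizes = [1] * self.N_PRIZE        # lights fully on
            elif not all(self.prizes):
                self.prizes = self._cycle(self.prizes)  # hint: cycle the lights
        else:
            self.prizes = [0] * self.N_PRIZE

        return self._obs(), reward

    def _cycle(self, prizes):
        # Placeholder: light one indicator at a time, in sequence.
        i = prizes.index(1) if 1 in prizes else -1
        nxt = [0] * self.N_PRIZE
        nxt[(i + 1) % self.N_PRIZE] = 1
        return nxt

    def _obs(self):
        # Noiseless observation, matching what the demo displays.
        return (float(self.x), float(self.y), [float(p) for p in self.prizes])
```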

Although the underlying MDP is purely discrete, we add small observation noise on each dimension to encourage robustness to continuous inputs. For this demo, we expose only the noiseless values.
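As a small illustration of that noise model, the helper below adds independent uniform noise to each observed dimension, using the half-widths listed above (±0.25 for position and prize bits, ±1.25 for the status indicator); the function name and the use of Python's `random` module are just for this sketch.

```python
import random

# Half-widths of the uniform observation noise, from the ranges above.
NOISE = {"position": 0.25, "status": 1.25, "prize": 0.25}

def noisy_obs(x, y, prizes, rng=random):
    """Return a noisy version of a noiseless (x, y, prizes) observation."""
    x_obs = x + rng.uniform(-NOISE["position"], NOISE["position"])
    y_obs = y + rng.uniform(-NOISE["status"], NOISE["status"])
    p_obs = [p + rng.uniform(-NOISE["prize"], NOISE["prize"]) for p in prizes]
    return x_obs, y_obs, p_obs
```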

GoRight presents a classic sparse-reward, delayed-gratification challenge:

  • The agent must discover that the prize lights turn fully on only when the status indicator hits its maximum at the moment the agent steps into the goal cell.
  • All other visits to $x = 10$ merely cycle the lights, providing a small hint but no immediate large reward.
  • Once the prize lights are on, repeatedly taking "right" yields $+3$ per step, rewarding the agent for staying and exploiting.

Algorithms must balance exploration (to enter the goal cell exactly when the status indicator reads 10) and exploitation (to harvest the $+3$ reward once found).
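
To make the trade-off concrete, here is a short rollout helper, built on the hypothetical `GoRightEnv` sketch above, that accumulates the discounted return of a fixed policy; the discount factor and horizon are arbitrary choices for illustration.

```python
def rollout_return(env, policy, gamma=0.9, horizon=200):
    """Roll out a policy and accumulate its discounted return."""
    obs = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        obs, reward = env.step(policy(obs))
        ret += discount * reward
        discount *= gamma
    return ret

# Naive baseline: always move right. It pays -1 per step until it happens
# to enter the goal cell while the status indicator reads 10.
print(rollout_return(GoRightEnv(), lambda obs: "right"))
```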

Interactive Demo

Use the buttons below to:

  • Step left/right
  • Reset the environment

Try finding the optimal policy!
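
If you would rather poke at the dynamics programmatically, the loop below mimics the demo's controls using the hypothetical `GoRightEnv` sketch from earlier; it is an illustration only, not part of the demo itself.

```python
# Console analogue of the demo buttons:
# type "left", "right", or "reset" (anything else quits).
env = GoRightEnv()
obs = env.reset()
while True:
    cmd = input(f"obs={obs} > ").strip()
    if cmd == "reset":
        obs = env.reset()
    elif cmd in ("left", "right"):
        obs, reward = env.step(cmd)
        print(f"reward={reward}")
    else:
        break
```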

(Interactive GoRight demo widget: displays the discount factor, current position, prize lights, step count, last reward, and discounted return.)