GoRight Environment

Erin J Talvitie, Zilei Shao, Huiying Li, Jinghan Hu, Jacob Boerma, Rory Zhao, and Xintong Wang. "Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning." Reinforcement Learning Journal, vol. 5, 2024, pp. 2440–2460.

Environment Dynamics

Let the state at time $t$ be

$$s_t = \bigl(x_t,\, y_t,\, \mathbf{p}_t\bigr)$$

where

  1. Position $x_t \in \{0,\dots,10\}$, observed as $\tilde{x}_t = x_t + \delta_x$ with $\delta_x \in [-0.25, 0.25]$.
     Actions $a_t \in \{\text{left}, \text{right}\}$ shift $x_{t+1} = x_t \pm 1$.
  2. Status indicator $y_t \in \{0, 5, 10\}$, observed as $\tilde{y}_t = y_t + \delta_y$ with $\delta_y \in [-1.25, 1.25]$.
     Follows a deterministic 2-step Markov cycle (see Talvitie et al., 2024).
  3. Prize indicators $\mathbf{p}_t \in \{0, 1\}^N$ (with $N = 2$ in GoRight), observed as $\tilde{p}^i_t = p^i_t + \delta_p$ with $\delta_p \in [-0.25, 0.25]$.
     All zeros except at $x_t = 10$, where:
     • If $y_t = 10$ upon entering $x_t = 10$, then $\mathbf{p}_t = [1,\dots,1]$.
     • Otherwise the bits cycle in sequence.
  4. Reward:
     $$r_t = \begin{cases} -1, & a_t=\text{right}\ (\text{unless } x_t=10 \text{ and } \mathbf{p}_t=\mathbf{1}),\\ +3, & a_t=\text{right}\ \wedge\ x_t=10\ \wedge\ \mathbf{p}_t=\mathbf{1},\\ 0, & a_t=\text{left}. \end{cases}$$
     (See the code sketch below for these dynamics.)
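
The following is a minimal, illustrative Python sketch of these dynamics, not a faithful reimplementation: the class and method names (`GoRightEnv`, `reset`, `step`) are made up for this page, and the exact status-indicator transition and prize-light sequence are simplified placeholders; see Talvitie et al. (2024) for the precise definitions.

```python
class GoRightEnv:
    """Illustrative sketch of the GoRight dynamics described above.

    The status-indicator transition and the prize-light cycling are
    simplified placeholders; Talvitie et al. (2024) give the exact rules.
    """

    N_PRIZE = 2    # number of prize indicator lights
    MAX_POS = 10   # rightmost (goal) cell

    def __init__(self):
        self.reset()

    def reset(self):
        self.x = 0                   # position
        self.y_prev, self.y = 0, 0   # status indicator (2-step dependence)
        self.prizes = [0] * self.N_PRIZE
        return self._obs()

    def step(self, action):
        """action is "left" or "right"; returns (observation, reward)."""
        # Reward is defined on the state in which the action is taken.
        if action == "left":
            reward = 0.0
        elif self.x == self.MAX_POS and all(self.prizes):
            reward = 3.0
        else:
            reward = -1.0

        # Position update (clipped to the corridor).
        prev_x = self.x
        if action == "right":
            self.x = min(self.x + 1, self.MAX_POS)
        else:
            self.x = max(self.x - 1, 0)

        # Status indicator: deterministic 2-step cycle over {0, 5, 10}
        # (placeholder rule; the paper specifies the exact transition).
        self.y_prev, self.y = self.y, {0: 5, 5: 10, 10: 0}[self.y_prev]

        # Prize indicators: all zeros except in the goal cell.
        if self.x == self.MAX_POS:
            if prev_x < self.MAX_POS and self.y == 10:
                self.prizes = [1] * self.N_PRIZE        # lights fully on
            elif not all(self.prizes):
                self.prizes = self._cycle(self.prizes)  # hint: cycle the lights
        else:
            self.prizes = [0] * self.N_PRIZE

        return self._obs(), reward

    def _cycle(self, prizes):
        # Placeholder: light one indicator at a time, in sequence.
        i = prizes.index(1) if 1 in prizes else -1
        nxt = [0] * self.N_PRIZE
        nxt[(i + 1) % self.N_PRIZE] = 1
        return nxt

    def _obs(self):
        # Noiseless observation, matching what the demo displays.
        return (float(self.x), float(self.y), [float(p) for p in self.prizes])
```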

Although the underlying MDP is purely discrete, we add small observation noise on each dimension to encourage robustness to continuous inputs. For this demo, we expose only the noiseless values.
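As a small illustration of that noise model, the helper below adds independent uniform noise to each observed dimension, using the half-widths listed above (±0.25 for position and prize bits, ±1.25 for the status indicator); the function name and the use of Python's `random` module are just for this sketch.

```python
import random

# Half-widths of the uniform observation noise, from the ranges above.
NOISE = {"position": 0.25, "status": 1.25, "prize": 0.25}

def noisy_obs(x, y, prizes, rng=random):
    """Return a noisy version of a noiseless (x, y, prizes) observation."""
    x_obs = x + rng.uniform(-NOISE["position"], NOISE["position"])
    y_obs = y + rng.uniform(-NOISE["status"], NOISE["status"])
    p_obs = [p + rng.uniform(-NOISE["prize"], NOISE["prize"]) for p in prizes]
    return x_obs, y_obs, p_obs
```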

GoRight presents a classic sparse-reward, delayed-gratification challenge:

  • The agent must discover that the prize lights turn fully on only when the status indicator hits its maximum at the moment the agent steps into the goal cell.
  • All other visits to $x = 10$ merely cycle the lights, providing a small hint but no immediate large reward.
  • Once the prize lights are on, repeatedly taking "right" yields $+3$ per step, rewarding the agent for staying and exploiting.

Algorithms must balance exploration (to enter the goal cell exactly when the status indicator reads 10) and exploitation (to harvest the $+3$ reward once found).
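
To make the trade-off concrete, here is a short rollout helper, built on the hypothetical `GoRightEnv` sketch above, that accumulates the discounted return of a fixed policy; the discount factor and horizon are arbitrary choices for illustration.

```python
def rollout_return(env, policy, gamma=0.9, horizon=200):
    """Roll out a policy and accumulate its discounted return."""
    obs = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        obs, reward = env.step(policy(obs))
        ret += discount * reward
        discount *= gamma
    return ret

# Naive baseline: always move right. It pays -1 per step until it happens
# to enter the goal cell while the status indicator reads 10.
print(rollout_return(GoRightEnv(), lambda obs: "right"))
```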

Interactive Demo

Use the buttons below to:

  • Step left/right
  • Reset the environment

Try finding the optimal policy!
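
If you would rather poke at the dynamics programmatically, the loop below mimics the demo's controls using the hypothetical `GoRightEnv` sketch from earlier; it is an illustration only, not part of the demo itself.

```python
# Console analogue of the demo buttons:
# type "left", "right", or "reset" (anything else quits).
env = GoRightEnv()
obs = env.reset()
while True:
    cmd = input(f"obs={obs} > ").strip()
    if cmd == "reset":
        obs = env.reset()
    elif cmd in ("left", "right"):
        obs, reward = env.step(cmd)
        print(f"reward={reward}")
    else:
        break
```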

(Interactive GoRight demo widget: displays the discount factor, current position, prize lights, step count, last reward, and discounted return.)