Double GoRight Environment

This environment extends the GoRight (Talvitie et al., 2024) demo by introducing a second prize location farther to the right.

Environment Dynamics

Let the state at time $t$ be

$$s_t = \bigl(x_t,\, y_t,\, \mathbf{p}_t\bigr)$$

where:

  1. Position $x_t \in \{1, \dots, 21\}$, observed as
     $\tilde x_t = x_t + \delta_x$, with $\delta_x \in [-0.25, 0.25]$.
     Actions $a_t \in \{\text{left}, \text{right}\}$ move the agent by $\pm 1$,
     clamped so that $x_{t+1} \in \{1, \dots, 21\}$.

  2. Status indicator $y_t \in \{0, 5, 10\}$, observed as
     $\tilde y_t = y_t + \delta_y$, with $\delta_y \in [-1.25, 1.25]$.
     It follows the same deterministic two-step cycle as in GoRight.

  3. Prize indicators $\mathbf{p}_t \in \{0, 1\}^2$, observed as
     $\tilde p^i_t = p^i_t + \delta_p$, with $\delta_p \in [-0.25, 0.25]$.
     Both lights are zero except when the agent “triggers” a prize location:

     • First prize at $x = 10$: if the agent enters $x = 10$ with $y_t = 10$,
       then $\mathbf{p}_t = [1, 1]$.
     • Second prize at $x = 20$: similarly, entering $x = 20$ with $y_t = 10$
       sets $\mathbf{p}_t = [1, 1]$.
     • Otherwise, while the agent sits at a prize position without $y_t = 10$,
       the lights cycle: $[1, 0] \to [0, 1] \to [1, 0] \to \cdots$.
  4. Rewards:

     $$r_t = \begin{cases} 0, & a_t = \text{left},\\ -1, & a_t = \text{right},\ \text{unless at a prize with lights on},\\ +3, & a_t = \text{right}\ \wedge\ x_t = 10\ \wedge\ \mathbf{p}_t = [1, 1],\\ +6, & a_t = \text{right}\ \wedge\ x_t = 20\ \wedge\ \mathbf{p}_t = [1, 1]. \end{cases}$$

     When the agent takes the rewarded “right” action at either prize, it stays
     in place and continues to receive the same reward on subsequent right
     actions (see the sketch below).
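
Putting these rules together, here is a minimal Python sketch of the dynamics. It is illustrative rather than the demo's actual implementation: the exact status-indicator table from GoRight is not reproduced here (a simple 0 → 5 → 10 cycle stands in for it), and the convention that the status peak is checked at the moment of stepping in is my reading of the rules above.

```python
import numpy as np


class DoubleGoRight:
    """Illustrative sketch of the Double GoRight dynamics described above."""

    LEFT, RIGHT = 0, 1
    PRIZE_POSITIONS = (10, 20)
    PRIZE_REWARDS = {10: 3.0, 20: 6.0}

    def __init__(self):
        self.reset()

    def reset(self):
        self.x = 1                       # position in {1, ..., 21}
        self.y = 0                       # status indicator in {0, 5, 10}
        self.p = np.zeros(2, dtype=int)  # two prize lights
        return (self.x, self.y, self.p.copy())

    def _next_status(self, y):
        # Placeholder: GoRight's status indicator follows a deterministic
        # rule over its previous two values (Talvitie et al., 2024); a
        # simple 0 -> 5 -> 10 cycle stands in for the exact table here.
        return {0: 5, 5: 10, 10: 0}[y]

    def step(self, action):
        prev_x, prev_y = self.x, self.y
        lights_on = bool(self.p.all())

        # Reward is determined by the pre-transition state and the action.
        if action == self.LEFT:
            reward = 0.0
        elif prev_x in self.PRIZE_POSITIONS and lights_on:
            reward = self.PRIZE_REWARDS[prev_x]  # +3 at x=10, +6 at x=20
        else:
            reward = -1.0

        # Position: the rewarded "right" at a lit prize keeps the agent in
        # place; otherwise move by +/-1, clamped to {1, ..., 21}.
        if action == self.RIGHT and prev_x in self.PRIZE_POSITIONS and lights_on:
            self.x = prev_x
        else:
            dx = 1 if action == self.RIGHT else -1
            self.x = int(np.clip(prev_x + dx, 1, 21))

        self.y = self._next_status(prev_y)

        # Prize lights: both turn on when entering a prize cell at the status
        # peak; sitting at a prize cell without the peak alternates a single
        # lit light; elsewhere both lights are off.
        if self.x not in self.PRIZE_POSITIONS:
            self.p[:] = 0
        elif self.x != prev_x and prev_y == 10:
            self.p[:] = 1
        elif not lights_on:
            self.p = np.array([0, 1]) if self.p[0] else np.array([1, 0])

        return (self.x, self.y, self.p.copy()), reward
```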

Although the MDP is discrete, we add small observation noise to encourage
robustness; the demo exposes the noiseless values for clarity.
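
For instance, the noisy observation can be produced from the noiseless state as follows (a small sketch; the uniform noise widths are taken from the ranges above, and treating the noise as independent uniform draws is an assumption):

```python
import numpy as np


def observe(state, rng):
    """Map the noiseless state (x, y, p) to a noisy observation."""
    x, y, p = state
    x_obs = x + rng.uniform(-0.25, 0.25)          # position noise
    y_obs = y + rng.uniform(-1.25, 1.25)          # status-indicator noise
    p_obs = p + rng.uniform(-0.25, 0.25, size=2)  # per-light noise
    return x_obs, y_obs, p_obs


# Example: rng = np.random.default_rng(0); observe((1, 0, np.zeros(2)), rng)
```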

Double GoRight requires the agent to discover two “hidden” prize locations:

  • First hurdle (at $x = 10$): the agent must hit the status peak exactly when stepping in to unlock the repeatable +3 reward.
  • Second hurdle (at $x = 20$): once past the first, the agent faces a farther goal, again tied to the same status-peak condition, but now yielding +6.

This layered setup tests exploration strategies that must plan through two sparse rewards, where the nearer +3 prize acts as a distractor from the larger +6 prize.
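
As a quick illustration, rolling out a naive always-right policy with the `DoubleGoRight` sketch above shows the -1 penalties piling up when the status peak is not timed (the discount factor 0.99 and the 50-step horizon are arbitrary choices for this example):

```python
env = DoubleGoRight()
env.reset()
gamma, ret = 0.99, 0.0
for t in range(50):
    _, reward = env.step(env.RIGHT)
    ret += gamma**t * reward  # accumulate the discounted return
print(f"Discounted return of always-right: {ret:.2f}")
```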

Interactive Demo

Use the buttons below to:

  • Step left/right
  • Reset the environment

Try finding the optimal policy!

[Interactive Double GoRight widget: displays the discount factor, per-step reward, status indicator, prize indicators, step count, and discounted return.]