Double GoRight Environment

This environment extends the GoRight (Talvitie et al., 2024) demo by introducing a second prize location farther to the right.

Environment Dynamics

Let the state at time $t$ be

$$s_t = \bigl(x_t,\, y_t,\, \mathbf{p}_t\bigr)$$

where:

  1. Position $x_t \in \{1, \dots, 21\}$, observed as
     $\tilde x_t = x_t + \delta_x$, with $\delta_x \in [-0.25, 0.25]$.
     Actions $a_t \in \{\text{left}, \text{right}\}$ move the agent by $\pm 1$,
     clamped so that $x_{t+1} \in \{1, \dots, 21\}$.

  2. Status indicator $y_t \in \{0, 5, 10\}$, observed as
     $\tilde y_t = y_t + \delta_y$, with $\delta_y \in [-1.25, 1.25]$.
     It follows the same deterministic two-step cycle as in GoRight.

  3. Prize indicators $\mathbf{p}_t \in \{0, 1\}^2$, observed as
     $\tilde p^i_t = p^i_t + \delta_p$, with $\delta_p \in [-0.25, 0.25]$.
     Both lights are zero except when the agent “triggers” a prize location:

     • First prize at $x = 10$: if the agent enters $x = 10$ with $y_t = 10$,
       then $\mathbf{p}_t = [1, 1]$.
     • Second prize at $x = 20$: similarly, entering $x = 20$ with $y_t = 10$
       sets $\mathbf{p}_t = [1, 1]$.
     • Otherwise, while the agent sits at a prize position without $y_t = 10$,
       the lights cycle: $[1, 0] \to [0, 1] \to [1, 0] \to \cdots$.
  4. Rewards:

     $$r_t = \begin{cases} 0, & a_t = \text{left},\\ -1, & a_t = \text{right},\ \text{unless at a prize with lights on},\\ +3, & a_t = \text{right}\ \wedge\ x_t = 10\ \wedge\ \mathbf{p}_t = [1, 1],\\ +6, & a_t = \text{right}\ \wedge\ x_t = 20\ \wedge\ \mathbf{p}_t = [1, 1]. \end{cases}$$

     When the agent takes the rewarded “right” action at either prize, it stays
     in place and continues to receive the same reward on subsequent right
     actions (see the sketch below).
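
Putting these rules together, here is a minimal Python sketch of the dynamics. It is illustrative rather than the demo's actual implementation: the exact status-indicator table from GoRight is not reproduced here (a simple 0 → 5 → 10 cycle stands in for it), and the convention that the status peak is checked at the moment of stepping in is my reading of the rules above.

```python
import numpy as np


class DoubleGoRight:
    """Illustrative sketch of the Double GoRight dynamics described above."""

    LEFT, RIGHT = 0, 1
    PRIZE_POSITIONS = (10, 20)
    PRIZE_REWARDS = {10: 3.0, 20: 6.0}

    def __init__(self):
        self.reset()

    def reset(self):
        self.x = 1                       # position in {1, ..., 21}
        self.y = 0                       # status indicator in {0, 5, 10}
        self.p = np.zeros(2, dtype=int)  # two prize lights
        return (self.x, self.y, self.p.copy())

    def _next_status(self, y):
        # Placeholder: GoRight's status indicator follows a deterministic
        # rule over its previous two values (Talvitie et al., 2024); a
        # simple 0 -> 5 -> 10 cycle stands in for the exact table here.
        return {0: 5, 5: 10, 10: 0}[y]

    def step(self, action):
        prev_x, prev_y = self.x, self.y
        lights_on = bool(self.p.all())

        # Reward is determined by the pre-transition state and the action.
        if action == self.LEFT:
            reward = 0.0
        elif prev_x in self.PRIZE_POSITIONS and lights_on:
            reward = self.PRIZE_REWARDS[prev_x]  # +3 at x=10, +6 at x=20
        else:
            reward = -1.0

        # Position: the rewarded "right" at a lit prize keeps the agent in
        # place; otherwise move by +/-1, clamped to {1, ..., 21}.
        if action == self.RIGHT and prev_x in self.PRIZE_POSITIONS and lights_on:
            self.x = prev_x
        else:
            dx = 1 if action == self.RIGHT else -1
            self.x = int(np.clip(prev_x + dx, 1, 21))

        self.y = self._next_status(prev_y)

        # Prize lights: both turn on when entering a prize cell at the status
        # peak; sitting at a prize cell without the peak alternates a single
        # lit light; elsewhere both lights are off.
        if self.x not in self.PRIZE_POSITIONS:
            self.p[:] = 0
        elif self.x != prev_x and prev_y == 10:
            self.p[:] = 1
        elif not lights_on:
            self.p = np.array([0, 1]) if self.p[0] else np.array([1, 0])

        return (self.x, self.y, self.p.copy()), reward
```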

Although the MDP is discrete, we add small observation noise to encourage
robustness; the demo exposes the noiseless values for clarity.
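
For instance, the noisy observation can be produced from the noiseless state as follows (a small sketch; the uniform noise widths are taken from the ranges above, and treating the noise as independent uniform draws is an assumption):

```python
import numpy as np


def observe(state, rng):
    """Map the noiseless state (x, y, p) to a noisy observation."""
    x, y, p = state
    x_obs = x + rng.uniform(-0.25, 0.25)          # position noise
    y_obs = y + rng.uniform(-1.25, 1.25)          # status-indicator noise
    p_obs = p + rng.uniform(-0.25, 0.25, size=2)  # per-light noise
    return x_obs, y_obs, p_obs


# Example: rng = np.random.default_rng(0); observe((1, 0, np.zeros(2)), rng)
```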

Double GoRight requires the agent to discover two “hidden” prize locations:

  • First hurdle (at $x = 10$): the agent must hit the status peak exactly when stepping in to unlock the repeatable +3 reward.
  • Second hurdle (at $x = 20$): once past the first, the agent faces a farther goal, again tied to the same status-peak condition, but now yielding +6.

This layered setup tests exploration strategies that must plan through two sparse rewards, where the nearer +3 prize acts as a distractor from the larger +6 prize.
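
As a quick illustration, rolling out a naive always-right policy with the `DoubleGoRight` sketch above shows the -1 penalties piling up when the status peak is not timed (the discount factor 0.99 and the 50-step horizon are arbitrary choices for this example):

```python
env = DoubleGoRight()
env.reset()
gamma, ret = 0.99, 0.0
for t in range(50):
    _, reward = env.step(env.RIGHT)
    ret += gamma**t * reward  # accumulate the discounted return
print(f"Discounted return of always-right: {ret:.2f}")
```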

Interactive Demo

Use the buttons below to:

  • Step left/right
  • Reset the environment

Try finding the optimal policy!

[Interactive Double GoRight widget: displays the discount factor, per-step reward, status indicator, prize indicators, step count, and discounted return.]