Double GoRight Environment
This environment extends the GoRight (Talvitie et al., 2024) demo by introducing a second prize location farther to the right.
Environment Dynamics
Let the state at time $t$ be $s_t = (x_t, y_t, z_t)$, where:
- Position $x_t \in \{0, 1, \dots, N\}$, observed as $x_t$ plus small uniform noise. Actions move the agent by $\pm 1$, clamped so that $0 \le x_{t+1} \le N$.
- Status indicator $y_t \in \{0, 5, 10\}$, observed with small uniform noise. Follows the same deterministic 2-step cycle as in GoRight.
- Prize indicators $z_t = (z_t^{(1)}, z_t^{(2)}) \in \{0, 1\}^2$, observed with small uniform noise. These two lights are always zero except when the agent “triggers” a prize location:
  - First prize at $x = p_1$: if the agent enters with the status at its peak ($y_t = 10$), then $z_{t+1} = (1, 1)$.
  - Second prize at $x = p_2 > p_1$: similarly, entering with $y_t = 10$ sets $z_{t+1} = (1, 1)$.
  - Otherwise, when the agent sits at a prize position without having triggered it, the lights cycle through a fixed deterministic pattern.
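To make the dynamics concrete, here is a minimal Python sketch of one transition. The grid length `N`, the prize positions `P1` and `P2`, the simplified three-step status cycle (a stand-in for GoRight's 2-step cycle), and the enter-at-the-peak timing convention are all illustrative assumptions, not the demo's exact specification:

```python
N = 20           # rightmost cell (assumed)
P1, P2 = 10, 20  # prize locations (assumed; the demo may use other values)
STATUS_CYCLE = [0, 5, 10]  # illustrative stand-in for the deterministic 2-step cycle

def step(pos, status_idx, lights, action):
    """One transition of the sketch. action is -1 (left) or +1 (right)."""
    status = STATUS_CYCLE[status_idx]
    if pos in (P1, P2) and lights == (1, 1) and action == +1:
        new_pos = pos  # rewarded "right" at a triggered prize: agent stays put
    else:
        new_pos = max(0, min(N, pos + action))
    if new_pos != pos and new_pos in (P1, P2):
        # Entering a prize cell: the lights turn on only at the status peak.
        lights = (1, 1) if status == 10 else (0, 0)
    elif new_pos not in (P1, P2):
        lights = (0, 0)
    return new_pos, (status_idx + 1) % len(STATUS_CYCLE), lights
```

For example, stepping right from position 9 while the status reads 10 enters the first prize cell with both lights on, after which further right actions leave the agent in place.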
Rewards
When the agent takes the rewarded “right” action at either prize, it stays
in place and continues to receive the same reward on subsequent rights.
Although the MDP is discrete, we add small observation noise to encourage
robustness; the demo exposes the noiseless values for clarity.
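The reward rule can be sketched as follows. The prize positions `P1` and `P2` and the zero reward everywhere else are assumptions for illustration; the +3 and +6 magnitudes come from the two hurdles described next:

```python
P1, P2 = 10, 20  # hypothetical prize positions

def reward(pos, lights, action):
    """+3 (first prize) or +6 (second) per "right" action at a triggered
    prize; 0 everywhere else (assumed base reward)."""
    if action == +1 and lights == (1, 1):
        if pos == P1:
            return 3.0
        if pos == P2:
            return 6.0
    return 0.0
```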
Double GoRight requires the agent to discover two “hidden” prize locations:
- First hurdle (at $x = p_1$): you must hit the status peak exactly when stepping in to unlock the first reward of +3, collected repeatedly.
- Second hurdle (at $x = p_2$): once past the first, you face a farther goal, again tied to the same status-peak condition, but now yielding +6.
This layered setup tests exploration strategies that can plan through two sparse rewards, where the nearer prize acts as a distraction from the farther, larger one.
Interactive Demo
Use the buttons below to:
- Step left/right
- Reset the environment
Try finding the optimal policy!