GoRight Environment
Environment Dynamics
Let the state at time $t$ be $s_t = (x_t, \sigma_t, p^1_t, p^2_t)$, where
- Position $x_t \in \{0, 1, \dots, 10\}$, observed as $\tilde{x}_t = x_t + \epsilon^x_t$ for a small noise term $\epsilon^x_t$.
  Actions shift the position: $x_{t+1} = \mathrm{clip}(x_t \pm 1,\ 0,\ 10)$.
- Status indicator $\sigma_t \in \{0, 5, 10\}$, observed as $\tilde{\sigma}_t = \sigma_t + \epsilon^\sigma_t$.
  It follows a deterministic 2-step Markov cycle (see Talvitie et al. (2024)).
- Prize indicators $p^1_t, p^2_t \in \{0, 1\}$ in GoRight, observed as $\tilde{p}^i_t = p^i_t + \epsilon^{p,i}_t$.
  All zeros except at $x_t = 10$, where:
  - If $\sigma_t = 10$ upon entering $x = 10$, then $p^1_t = p^2_t = 1$.
  - Otherwise the bits cycle in sequence.
- Reward: $r_t = +3$ for taking the right action at $x_t = 10$ while both prize bits are lit; $0$ otherwise.
Although the underlying MDP is purely discrete, we add small observation noise on each dimension to encourage robustness to continuous inputs. For this demo, we expose only the noiseless values.
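Concretely, these dynamics can be sketched as a small step/reset environment. The snippet below is a minimal illustration, not the reference implementation: the exact 2-step status table and the prize-light cycling order are specified in Talvitie et al. (2024), so the `STATUS_CYCLE` table and `_cycle_prizes` ordering here are stand-ins, and the start position, noise scale, and the timing of the status check on entry are assumptions.

```python
import numpy as np

# Deterministic 2-step Markov cycle for the status indicator: the next value
# depends on the previous two values. This particular table is an illustrative
# stand-in; the exact pattern is defined in Talvitie et al. (2024).
STATUS_CYCLE = {
    (0, 0): 5, (5, 5): 10, (10, 10): 0,   # entry points into the main cycle
    (0, 5): 10, (5, 10): 0, (10, 0): 10,
    (0, 10): 5, (10, 5): 0, (5, 0): 5,
}


class GoRightSketch:
    """Illustrative GoRight dynamics; a sketch, not the reference implementation."""

    N_POS = 11                  # positions 0..10; x = 10 is the goal cell
    STATUS_VALUES = (0, 5, 10)

    def __init__(self, noise_scale=0.25, seed=None):
        self.noise_scale = noise_scale    # observation-noise scale (assumed)
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.x = 0                        # start position (assumed)
        self.prev_status, self.status = 0, 0
        self.prizes = [0, 0]
        return self._observe()

    def step(self, action):
        """action: 0 = left, 1 = right. Returns (observation, reward)."""
        # +3 only for pressing "right" at the goal cell while the lights are on.
        reward = 3.0 if (action == 1 and self.x == self.N_POS - 1
                         and self.prizes == [1, 1]) else 0.0

        entering_goal = (action == 1 and self.x == self.N_POS - 2)
        self.x = min(max(self.x + (1 if action == 1 else -1), 0), self.N_POS - 1)

        # Advance the status indicator along its deterministic 2-step cycle.
        nxt = STATUS_CYCLE[(self.prev_status, self.status)]
        self.prev_status, self.status = self.status, nxt

        # Prize lights: all zeros away from the goal cell. Entering x = 10 while
        # the status indicator reads its maximum turns them fully on (we check the
        # post-transition status; the exact timing is a detail of the original spec).
        # Otherwise they cycle; the cycling order is also an illustrative stand-in.
        if self.x == self.N_POS - 1:
            if entering_goal and self.status == 10:
                self.prizes = [1, 1]
            elif self.prizes != [1, 1]:
                self._cycle_prizes()
        else:
            self.prizes = [0, 0]

        return self._observe(), reward

    def _cycle_prizes(self):
        order = [[0, 0], [1, 0], [0, 1]]
        self.prizes = order[(order.index(self.prizes) + 1) % len(order)]

    def _observe(self):
        # The underlying MDP is discrete; small uniform noise is added to each
        # observation dimension (the demo page itself shows noiseless values).
        state = np.array([self.x, self.status, *self.prizes], dtype=float)
        noise = self.rng.uniform(-self.noise_scale, self.noise_scale, size=state.shape)
        return state + noise
```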
GoRight presents a classic sparse-reward, delayed-gratification challenge:
- The agent must discover that the prize lights go fully on only when the status indicator is at its maximum as it steps into the goal cell.
- All other visits to $x = 10$ merely cycle the lights, providing a tiny hint but no immediate large reward.
- Once the true prize lights are on, each further “right” move yields $+3$, rewarding the agent for staying and exploiting.
Algorithms must balance exploration (to hit status $\sigma = 10$ at the correct time) and exploitation (to harvest the $+3$ reward once found).
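To make the timing aspect concrete, here is a hand-scripted rollout against the sketch above (it reuses `GoRightSketch` and `STATUS_CYCLE` from the previous snippet). The policy "cheats" by reading the true state and the stand-in cycle table so that it only steps into the goal cell when the status will be at its maximum on arrival; a learning agent would have to infer that timing from noisy observations.

```python
def scripted_rollout(env, horizon=60, gamma=0.9):
    """Roll out a timing-aware scripted policy and return its discounted return.

    Cheats by reading the true state and the stand-in STATUS_CYCLE table,
    purely to illustrate why timing the entry into x = 10 matters.
    """
    env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        if env.x == env.N_POS - 1 and env.prizes == [1, 1]:
            action = 1                                   # exploit: +3 per "right"
        elif env.x == env.N_POS - 2:
            # Enter the goal only if the status will read 10 on arrival;
            # otherwise step away and come back (wait for the right moment).
            action = 1 if STATUS_CYCLE[(env.prev_status, env.status)] == 10 else 0
        else:
            action = 1                                   # head toward the goal
        _, reward = env.step(action)
        ret += discount * reward
        discount *= gamma
    return ret


print(scripted_rollout(GoRightSketch(seed=0)))           # a positive discounted return
```

A naive "always go right" policy, by contrast, typically enters $x = 10$ at the wrong moment, leaves the lights cycling, and never collects the $+3$.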
Interactive Demo
Use the buttons below to:
- Step left/right
- Reset the environment
Try finding the optimal policy!
[GoRight demo widget. Readouts: Discount, Reward, Current, Prizes, Steps, Last Reward, Discounted Return.]
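Outside the browser widget, the same interaction pattern (step left/right, reset) can be driven from a terminal against the sketch above. This loop is only a hypothetical convenience wrapper, and the discount value it uses is assumed.

```python
def interactive_loop(gamma=0.9):
    """Tiny text-mode stand-in for the demo buttons (gamma here is an assumed value)."""
    env = GoRightSketch(seed=0)           # sketch class from the dynamics section
    steps, disc_return, discount = 0, 0.0, 1.0
    while True:
        cmd = input("[l]eft / [r]ight / re[s]et / [q]uit > ").strip().lower()
        if cmd == "q":
            break
        if cmd == "s":
            env.reset()
            steps, disc_return, discount = 0, 0.0, 1.0
            continue
        if cmd not in ("l", "r"):
            continue
        obs, reward = env.step(1 if cmd == "r" else 0)
        disc_return += discount * reward
        discount *= gamma
        steps += 1
        print(f"step {steps}: obs={np.round(obs, 2)}  "
              f"last reward={reward:+.1f}  discounted return={disc_return:.2f}")
```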