R-max – A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning
I replicated R-max and the results presented in the paper by Alexander L. Strehl, Lihong Li, and Michael L. Littman (2009), "Reinforcement Learning in Finite MDPs: PAC Analysis". The R-max agent (written in Python) can be found in this GitHub repository, and the RiverSwim environment in this GitHub repository.
This content is a brief summary and includes my personal takeaways from the paper. It reflects the points I found most relevant, along with my own conclusions on the topics discussed.
For a more comprehensive understanding of the topics and to align with the authors' perspectives, please READ THE PAPER.
Introduction
R‑max (Brafman & Tennenholtz, 2002; Strehl, Li & Littman, 2009) is a model‑based reinforcement learning algorithm designed to achieve near‑optimal average reward in polynomial time under the PAC‑MDP framework. It formalizes the principle of optimism under uncertainty, maintaining an optimistic MDP model and updating it as the agent gathers experience.
Problem Setup & PAC‑MDP Framework
We consider a finite MDP with known state and action sets $S$ and $A$, but unknown transition function $T$ and reward function $R$. Define:
- Return: $\sum_{t=0}^{\infty} \gamma^{t} r_{t}$, where $\gamma \in [0, 1)$ is the discount factor.
- Policy value: $V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s, \pi\right]$.
- Optimal value: $V^{*}(s) = \max_{\pi} V^{\pi}(s)$.
In the PAC-MDP model we require that, with probability at least $1-\delta$, the agent follows an $\epsilon$-optimal policy (one whose value is within $\epsilon$ of $V^{*}$) on all but a number of steps polynomial in $S$, $A$, $1/\epsilon$, $1/\delta$, and $1/(1-\gamma)$.
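To make these definitions concrete, here is a minimal tabular policy-evaluation sketch in Python. The array layout (`T` as an $S \times A \times S$ tensor, `R` as an $S \times A$ matrix) and all names are my own illustration, not from the paper or the repository.

```python
import numpy as np

def policy_value(T, R, policy, gamma, tol=1e-8):
    """Iterative policy evaluation: V^pi(s) = E[ sum_t gamma^t r_t | s_0 = s, pi ].

    T: (S, A, S) transition probabilities, R: (S, A) expected rewards,
    policy: length-S integer array of deterministic actions, gamma in [0, 1).
    """
    n_states = T.shape[0]
    states = np.arange(n_states)
    V = np.zeros(n_states)
    while True:
        # One Bellman backup under the fixed policy.
        V_new = R[states, policy] + gamma * (T[states, policy] @ V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

The loop converges geometrically at rate $\gamma$, which is why the $1/(1-\gamma)$ horizon factor keeps reappearing in the bounds below.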
How It Works
R‑max constructs an induced MDP $\hat{M}$ over the same states plus a fictitious, absorbing “unknown” state $\bar{s}$. Each state–action pair $(s,a)$ is initially marked unknown, with:
- Reward $\hat{R}(s,a) = R_{\max}$, the maximum possible reward.
- Transition to $\bar{s}$ with probability 1.
Once $(s,a)$ has been visited $m$ times, we mark it known and set $\hat{R}(s,a)$ and $\hat{T}(\cdot \mid s,a)$ to their empirical (maximum-likelihood) estimates.
This optimistic model “encourages” the agent to explore unknown state–action pairs because they appear maximally rewarding.
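Below is a minimal sketch of this optimistic model, assuming tabular counts with the fictitious state appended as the last index; the class and method names are my own illustration, not the authors' reference implementation.

```python
import numpy as np

class RMaxModel:
    """Sketch of R-max's induced MDP. Unknown (s, a) pairs lead to a fictitious
    absorbing state with reward R_max; once a pair has been observed m times it
    switches to its empirical estimates."""

    def __init__(self, n_states, n_actions, r_max, m):
        self.m, self.r_max = m, r_max
        self.counts = np.zeros((n_states, n_actions), dtype=int)
        self.reward_sum = np.zeros((n_states, n_actions))
        self.next_counts = np.zeros((n_states, n_actions, n_states))

    def known(self, s, a):
        return self.counts[s, a] >= self.m

    def observe(self, s, a, r, s_next):
        """Record one real transition for an unknown pair (known pairs stay frozen)."""
        if not self.known(s, a):
            self.counts[s, a] += 1
            self.reward_sum[s, a] += r
            self.next_counts[s, a, s_next] += 1

    def induced_mdp(self):
        """Return (T_hat, R_hat) over n_states + 1 states; the last index is the
        fictitious state, which self-loops with maximal reward."""
        n, k = self.counts.shape
        T_hat = np.zeros((n + 1, k, n + 1))
        R_hat = np.full((n + 1, k), self.r_max)   # optimistic by default
        T_hat[n, :, n] = 1.0                      # fictitious state absorbs
        for s in range(n):
            for a in range(k):
                if self.known(s, a):
                    T_hat[s, a, :n] = self.next_counts[s, a] / self.counts[s, a]
                    R_hat[s, a] = self.reward_sum[s, a] / self.counts[s, a]
                else:
                    T_hat[s, a, n] = 1.0          # unknown pairs jump to the fictitious state
        return T_hat, R_hat
```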
Input: the state and action sets, $R_{\max}$, and the known-ness threshold $m$. Initialize all $(s,a)$ as unknown; set $\hat{R}(s,a) = R_{\max}$ and all transitions $(s,a) \to \bar{s}$. Loop (sketched in code after this list):
- Solve $\hat{M}$ (e.g., via value iteration) to get the greedy policy $\hat{\pi}$.
- Execute $\hat{\pi}$ for a fixed number of steps or until a new $(s,a)$ becomes known.
- On each visit to an unknown $(s,a)$, record the reward and next state; if its visit count reaches $m$, recompute $\hat{M}$ and mark $(s,a)$ known.
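Putting the loop together, here is a sketch under the same assumptions, using the `RMaxModel` above and a hypothetical environment exposing `reset()` and `step(a)` that returns `(next_state, reward, done)`.

```python
import numpy as np

def greedy_policy(T_hat, R_hat, gamma, tol=1e-6):
    """Value iteration on the induced MDP; returns the greedy policy."""
    V = np.zeros(R_hat.shape[0])
    while True:
        # Q[s, a] = R_hat[s, a] + gamma * sum_s' T_hat[s, a, s'] * V[s']
        Q = R_hat + gamma * (T_hat @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)
        V = V_new

def run_rmax(env, model, gamma, total_steps):
    """R-max interaction loop: act greedily w.r.t. the optimistic model and
    replan whenever some (s, a) pair becomes known."""
    s = env.reset()
    policy, replan = None, True
    for _ in range(total_steps):
        if replan:
            T_hat, R_hat = model.induced_mdp()
            policy = greedy_policy(T_hat, R_hat, gamma)
            replan = False
        a = policy[s]
        s_next, r, done = env.step(a)
        was_known = model.known(s, a)
        model.observe(s, a, r, s_next)
        if not was_known and model.known(s, a):
            replan = True                 # a new (s, a) just became known
        s = env.reset() if done else s_next
```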
Formal Guarantees
Theorem (Strehl et al., 2009). With $m = O\!\left(\frac{S + \ln(SA/\delta)}{\epsilon^{2}(1-\gamma)^{4}}\right)$, R‑max is PAC‑MDP: it follows an $\epsilon$-optimal policy on all but
$$\tilde{O}\!\left(\frac{S^{2}A}{\epsilon^{3}(1-\gamma)^{6}}\right)$$
steps, with probability at least $1-\delta$ (up to logarithmic factors).
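For a rough feel of the magnitudes, the snippet below plugs RiverSwim-sized numbers (6 states, 2 actions) into the leading terms of the bound, dropping the constants and log factors hidden by the $O$-notation; treat the output as order-of-magnitude only.

```python
import math

S, A = 6, 2                      # RiverSwim: 6 states, 2 actions
eps, delta, gamma = 0.1, 0.05, 0.95

# Leading terms only; the O-notation hides constants and log factors.
m = (S + math.log(S * A / delta)) / (eps**2 * (1 - gamma)**4)
steps = S**2 * A / (eps**3 * (1 - gamma)**6)

print(f"m (visits per pair)    ~ {m:,.0f}")
print(f"non-eps-optimal steps  ~ {steps:,.0f}")
```

The theoretical $m$ is extremely conservative; in practice it is typically set to a small constant, and the bound mainly illustrates how steeply the sample complexity grows with the effective horizon $1/(1-\gamma)$.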
References
- Brafman, R. I. & Tennenholtz, M. (2002). R-max – A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. JMLR, 3:213–231.
- Strehl, A. L., Li, L. & Littman, M. L. (2009). Reinforcement Learning in Finite MDPs: PAC Analysis. JMLR, 10:2413–2444.
Feel free to reach out with any questions or comments. Happy experimenting!