
R-max – A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning

Brafman, R., and Tennenholtz, M. 2002. R-max – a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3, pp. 213–231.

Info

I replicated R-max and the results presented in "Reinforcement Learning in Finite MDPs: PAC Analysis" by Alexander L. Strehl, Lihong Li, and Michael L. Littman (2009). The R-max agent can be found in this GitHub repository (in Python) and the RiverSwim environment in this GitHub repository.

Warning

This content is a brief summary and includes my personal takeaways from the paper. It reflects the points I found most relevant, along with my own conclusions on the topics discussed.

For a more comprehensive understanding of the topics and to align with the authors' perspectives, please READ THE PAPER.

Introduction

R‑max (Brafman & Tennenholtz, 2002; Strehl, Li & Littman, 2009) is a model‑based reinforcement learning algorithm designed to achieve near‑optimal average reward in polynomial time under the PAC‑MDP framework. It formalizes the principle of optimism under uncertainty, maintaining an optimistic MDP model and updating it as the agent gathers experience.

Problem Setup & PAC‑MDP Framework

We consider a finite MDP $M = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$ with known state and action sets, but unknown transition probabilities $P(s' \mid s, a)$ and rewards $R(s, a)$. Define:

  • Return: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.
  • Policy value: $V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.
  • Optimal value: $V_*(s) = \max_\pi V_\pi(s)$.

In the PAC-MDP model we require that, with probability $1-\delta$, the agent follows an $\epsilon$-optimal policy on all but $\mathrm{poly}\bigl(|\mathcal{S}|, |\mathcal{A}|, 1/\epsilon, 1/(1-\gamma), \ln(1/\delta)\bigr)$ steps.
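For reference, the optimal value function defined above satisfies the Bellman optimality equation (a standard fact, not restated in this form in the papers), and it is exactly what value iteration solves on the induced model in the pseudocode below:

$$V_*(s) = \max_{a \in \mathcal{A}} \Bigl[\, R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V_*(s') \,\Bigr].$$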

How It Works

R-max constructs an induced MDP $\widehat{M}$ over the same states plus a fictitious “unknown” state $s_{\max}$. Each state–action pair $(s, a)$ is initially marked unknown, with:

  • Reward $\widehat{R}(s, a) = R_{\max}$.
  • Transition to $s_{\max}$ with probability 1.

Once $(s, a)$ has been visited $m$ times, we mark it known and set:

$$\widehat{R}(s, a) = \text{empirical mean of observed rewards}, \qquad \widehat{P}(\cdot \mid s, a) = \text{empirical transition frequencies}.$$

This optimistic model “encourages” the agent to explore unknown $(s, a)$ pairs because they appear maximally rewarding.
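To make the construction concrete, here is a minimal Python sketch of how such an optimistic model can be maintained. It is my own illustration, not code from the paper or from the replication repository; the class name `RMaxModel` and its attributes are assumptions of this sketch.

```python
import numpy as np

class RMaxModel:
    """Optimistic empirical model in the spirit of R-max (illustrative sketch).

    Unknown (s, a) pairs lead to a fictitious absorbing state s_max with
    reward R_max, so a planner treats them as maximally rewarding until
    they have been visited m times.
    """

    def __init__(self, n_states, n_actions, r_max, m):
        self.m = m
        self.s_max = n_states                      # index of the fictitious state
        n = n_states + 1                           # real states plus s_max
        self.counts = np.zeros((n_states, n_actions), dtype=int)
        self.reward_sum = np.zeros((n_states, n_actions))
        self.next_counts = np.zeros((n_states, n_actions, n), dtype=int)
        # Optimistic initialization: every (s, a) yields R_max and goes to s_max.
        self.R_hat = np.full((n, n_actions), float(r_max))
        self.P_hat = np.zeros((n, n_actions, n))
        self.P_hat[:, :, self.s_max] = 1.0         # s_max is absorbing

    def known(self, s, a):
        return self.counts[s, a] >= self.m

    def update(self, s, a, r, s_next):
        """Record one transition; returns True if (s, a) just became known."""
        if self.known(s, a):
            return False                           # estimates for (s, a) are frozen
        self.counts[s, a] += 1
        self.reward_sum[s, a] += r
        self.next_counts[s, a, s_next] += 1
        if self.known(s, a):
            self.R_hat[s, a] = self.reward_sum[s, a] / self.m
            self.P_hat[s, a] = self.next_counts[s, a] / self.m
            return True
        return False
```

The planner only ever sees `R_hat` and `P_hat`, so unknown pairs look maximally rewarding exactly as described above.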

Pseudocode

Input: $\epsilon, \delta, R_{\max}, m$. Initialize all $(s, a)$ as unknown; set $\widehat{R}(s, a) = R_{\max}$ and point all transitions to $s_{\max}$. Loop:

  1. Solve $\widehat{M}$ (e.g., via value iteration) to get the greedy policy $\pi$.
  2. Execute $\pi$ for $T$ steps or until a new $(s, a)$ becomes known.
  3. On each visit to $(s, a)$, record the reward and next state; if the visit count reaches $m$, recompute $\widehat{R}, \widehat{P}$ and mark $(s, a)$ known.
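Under the same assumptions as the sketch above (the `RMaxModel` class and a Gym-style environment with `reset()`/`step()` are my own illustrative choices, not the authors' code), the loop can be written roughly as follows:

```python
import numpy as np

def solve_optimistic_mdp(model, gamma, tol=1e-6):
    """Value iteration on the induced MDP; returns a greedy policy (one action per state)."""
    n, n_actions = model.R_hat.shape
    V = np.zeros(n)
    while True:
        Q = model.R_hat + gamma * (model.P_hat @ V)    # shape (n, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)
        V = V_new

def run_rmax(env, n_states, n_actions, r_max, m, gamma, total_steps):
    """Run the R-max loop for a fixed number of steps (illustrative sketch)."""
    model = RMaxModel(n_states, n_actions, r_max, m)   # class from the sketch above
    policy = solve_optimistic_mdp(model, gamma)
    s = env.reset()
    for _ in range(total_steps):
        a = policy[s]
        s_next, reward, done, _ = env.step(a)          # Gym-style API (assumed)
        if model.update(s, a, reward, s_next):
            # Replan only when a state-action pair has just become known.
            policy = solve_optimistic_mdp(model, gamma)
        s = env.reset() if done else s_next
```

Replanning only when a pair becomes known keeps the number of value-iteration calls bounded by $|\mathcal{S}||\mathcal{A}|$, which is part of what makes the overall running time polynomial.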

Formal Guarantees

Theorem (Strehl et al., 2009). With $m = O\!\left(\frac{|\mathcal{S}| R_{\max}^2}{\epsilon^2 (1-\gamma)^2}\bigl(\log(|\mathcal{S}||\mathcal{A}|) + \log(1/\delta)\bigr)\right)$, R-max is PAC-MDP: it follows an $\epsilon$-optimal policy on all but

O({(s,a):U(s,a)V(s)ϵ}ϵ3(1γ)3(S+lnSAδ)Vmax3ln1δln1ϵ(1γ)) O\left(\frac{|\{(s,a):U(s,a)\ge V_*(s)-\epsilon\}|}{\epsilon^3(1-\gamma)^3}\left( S + \ln \frac{SA}{\delta} \right) V_{\text{max}}^3 \ln \frac{1}{\delta} \ln \frac{1}{\epsilon (1-\gamma)} \right)

steps, with probability $1-\delta$.
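As a rough, back-of-the-envelope aid, one can plug numbers into the bound on $m$ stated above; the helper below is my own and ignores the unspecified constant hidden in the $O(\cdot)$ (exposed here as the hypothetical parameter `c`).

```python
import math

def rmax_visit_threshold(n_states, n_actions, r_max, eps, delta, gamma, c=1.0):
    """Visit threshold m suggested by the bound above, up to the constant c."""
    return math.ceil(
        c * n_states * r_max**2 / (eps**2 * (1.0 - gamma)**2)
        * (math.log(n_states * n_actions) + math.log(1.0 / delta))
    )
```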

References

  • Brafman, R. I., & Tennenholtz, M. (2002). R-max – A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 213–231.
  • Strehl, A. L., Li, L., & Littman, M. L. (2009). Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10, 2413–2444.

Feel free to reach out with any questions or comments. Happy experimenting!