
R-max – A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning

Brafman, R., and Tennenholtz, M. 2002. R-max – a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3, pp. 213–231.

Info

I replicated R-max and the results presented in "Reinforcement Learning in Finite MDPs: PAC Analysis" by Alexander L. Strehl, Lihong Li, and Michael L. Littman (2009). The R-max agent can be found in this GitHub repository (in Python) and the RiverSwim environment in this GitHub repository.

Warning

This content is a brief summary and includes my personal takeaways from the paper. It reflects the points I found most relevant, along with my own conclusions on the topics discussed.

For a more comprehensive understanding of the topics and to align with the authors' perspectives, please READ THE PAPER.

Introduction

R‑max (Brafman & Tennenholtz, 2002; Strehl, Li & Littman, 2009) is a model‑based reinforcement learning algorithm designed to achieve near‑optimal average reward in polynomial time under the PAC‑MDP framework. It formalizes the principle of optimism under uncertainty, maintaining an optimistic MDP model and updating it as the agent gathers experience.

Problem Setup & PAC‑MDP Framework

We consider a finite MDP $M = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$ with known state and action sets, but unknown transition probabilities $P(s' \mid s, a)$ and rewards $R(s, a)$. Define:

  • Return: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.
  • Policy value: $V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.
  • Optimal value: $V_*(s) = \max_\pi V_\pi(s)$.

In the PAC-MDP model we require that, with probability $1-\delta$, the agent follows an $\epsilon$-optimal policy on all but $\mathrm{poly}\bigl(|\mathcal{S}|, |\mathcal{A}|, 1/\epsilon, 1/(1-\gamma), \ln(1/\delta)\bigr)$ steps.
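For reference, the optimal value function defined above satisfies the Bellman optimality equation (a standard fact, not restated in this form in the papers), and it is exactly what value iteration solves on the induced model in the pseudocode below:

$$V_*(s) = \max_{a \in \mathcal{A}} \Bigl[\, R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V_*(s') \,\Bigr].$$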

How It Works

R-max constructs an induced MDP $\widehat{M}$ over the same states plus a fictitious “unknown” state $s_{\max}$. Each state–action pair $(s, a)$ is initially marked unknown, with:

  • Reward $\widehat{R}(s, a) = R_{\max}$.
  • Transition to $s_{\max}$ with probability 1.

Once $(s, a)$ has been visited $m$ times, we mark it known and set:

$$\widehat{R}(s, a) = \text{empirical mean of observed rewards}, \qquad \widehat{P}(\cdot \mid s, a) = \text{empirical transition frequencies}.$$

This optimistic model “encourages” the agent to explore unknown $(s, a)$ pairs because they appear maximally rewarding.
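To make the construction concrete, here is a minimal Python sketch of how such an optimistic model can be maintained. It is my own illustration, not code from the paper or from the replication repository; the class name `RMaxModel` and its attributes are assumptions of this sketch.

```python
import numpy as np

class RMaxModel:
    """Optimistic empirical model in the spirit of R-max (illustrative sketch).

    Unknown (s, a) pairs lead to a fictitious absorbing state s_max with
    reward R_max, so a planner treats them as maximally rewarding until
    they have been visited m times.
    """

    def __init__(self, n_states, n_actions, r_max, m):
        self.m = m
        self.s_max = n_states                      # index of the fictitious state
        n = n_states + 1                           # real states plus s_max
        self.counts = np.zeros((n_states, n_actions), dtype=int)
        self.reward_sum = np.zeros((n_states, n_actions))
        self.next_counts = np.zeros((n_states, n_actions, n), dtype=int)
        # Optimistic initialization: every (s, a) yields R_max and goes to s_max.
        self.R_hat = np.full((n, n_actions), float(r_max))
        self.P_hat = np.zeros((n, n_actions, n))
        self.P_hat[:, :, self.s_max] = 1.0         # s_max is absorbing

    def known(self, s, a):
        return self.counts[s, a] >= self.m

    def update(self, s, a, r, s_next):
        """Record one transition; returns True if (s, a) just became known."""
        if self.known(s, a):
            return False                           # estimates for (s, a) are frozen
        self.counts[s, a] += 1
        self.reward_sum[s, a] += r
        self.next_counts[s, a, s_next] += 1
        if self.known(s, a):
            self.R_hat[s, a] = self.reward_sum[s, a] / self.m
            self.P_hat[s, a] = self.next_counts[s, a] / self.m
            return True
        return False
```

The planner only ever sees `R_hat` and `P_hat`, so unknown pairs look maximally rewarding exactly as described above.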

Pseudocode

Input: $\epsilon, \delta, R_{\max}, m$. Initialize all $(s, a)$ as unknown; set $\widehat{R}(s, a) = R_{\max}$ and point all transitions to $s_{\max}$. Loop:

  1. Solve $\widehat{M}$ (e.g., via value iteration) to get the greedy policy $\pi$.
  2. Execute $\pi$ for $T$ steps or until a new $(s, a)$ becomes known.
  3. On each visit to $(s, a)$, record the reward and next state; if the visit count reaches $m$, recompute $\widehat{R}, \widehat{P}$ and mark $(s, a)$ known.
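Under the same assumptions as the sketch above (the `RMaxModel` class and a Gym-style environment with `reset()`/`step()` are my own illustrative choices, not the authors' code), the loop can be written roughly as follows:

```python
import numpy as np

def solve_optimistic_mdp(model, gamma, tol=1e-6):
    """Value iteration on the induced MDP; returns a greedy policy (one action per state)."""
    n, n_actions = model.R_hat.shape
    V = np.zeros(n)
    while True:
        Q = model.R_hat + gamma * (model.P_hat @ V)    # shape (n, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)
        V = V_new

def run_rmax(env, n_states, n_actions, r_max, m, gamma, total_steps):
    """Run the R-max loop for a fixed number of steps (illustrative sketch)."""
    model = RMaxModel(n_states, n_actions, r_max, m)   # class from the sketch above
    policy = solve_optimistic_mdp(model, gamma)
    s = env.reset()
    for _ in range(total_steps):
        a = policy[s]
        s_next, reward, done, _ = env.step(a)          # Gym-style API (assumed)
        if model.update(s, a, reward, s_next):
            # Replan only when a state-action pair has just become known.
            policy = solve_optimistic_mdp(model, gamma)
        s = env.reset() if done else s_next
```

Replanning only when a pair becomes known keeps the number of value-iteration calls bounded by $|\mathcal{S}||\mathcal{A}|$, which is part of what makes the overall running time polynomial.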

Formal Guarantees

Theorem (Strehl et al., 2009). With $m = O\!\left(\frac{|\mathcal{S}| R_{\max}^2}{\epsilon^2 (1-\gamma)^2}\bigl(\log(|\mathcal{S}||\mathcal{A}|) + \log(1/\delta)\bigr)\right)$, R-max is PAC-MDP: it follows an $\epsilon$-optimal policy on all but

O({(s,a):U(s,a)V(s)ϵ}ϵ3(1γ)3(S+lnSAδ)Vmax3ln1δln1ϵ(1γ)) O\left(\frac{|\{(s,a):U(s,a)\ge V_*(s)-\epsilon\}|}{\epsilon^3(1-\gamma)^3}\left( S + \ln \frac{SA}{\delta} \right) V_{\text{max}}^3 \ln \frac{1}{\delta} \ln \frac{1}{\epsilon (1-\gamma)} \right)

steps, with probability $1-\delta$.
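As a rough, back-of-the-envelope aid, one can plug numbers into the bound on $m$ stated above; the helper below is my own and ignores the unspecified constant hidden in the $O(\cdot)$ (exposed here as the hypothetical parameter `c`).

```python
import math

def rmax_visit_threshold(n_states, n_actions, r_max, eps, delta, gamma, c=1.0):
    """Visit threshold m suggested by the bound above, up to the constant c."""
    return math.ceil(
        c * n_states * r_max**2 / (eps**2 * (1.0 - gamma)**2)
        * (math.log(n_states * n_actions) + math.log(1.0 / delta))
    )
```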

References

  • Brafman, R. I., & Tennenholtz, M. (2002). R-max – A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 213–231.
  • Strehl, A. L., Li, L., & Littman, M. L. (2009). Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10, 2413–2444.

Feel free to reach out with any questions or comments. Happy experimenting!