
Q-Learning

In the previous post, we laid the groundwork by exploring Markov Decision Processes (MDPs) and exact solution methods like value iteration and policy iteration. In this post, we'll build on those concepts to delve into deep Q-learning, an essential component of deep reinforcement learning.

Recap of the Previous Post

In the first post, we covered the basics of optimal planning or control within an MDP framework. An MDP consists of a set of states, actions, a probabilistic transition model, a reward function, a discount factor $\gamma$, and a horizon $H$. The objective is to find the optimal policy $\pi^*$ that maximizes expected rewards over time.

We introduced two exact methods to achieve this:

  • Value Iteration: Iteratively updates the value of each state until convergence.
  • Policy Iteration: Alternates between evaluating a policy and improving it until the optimal policy is found.

These methods are powerful but come with limitations:

  1. Both methods require knowledge of the environment's dynamics, i.e., the probability of transitioning to the next state given the current state and action. However, in many real-world scenarios, the agent doesn't have access to this information and must learn from interaction with the environment.

  2. These methods involve looping over all possible states and actions. For complex environments, the number of states and actions can be enormous, making such exhaustive loops impractical.

Given these limitations, the focus of the remaining posts, including this one, will shift to more scalable approaches. Specifically, we'll explore how to approximate solutions using sampling-based methods and function fitting instead of relying on tabular methods.

Sampling-Based Approximations:

  • The agent collects experiences by interacting with the environment and uses these experiences to approximate the optimal policy.
  • This approach circumvents the need for a complete dynamics model by relying on observed transitions.

Function Fitting:

  • Instead of storing values or actions for every possible state in a table, we use a function, often represented by a neural network, to approximate these values or actions.
  • This function can generalize across states, making it feasible to handle large or continuous state spaces.

Tabular Q-Learning

$Q^*(s,a)$ = expected utility starting in $s$, taking action $a$, and acting optimally.

Bellman Equation:

$$Q^*(s,a) = \sum_{s'} P(s'|s,a)\big(R(s,a,s') + \gamma \cdot \max_{a'} Q^*(s',a')\big)$$

(Tabular) Q-Learning

Q-value iteration:

$$Q_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a)\big(R(s,a,s') + \gamma \cdot \max_{a'} Q_k(s',a')\big)$$

Rewrite as an expectation:

$$Q_{k+1}(s,a) \leftarrow \mathbb{E}_{s' \sim P(s'|s,a)}\big[R(s,a,s') + \gamma \cdot \max_{a'} Q_k(s',a')\big]$$

So the agent can learn from sampled experience:

  • For a state-action pair $(s,a)$, receive a sample next state: $s' \sim P(s'|s,a)$

  • Consider your old estimate: $Q_k(s,a)$

  • Consider your new sample estimate:

    $$\text{target}(s') = R(s,a,s') + \gamma \cdot \max_{a'} Q_k(s',a')$$

  • Incorporate the new estimate into a running average (see the sketch below):

    $$Q_{k+1}(s,a) \leftarrow (1-\alpha)\cdot Q_k(s,a) + \alpha \cdot \text{target}(s')$$
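
A minimal Python sketch of this sampled update, assuming transitions arrive as `(s, a, r, s_next)` tuples from a small discrete environment (the sizes and step size below are illustrative, not from the original post):

```python
import numpy as np

n_states, n_actions = 16, 4          # illustrative sizes
Q = np.zeros((n_states, n_actions))  # tabular Q-values, Q_k(s, a)
alpha, gamma = 0.1, 0.99             # learning rate and discount factor

def q_update(s, a, r, s_next):
    """Fold one observed transition (s, a, r, s') into the running average."""
    target = r + gamma * np.max(Q[s_next])            # target(s')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target  # Q_{k+1}(s, a)
```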


How to sample actions?

  • $\varepsilon$-Greedy: choose a random action with probability $\varepsilon$ (exploration); otherwise choose the greedy action (exploitation). See the sketch below.
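
A small sketch, assuming the tabular `Q` array from the sketch above:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])  # explore: uniformly random action
    return int(np.argmax(Q[s]))               # exploit: current best action
```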

Q-Learning Properties

  • Q-Learning converges to the optimal policy even if you're acting suboptimally.
  • This is called off-policy learning.
  • Caveats:
    • You have to explore enough.
    • You have to eventually make the learning rate ($\alpha$) small enough.
      • Otherwise, the latest experiences will make you hop around too much with every update.
    • … but not decrease it too quickly.
      • Otherwise, you won't update enough.
  • Technical requirements:
    • All states and actions are visited infinitely often
      • Basically, in the limit, it doesn't matter how you select actions.
    • Learning rate schedule such that for all state-action pairs $(s,a)$:
$$\sum_{t=0}^{\infty} \alpha_t(s,a) = \infty \qquad\qquad \sum_{t=0}^{\infty} \alpha_t^2(s,a) < \infty$$
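
For example, $\alpha_t(s,a) = 1/N_t(s,a)$, where $N_t(s,a)$ counts how often $(s,a)$ has been visited so far, satisfies both conditions: the harmonic series diverges while $\sum_t 1/t^2$ converges.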

Can Tabular Methods Scale?

Continuous environments (even with crude discretization) and discrete environments such as Tetris and Atari have enormous state spaces, so it isn't practical to store a table that large (Tetris has on the order of $10^{60}$ states; a humanoid has on the order of $10^{100}$).

Approximate Q-Learning

Instead of a table, we have a parametrized Q-function: $Q_\theta(s,a)$

  • Can be a linear function in features.
  • Or a neural net, decision tree, etc.

Learning rule:

  • Remember:
$$\text{target}(s') = R(s,a,s') + \gamma \cdot \max_{a'} Q_{\theta_k}(s',a')$$
  • Update (see the sketch below):
$$\theta_{k+1} \leftarrow \theta_k - \alpha \nabla_{\theta}\left[\tfrac{1}{2}\big(Q_{\theta}(s,a) - \text{target}(s')\big)^2\right]\Big|_{\theta=\theta_k}$$
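
As a concrete (hypothetical) instance, here is this update for a linear Q-function $Q_\theta(s,a) = \theta_a^\top \phi(s)$, where `phi` is an assumed feature map; the target is treated as a constant, matching the formula above:

```python
import numpy as np

def approx_q_update(theta, phi, s, a, r, s_next, alpha=0.01, gamma=0.99):
    """One semi-gradient step on 0.5 * (Q_theta(s,a) - target(s'))^2.

    theta has shape (n_actions, n_features); phi(s) returns shape (n_features,).
    """
    target = r + gamma * np.max(phi(s_next) @ theta.T)   # target(s'), held fixed
    td_error = phi(s) @ theta[a] - target                # Q_theta(s,a) - target(s')
    theta[a] -= alpha * td_error * phi(s)                # loss gradient wrt theta_a is td_error * phi(s)
    return theta
```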

Deep Q Networks (DQN)

Source: https://doi.org/10.48550/arXiv.1312.5602


DQN Details

  • Uses the Huber loss instead of a squared loss on the Bellman error (see the sketch after this list):
$$L_{\delta}(a) = \begin{cases} \frac{1}{2}a^2 & \text{for } |a| \le \delta, \\ \delta\big(|a| - \frac{1}{2}\delta\big) & \text{otherwise.} \end{cases}$$
  • Uses RMSProp instead of vanilla SGD.
  • It helps to anneal the exploration rate.
    • Start $\varepsilon$ at 1 and anneal it to 0.1 or 0.05 over the first million frames.
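
A small NumPy sketch of the Huber loss above, with `a` the Bellman error (PyTorch's `torch.nn.HuberLoss` computes the same function if you'd rather not hand-roll it):

```python
import numpy as np

def huber_loss(a, delta=1.0):
    """Quadratic for |a| <= delta, linear (with matching slope) beyond it."""
    abs_a = np.abs(a)
    return np.where(abs_a <= delta, 0.5 * abs_a**2, delta * (abs_a - 0.5 * delta))
```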

ATARI Network Architecture

  • Convolutional neural network architecture:
    • History of frames as input.
    • One output per action: the expected reward for that action, $Q(s,a)$.
    • The final results used a slightly bigger network (3 convolutional layers + 1 fully-connected hidden layer), sketched below.

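A minimal PyTorch sketch of such a network; the layer sizes below are illustrative and chosen to match the commonly cited 3-conv + 1 fully-connected layout, with a history of four 84×84 frames as input:

```python
import torch
import torch.nn as nn

class AtariDQN(nn.Module):
    """Conv net mapping a stack of frames to one Q-value per action."""

    def __init__(self, n_actions: int, in_frames: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature maps for 84x84 inputs
            nn.Linear(512, n_actions),              # one output per action: Q(s, a)
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, in_frames, 84, 84), pixel values scaled to [0, 1]
        return self.net(frames)

q_values = AtariDQN(n_actions=6)(torch.zeros(1, 4, 84, 84))  # -> shape (1, 6)
```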

Updates since DeepMind's DQN

Double DQN

  • There is an upward bias in $\max_{a} Q(s,a;\theta)$.
  • DQN maintains two sets of weights, $\theta$ and $\theta^-$, so reduce the bias by using:
    • $\theta$ for selecting the best action.
    • $\theta^-$ for evaluating the best action.
  • Double DQN loss (see the sketch below):
$$L_i(\theta_i) = \mathbb{E}_{s,a,s',r \sim D}\Big[\big(r + \gamma\, Q(s', \operatorname*{argmax}_{a'} Q(s',a';\theta_i); \theta_i^-) - Q(s,a;\theta_i)\big)^2\Big]$$
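
A short sketch of computing the Double DQN target, assuming two networks like the `AtariDQN` above: the online weights $\theta$ pick the action, the target weights $\theta^-$ score it:

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, gamma=0.99):
    """Batched target: r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta^-)."""
    with torch.no_grad():
        best_a = q_net(next_states).argmax(dim=1, keepdim=True)        # select with theta
        next_q = target_net(next_states).gather(1, best_a).squeeze(1)  # evaluate with theta^-
    return rewards + gamma * next_q  # terminal-state masking omitted for brevity
```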

Prioritized Experience Replay

  • Replaying all transitions with equal probability is highly suboptimal.
  • Replay transitions in proportion to the absolute Bellman error (see the sketch below):
$$\big|r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\big|$$
  • This leads to much faster learning.
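
A minimal sketch of proportional prioritization over a replay buffer; real implementations also use a priority exponent and importance-sampling corrections, which are omitted here:

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, eps=1e-6):
    """Return replay-buffer indices sampled proportionally to |TD error|."""
    priorities = np.abs(td_errors) + eps    # eps keeps every transition sampleable
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```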

See also

  • “Rainbow: Combining Improvements in Deep Reinforcement Learning”, Matteo Hessel et al, 2017.
    • Double DQN (DDQN)
    • Prioritized Replay DDQN
    • Dueling DQN
    • Distributional DQN
    • Noisy DQN