
Q-Learning

In the previous post, we laid the groundwork by exploring Markov Decision Processes (MDPs) and exact solution methods like value iteration and policy iteration. In this post, we'll build on those concepts to delve into deep Q-learning, an essential component of deep reinforcement learning.

Recap of the Previous Post

In the first post, we covered the basics of optimal planning or control within an MDP framework. An MDP consists of a set of states, actions, a probabilistic transition model, a reward function, a discount factor $\gamma$, and a horizon $H$. The objective is to find the optimal policy $\pi^*$ that maximizes expected rewards over time.

We introduced two exact methods to achieve this:

  • Value Iteration: Iteratively updates the value of each state until convergence.
  • Policy Iteration: Alternates between evaluating a policy and improving it until the optimal policy is found.

These methods are powerful but come with limitations:

  1. Both methods require knowledge of the environment's dynamics, i.e., the probability of transitioning to the next state given the current state and action. However, in many real-world scenarios, the agent doesn't have access to this information and must learn from interaction with the environment.

  2. These methods involve looping over all possible states and actions. For complex environments, the number of states and actions can be enormous, making such exhaustive loops impractical.

Given these limitations, the focus of the remaining posts, including this one, will shift to more scalable approaches. Specifically, we'll explore how to approximate solutions using sampling-based methods and function fitting instead of relying on tabular methods.

Sampling-Based Approximations:

  • The agent collects experiences by interacting with the environment and uses these experiences to approximate the optimal policy.
  • This approach circumvents the need for a complete dynamics model by relying on observed transitions.

Function Fitting:

  • Instead of storing values or actions for every possible state in a table, we use a function, often represented by a neural network, to approximate these values or actions.
  • This function can generalize across states, making it feasible to handle large or continuous state spaces.

Tabular Q-Learning

$Q^*(s,a)$ = expected utility starting in $s$, taking action $a$, and acting optimally.

Bellman Equation:

$$Q^*(s,a) = \sum_{s'} P(s'|s,a)\big(R(s,a,s') + \gamma \cdot \max_{a'} Q^*(s',a')\big)$$

(Tabular) Q-Learning

Q-value iteration:

$$Q_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a)\big(R(s,a,s') + \gamma \cdot \max_{a'} Q_k(s',a')\big)$$

Rewrite as an expectation:

$$Q_{k+1}(s,a) \leftarrow \mathbb{E}_{s' \sim P(s'|s,a)}\big[R(s,a,s') + \gamma \cdot \max_{a'} Q_k(s',a')\big]$$

So the agent can learn from sampled experience:

  • For a state-action pair $(s,a)$, receive a sample next state: $s' \sim P(s'|s,a)$

  • Consider your old estimate: $Q_k(s,a)$

  • Consider your new sample estimate:

    $$\text{target}(s') = R(s,a,s') + \gamma \cdot \max_{a'} Q_k(s',a')$$

  • Incorporate the new estimate into a running average (see the sketch below):

    $$Q_{k+1}(s,a) \leftarrow (1-\alpha)\cdot Q_k(s,a) + \alpha \cdot \text{target}(s')$$
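
A minimal Python sketch of this sampled update, assuming transitions arrive as `(s, a, r, s_next)` tuples from a small discrete environment (the sizes and step size below are illustrative, not from the original post):

```python
import numpy as np

n_states, n_actions = 16, 4          # illustrative sizes
Q = np.zeros((n_states, n_actions))  # tabular Q-values, Q_k(s, a)
alpha, gamma = 0.1, 0.99             # learning rate and discount factor

def q_update(s, a, r, s_next):
    """Fold one observed transition (s, a, r, s') into the running average."""
    target = r + gamma * np.max(Q[s_next])            # target(s')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target  # Q_{k+1}(s, a)
```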


How to sample actions?

  • $\varepsilon$-Greedy: choose a random action with probability $\varepsilon$ (exploration); otherwise choose the greedy action (exploitation). See the sketch below.
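
A small sketch, assuming the tabular `Q` array from the sketch above:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])  # explore: uniformly random action
    return int(np.argmax(Q[s]))               # exploit: current best action
```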

Q-Learning Properties

  • Q-Learning converges to the optimal policy even if you're acting suboptimally.
  • This is called off-policy learning.
  • Caveats:
    • You have to explore enough.
    • You have to eventually make the learning rate ($\alpha$) small enough.
      • Otherwise, the latest experiences will make you hop around too much with every update.
    • … but not decrease it too quickly.
      • Otherwise, you won't update enough.
  • Technical requirements:
    • All states and actions are visited infinitely often
      • Basically, in the limit, it doesn't matter how you select actions.
    • Learning rate schedule such that for all state-action pairs $(s,a)$:
$$\sum_{t=0}^{\infty} \alpha_t(s,a) = \infty \qquad\qquad \sum_{t=0}^{\infty} \alpha_t^2(s,a) < \infty$$
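
For example, $\alpha_t(s,a) = 1/N_t(s,a)$, where $N_t(s,a)$ counts how often $(s,a)$ has been visited so far, satisfies both conditions: the harmonic series diverges while $\sum_t 1/t^2$ converges.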

Can Tabular Methods Scale?

Continuous environments (even with crude discretization) and discrete environments such as Tetris and Atari have enormous state spaces, so it isn't practical to store a table that large (Tetris has on the order of $10^{60}$ states; a humanoid has on the order of $10^{100}$).

Approximate Q-Learning

Instead of a table, we have a parametrized Q-function: $Q_\theta(s,a)$

  • Can be a linear function in features.
  • Or a neural net, decision tree, etc.

Learning rule:

  • Remember:
$$\text{target}(s') = R(s,a,s') + \gamma \cdot \max_{a'} Q_{\theta_k}(s',a')$$
  • Update (see the sketch below):
$$\theta_{k+1} \leftarrow \theta_k - \alpha \nabla_{\theta}\left[\tfrac{1}{2}\big(Q_{\theta}(s,a) - \text{target}(s')\big)^2\right]\Big|_{\theta=\theta_k}$$
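
As a concrete (hypothetical) instance, here is this update for a linear Q-function $Q_\theta(s,a) = \theta_a^\top \phi(s)$, where `phi` is an assumed feature map; the target is treated as a constant, matching the formula above:

```python
import numpy as np

def approx_q_update(theta, phi, s, a, r, s_next, alpha=0.01, gamma=0.99):
    """One semi-gradient step on 0.5 * (Q_theta(s,a) - target(s'))^2.

    theta has shape (n_actions, n_features); phi(s) returns shape (n_features,).
    """
    target = r + gamma * np.max(phi(s_next) @ theta.T)   # target(s'), held fixed
    td_error = phi(s) @ theta[a] - target                # Q_theta(s,a) - target(s')
    theta[a] -= alpha * td_error * phi(s)                # loss gradient wrt theta_a is td_error * phi(s)
    return theta
```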

Deep Q Networks (DQN)

Source: https://doi.org/10.48550/arXiv.1312.5602


DQN Details

  • Uses the Huber loss instead of a squared loss on the Bellman error (see the sketch after this list):
$$L_{\delta}(a) = \begin{cases} \frac{1}{2}a^2 & \text{for } |a| \le \delta, \\ \delta\big(|a| - \frac{1}{2}\delta\big) & \text{otherwise.} \end{cases}$$
  • Uses RMSProp instead of vanilla SGD.
  • It helps to anneal the exploration rate.
    • Start $\varepsilon$ at 1 and anneal it to 0.1 or 0.05 over the first million frames.
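
A small NumPy sketch of the Huber loss above, with `a` the Bellman error (PyTorch's `torch.nn.HuberLoss` computes the same function if you'd rather not hand-roll it):

```python
import numpy as np

def huber_loss(a, delta=1.0):
    """Quadratic for |a| <= delta, linear (with matching slope) beyond it."""
    abs_a = np.abs(a)
    return np.where(abs_a <= delta, 0.5 * abs_a**2, delta * (abs_a - 0.5 * delta))
```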

ATARI Network Architecture

  • Convolutional neural network architecture:
    • History of frames as input.
    • One output per action: the expected reward for that action, $Q(s,a)$.
    • The final results used a slightly bigger network (3 convolutional layers + 1 fully-connected hidden layer), sketched below.

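A minimal PyTorch sketch of such a network; the layer sizes below are illustrative and chosen to match the commonly cited 3-conv + 1 fully-connected layout, with a history of four 84×84 frames as input:

```python
import torch
import torch.nn as nn

class AtariDQN(nn.Module):
    """Conv net mapping a stack of frames to one Q-value per action."""

    def __init__(self, n_actions: int, in_frames: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature maps for 84x84 inputs
            nn.Linear(512, n_actions),              # one output per action: Q(s, a)
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, in_frames, 84, 84), pixel values scaled to [0, 1]
        return self.net(frames)

q_values = AtariDQN(n_actions=6)(torch.zeros(1, 4, 84, 84))  # -> shape (1, 6)
```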

Updates since DeepMind's DQN

Double DQN

  • There is an upward bias in $\max_{a} Q(s,a;\theta)$.
  • DQN maintains two sets of weights, $\theta$ and $\theta^-$, so reduce the bias by using:
    • $\theta$ for selecting the best action.
    • $\theta^-$ for evaluating the best action.
  • Double DQN loss (see the sketch below):
$$L_i(\theta_i) = \mathbb{E}_{s,a,s',r \sim D}\Big[\big(r + \gamma\, Q(s', \operatorname*{argmax}_{a'} Q(s',a';\theta_i); \theta_i^-) - Q(s,a;\theta_i)\big)^2\Big]$$
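
A short sketch of computing the Double DQN target, assuming two networks like the `AtariDQN` above: the online weights $\theta$ pick the action, the target weights $\theta^-$ score it:

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, gamma=0.99):
    """Batched target: r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta^-)."""
    with torch.no_grad():
        best_a = q_net(next_states).argmax(dim=1, keepdim=True)        # select with theta
        next_q = target_net(next_states).gather(1, best_a).squeeze(1)  # evaluate with theta^-
    return rewards + gamma * next_q  # terminal-state masking omitted for brevity
```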

Prioritized Experience Replay

  • Replaying all transitions with equal probability is highly suboptimal.
  • Replay transitions in proportion to the absolute Bellman error (see the sketch below):
$$\big|r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\big|$$
  • This leads to much faster learning.
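
A minimal sketch of proportional prioritization over a replay buffer; real implementations also use a priority exponent and importance-sampling corrections, which are omitted here:

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, eps=1e-6):
    """Return replay-buffer indices sampled proportionally to |TD error|."""
    priorities = np.abs(td_errors) + eps    # eps keeps every transition sampleable
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```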

See also

  • “Rainbow: Combining Improvements in Deep Reinforcement Learning”, Matteo Hessel et al, 2017.
    • Double DQN (DDQN)
    • Prioritized Replay DDQN
    • Dueling DQN
    • Distributional DQN
    • Noisy DQN