Introduction to Reinforcement Learning
In this chapter you'll be introduced to:
- Reinforcement Learning (RL): Understanding the fundamentals of RL and its significance.
- K-armed bandit problem: A simplified RL scenario used to explain more complex concepts.
- Action-value estimation: Methods for estimating the value of actions to make informed decisions.
- Exploration-exploitation trade-off: The dilemma between choosing the best-known action and exploring new actions that might yield higher rewards.
What is Reinforcement Learning?
Reinforcement Learning (RL) is a field of artificial intelligence that studies problems where an agent learns to make decisions by performing actions in an environment to achieve a goal. At its core, RL is about an agent learning a policy (a mapping from observations of the environment to actions) to maximize some notion of cumulative reward over time. Unlike supervised learning, where agents are given labeled examples, RL agents must discover the optimal actions through trial and error.
You can think of an RL agent as a student trying to get the best grade on the midterm: when it tries a study strategy and gets a good grade, it is more likely to repeat that strategy on the final, and when a strategy fails, it avoids it next time. More formally, agents learn by interacting with an environment and receiving feedback in the form of rewards.
The k-Armed Bandit Problem
A classic example to illustrate decision-making under uncertainty is the k-armed bandit problem. Imagine a slot machine (or "bandit") with multiple levers (arms). Each lever provides a reward drawn from a probability distribution unique to that lever. The goal is to maximize the total reward over time by choosing which lever to pull.
Try pulling each lever to see if you can find out which one is the best:
[Interactive widget: 3-Armed Bandit]
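If you'd rather experiment in code than with the widget above, here is a minimal sketch of a k-armed bandit, assuming each arm pays a reward drawn from a Gaussian with a hidden mean; the class name and reward distributions are illustrative choices, not part of the interactive demo.

```python
import random

class KArmedBandit:
    """A k-armed bandit where each arm's reward is drawn from a
    Gaussian with a hidden mean (illustrative assumption)."""

    def __init__(self, k=3, seed=0):
        self.rng = random.Random(seed)
        # Hidden true action values, one per arm.
        self.true_values = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, arm):
        """Return a noisy reward for the chosen arm."""
        return self.rng.gauss(self.true_values[arm], 1.0)

# Pull each lever a couple of times and track the total reward.
bandit = KArmedBandit(k=3)
total_reward = 0.0
for arm in [0, 1, 2, 0, 1, 2]:
    total_reward += bandit.pull(arm)
print(f"Total reward after 6 pulls: {total_reward:.2f}")
```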
The key components are:
- Actions ($A_t$): The levers to pull.
- Rewards ($R_t$): The payoff received after taking an action at time $t$.
- Action Value ($q_*(a)$): The expected reward of taking action $a$.
The action value is defined as:
$$q_*(a) = \mathbb{E}\left[R_t \mid A_t = a\right]$$
In practice, you don't know $q_*(a)$ upfront; you only learn it by pulling arm $a$ repeatedly and averaging the outcomes.
Estimating Action Values
To maximize rewards, the agent needs to estimate the value of each action. Since the true action values $q_*(a)$ are unknown, the agent must learn them from experience.
Sample-Average Method
One simple method to estimate action values is the sample-average method, where the estimated value $Q_n(a)$ is the average of the rewards received over the $n$ times action $a$ was taken:
$$Q_n(a) = \frac{R_1 + R_2 + \cdots + R_n}{n}$$
(where $R_i$ is the $i$-th reward observed when choosing action $a$).
Suppose arm 3 has been pulled twice and the two rewards sum to 8, so the current estimate is
$$Q_2(3) = \frac{R_1 + R_2}{2} = \frac{8}{2} = 4.$$
If you pull arm 3 a third time and observe $R_3 = 1$, the update becomes
$$Q_3(3) = \frac{R_1 + R_2 + R_3}{3} = \frac{8 + 1}{3} = 3.$$
Notice how the new estimate shifts downward because the third reward (1) is below the previous average (4).
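In code, the sample-average estimate is just the running mean of the rewards stored for each arm. The sketch below mirrors the worked example; the individual rewards (5 and 3) are assumed values chosen only so that they average to 4.

```python
# Rewards observed so far for arm 3 (assumed values; the example above
# only tells us that they average to 4).
rewards_for_arm_3 = [5.0, 3.0]

def sample_average(rewards):
    """Q_n(a): the plain average of the rewards observed for an arm."""
    return sum(rewards) / len(rewards) if rewards else 0.0

print(sample_average(rewards_for_arm_3))   # 4.0 after two pulls
rewards_for_arm_3.append(1.0)              # third pull yields reward 1
print(sample_average(rewards_for_arm_3))   # 3.0, the estimate shifts downward
```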
Try pulling the arms again and see how the estimates $Q_n(a)$ evolve as sample averages:
[Interactive widget: 3-Armed Bandit (Sample Average)]
Incremental Implementation
Storing all past rewards to compute the average can be impractical. Instead, we can update the estimates incrementally:
$$Q_n(a) = Q_{n-1}(a) + \frac{1}{n}\left[R_n - Q_{n-1}(a)\right]$$
This update rule adjusts the estimate based on the error between the received reward and the current estimate. The incremental update can be generalized using a step size $\alpha \in (0, 1]$:
$$Q_n(a) = Q_{n-1}(a) + \alpha\left[R_n - Q_{n-1}(a)\right]$$
Using a constant step size $\alpha$ is beneficial in non-stationary problems (i.e., those where the reward distributions can change over time), because it gives more weight to recent rewards.
The general rule follows a recurring pattern in RL:
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right]$$
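A minimal sketch of this pattern, using illustrative rewards: with step size $1/n$ it reproduces the sample average exactly, while a constant step size keeps weighting recent rewards more heavily.

```python
def incremental_update(estimate, reward, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return estimate + step_size * (reward - estimate)

# Sample-average behaviour: step size 1/n shrinks with the pull count n.
q, n = 0.0, 0
for reward in [5.0, 3.0, 1.0]:   # illustrative rewards
    n += 1
    q = incremental_update(q, reward, step_size=1.0 / n)
print(q)  # 3.0, identical to averaging the three rewards

# Non-stationary behaviour: a constant step size never stops adapting.
q = 0.0
for reward in [5.0, 3.0, 1.0]:
    q = incremental_update(q, reward, step_size=0.5)
print(q)  # 1.875, tilted toward the most recent rewards
```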
Notice how the incremental update rule changes the estimated values depending on your chosen step size:
[Interactive widget: 3-Armed Bandit (Incremental)]
Exploration vs. Exploitation Trade-off
A fundamental challenge in RL is balancing exploration (trying new actions to gain more information) and exploitation (choosing the best-known action to maximize reward). If an agent always exploits, it may miss out on actions that could yield higher rewards because it never tries them. On the other hand, if it always explores, it may forgo the opportunity to accumulate higher rewards from known good actions. The dilemma arises because the agent cannot simultaneously explore and exploit; it must choose between gathering more information or leveraging existing knowledge to maximize immediate reward.
Epsilon-Greedy
One simple strategy is the epsilon-greedy method:
- With probability $\epsilon$, explore by choosing an action uniformly at random.
- With probability $1 - \epsilon$, exploit by choosing the best action.
To choose the best action, called the greedy action, we select the action with the highest estimated value:
$$A_t = \arg\max_a Q_t(a)$$
Despite being simple, this strategy is still widely used in RL.
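Here is a minimal sketch of epsilon-greedy selection over a list of current estimates; the Q values and epsilon below are placeholder numbers.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random arm with probability epsilon, otherwise the greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))   # explore: any arm, uniformly
    return q_values.index(max(q_values))      # exploit: the greedy action

q_values = [0.2, 1.5, 0.7]                 # illustrative estimates Q_t(a)
action = epsilon_greedy(q_values, epsilon=0.1)
print(action)                              # usually arm 1, occasionally random
```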
Optimistic Initial Values
Another approach to encourage exploration is to use optimistic initial values. By initializing the action-value estimates to high values, the agent is driven to explore actions in order to correct these overestimations (a short sketch follows the list below). But this approach comes with limitations:
- It encourages exploration mainly in the beginning.
- It may not adapt well if action values change over time.
- It is sensitive to the initial values chosen and requires knowledge of the reward range.
- It doesn't scale well to the function approximation case, as we'll see in later chapters.
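A minimal sketch, assuming Bernoulli rewards in [0, 1], an optimistic initial estimate of 5 for every arm, and a constant step size of 0.1 (all illustrative choices): the inflated estimates force a purely greedy agent to keep trying every arm until the optimism decays.

```python
import random

rng = random.Random(0)
true_means = [0.3, 0.6, 0.5]     # hidden Bernoulli arm means (illustrative)
q = [5.0, 5.0, 5.0]              # optimistic: far above any achievable reward
step_size = 0.1                  # constant step size, so optimism decays gradually
counts = [0, 0, 0]

for _ in range(100):
    arm = q.index(max(q))                                # purely greedy selection
    reward = 1.0 if rng.random() < true_means[arm] else 0.0
    counts[arm] += 1
    # Each update pulls the inflated estimate down toward the observed rewards,
    # so under-tried arms keep looking attractive until they too are sampled.
    q[arm] += step_size * (reward - q[arm])

print(counts, [round(v, 2) for v in q])
```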
Upper-Confidence Bound (UCB) Action Selection
The UCB method selects actions based on both the estimated value and the uncertainty in that estimate:
$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$
Where:
- $c > 0$ is a parameter controlling the degree of exploration.
- $N_t(a)$ is the number of times action $a$ has been selected.
- $t$ is the total number of pulls so far.
- Actions with higher uncertainty (less explored) get a higher bonus.
Intuitively, the bonus term $c\sqrt{\ln t / N_t(a)}$ starts large when $N_t(a)$ is small (encouraging exploration), but shrinks as you sample action $a$ more frequently.
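A minimal sketch of this selection rule; the estimates, counts, and the value of $c$ are illustrative placeholders.

```python
import math

def ucb_action(q_values, counts, t, c=2.0):
    """Pick the arm maximizing Q_t(a) + c * sqrt(ln t / N_t(a)).

    Arms that have never been tried get an infinite bonus, so each arm
    is sampled at least once before the formula takes over.
    """
    scores = []
    for q, n in zip(q_values, counts):
        bonus = math.inf if n == 0 else c * math.sqrt(math.log(t) / n)
        scores.append(q + bonus)
    return scores.index(max(scores))

q_values = [0.8, 1.2, 0.9]   # illustrative estimates
counts = [10, 2, 5]          # arm 1 is under-explored, so it gets a large bonus
print(ucb_action(q_values, counts, t=17))   # prints 1
```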
The advantages compared to epsilon-greedy include:
- It focuses exploration on actions with higher uncertainty.
- It naturally reduces exploration over time as uncertainty decreases.
Recap
1. What is the goal of a Reinforcement Learning (RL) agent?
2. What is the main challenge addressed by the exploration-exploitation trade-off?
3. In the k-armed bandit problem, what does the action value represent?
4. How does the epsilon-greedy strategy balance exploration and exploitation?
5. What is the key advantage of Upper-Confidence Bound (UCB) action selection over epsilon-greedy?
Gradient bandit algorithms and contextual bandits will be included in a future iteration of this chapter.
What's Next
In the next chapter, we'll introduce Markov Decision Processes (MDPs), which provide the mathematical framework for modeling sequential decision-making problems in RL. We'll explore how actions influence future states and rewards, and learn about policies, value functions, and the fundamental Bellman Equations that define optimal decision-making strategies.