
Introduction to Reinforcement Learning

info

In this chapter you'll be introduced to:

  • Reinforcement Learning (RL): Understanding the fundamentals of RL and its significance.
  • K-armed bandit problem: A simplified RL scenario used to explain more complex concepts.
  • Action-value estimation: Methods for estimating the value of actions to make informed decisions.
  • Exploration-exploitation trade-off: The dilemma between choosing the best-known action and exploring new actions that might yield higher rewards.

What is Reinforcement Learning?

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by performing actions in an environment to achieve some notion of cumulative reward. Unlike supervised learning, where the model is trained on a fixed dataset of labeled examples, RL involves learning through interaction with the environment, receiving feedback in the form of rewards or penalties.

At its core, RL is about an agent learning a policy—a mapping from states of the environment to actions the agent should take—to maximize some notion of cumulative reward over time. This learning process is inspired by behavioral psychology, where actions leading to positive outcomes are reinforced.

Sequential Decision Making with Evaluative Feedback

In RL, agents learn by interacting with an environment and receiving feedback in the form of rewards. Unlike supervised learning, where agents are given labeled examples, RL agents must discover the optimal actions through trial and error.
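
To make this loop tangible, here is a minimal sketch in Python with a purely random agent and a made-up reward function standing in for a real environment; every name and value in it is illustrative:

```python
import random

ACTIONS = ["left", "stay", "right"]  # made-up action set for illustration

def environment_step(action):
    """Toy environment: returns a noisy reward that depends on the chosen action."""
    base_reward = {"left": 0.0, "stay": 0.5, "right": 1.0}[action]
    return base_reward + random.gauss(0.0, 0.2)

total_reward = 0.0
for t in range(100):
    action = random.choice(ACTIONS)    # the agent acts (randomly here, for simplicity)
    reward = environment_step(action)  # evaluative feedback: a reward, not a label
    total_reward += reward             # a learning agent would also update its estimates here

print(f"Total reward after 100 steps: {total_reward:.1f}")
```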


The k-Armed Bandit Problem

A classic example to illustrate decision-making under uncertainty is the k-armed bandit problem. Imagine a slot machine (or "one-armed bandit") with multiple levers (arms). Each lever provides a reward drawn from a probability distribution unique to that lever. The goal is to maximize the total reward over time by choosing which lever to pull.

*Interactive demo: 3-Armed Bandit, tracking the total reward as you pull the arms.*

The key components are:

  • Actions ($A_t$): The levers to pull.
  • Rewards ($R_t$): The payoff received after taking an action at time $t$.
  • Action Value ($q_*(a)$): The expected reward of taking action $a$.

The action value is defined as:

$$q_*(a) = \mathbb{E}[R_t | A_t = a]$$
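
To make the setup concrete, here is a minimal sketch of a k-armed bandit in Python. The Gaussian reward noise and the specific arm means are illustrative assumptions, not part of the problem definition:

```python
import random

class KArmedBandit:
    """A k-armed bandit: each arm pays a reward drawn from its own distribution."""

    def __init__(self, true_values):
        # true_values[a] plays the role of q_*(a), the expected reward of arm a.
        self.true_values = true_values

    def pull(self, action):
        # Reward R_t: the arm's true value plus unit-variance Gaussian noise.
        return random.gauss(self.true_values[action], 1.0)

bandit = KArmedBandit(true_values=[0.2, 0.8, 0.5])  # a 3-armed example
print(bandit.pull(1))  # pull arm 1; rewards average about 0.8 over many pulls
```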

Estimating Action Values

To maximize rewards, the agent needs to estimate the value of each action. Since the true action values ($q_*(a)$) are unknown, the agent must learn them from experience.

Sample-Average Method

One simple method to estimate action values is the sample-average method, where the estimated value $\hat{Q}_n(a)$ is the average of the rewards received over the $n$ times action $a$ has been taken:

$$\hat{Q}_n(a) = \frac{\sum_{i=1}^{n} R_i}{n}$$
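
A minimal sketch of this estimator, assuming the same kind of illustrative Gaussian bandit and uniformly random action selection just to generate data. Note that it stores every reward ever observed:

```python
import random

NUM_ARMS = 3
TRUE_VALUES = [0.2, 0.8, 0.5]                    # illustrative q_*(a) values
rewards_per_arm = [[] for _ in range(NUM_ARMS)]  # every reward seen for each arm

for t in range(300):
    action = random.randrange(NUM_ARMS)              # pick an arm (at random here)
    reward = random.gauss(TRUE_VALUES[action], 1.0)  # sample R_t
    rewards_per_arm[action].append(reward)

# Sample-average estimate: Q_n(a) = (sum of rewards for a) / (times a was taken)
estimates = [sum(r) / len(r) if r else 0.0 for r in rewards_per_arm]
print(estimates)  # each estimate should land near the corresponding true value
```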

*Interactive demo: 3-Armed Bandit (Sample Average), showing the estimated value of each arm and the total reward.*

Incremental Implementation

Storing all past rewards can be impractical. Instead, we can update the estimates incrementally:

$$\hat{Q}_{n+1} = \hat{Q}_n + \frac{1}{n} (R_n - \hat{Q}_n)$$

This update rule adjusts the estimate based on the error between the received reward and the current estimate. The incremental update can be generalized using a step size α\alpha:

$$\hat{Q}_{n+1} = \hat{Q}_n + \alpha (R_n - \hat{Q}_n)$$

Using a constant step size (e.g., $\alpha = 0.1$) is beneficial in non-stationary problems, where the reward distributions change over time, because it gives more weight to recent rewards.

The general rule follows a recurrent pattern in RL:

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} [\text{Target} - \text{OldEstimate}]$$
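
A sketch of the incremental update on an illustrative Gaussian bandit; only the running estimates and pull counts are stored, never the full reward history. The constant step-size variant is shown as a comment:

```python
import random

NUM_ARMS = 3
TRUE_VALUES = [0.2, 0.8, 0.5]   # illustrative q_*(a) values
Q = [0.0] * NUM_ARMS            # current estimates Q_n(a)
N = [0] * NUM_ARMS              # how many times each arm has been pulled

for t in range(1000):
    a = random.randrange(NUM_ARMS)
    r = random.gauss(TRUE_VALUES[a], 1.0)
    N[a] += 1
    # Sample-average form: step size 1/n gives equal weight to every reward.
    Q[a] += (1.0 / N[a]) * (r - Q[a])
    # For a non-stationary problem one would instead use a constant step size,
    # which weights recent rewards more heavily:
    # Q[a] += 0.1 * (r - Q[a])

print([round(q, 2) for q in Q])
```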

*Interactive demo: 3-Armed Bandit (Incremental), showing the estimated value of each arm and the total reward.*

Exploration vs. Exploitation Trade-off

A fundamental challenge in RL is balancing exploration (trying new actions to gain more information) and exploitation (choosing the best-known action to maximize reward). If an agent always exploits, it may miss actions that could yield higher rewards because it never tries them. If it always explores, it forgoes the opportunity to accumulate reward from actions it already knows to be good. The dilemma arises because the agent cannot explore and exploit at the same time; at each step it must choose between gathering more information and leveraging existing knowledge to maximize immediate reward.

Epsilon-Greedy

One simple strategy is the epsilon-greedy method:

  • With probability $\epsilon$, explore by choosing an action at random.
  • With probability $1 - \epsilon$, exploit by choosing the best action.

To exploit, we select the greedy action, that is, the action with the highest estimated value:

$$A_t \doteq \underset{a}{\operatorname{argmax}} \hat{Q}_t(a)$$
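
Putting the pieces together, here is a minimal epsilon-greedy sketch over an illustrative Gaussian bandit; the value $\epsilon = 0.1$ and the arm means are assumptions made for the example:

```python
import random

NUM_ARMS = 3
TRUE_VALUES = [0.2, 0.8, 0.5]   # illustrative q_*(a) values
EPSILON = 0.1                   # exploration probability
Q = [0.0] * NUM_ARMS
N = [0] * NUM_ARMS
total_reward = 0.0

for t in range(1000):
    if random.random() < EPSILON:
        a = random.randrange(NUM_ARMS)                # explore: random action
    else:
        a = max(range(NUM_ARMS), key=lambda i: Q[i])  # exploit: greedy action
    r = random.gauss(TRUE_VALUES[a], 1.0)
    N[a] += 1
    Q[a] += (1.0 / N[a]) * (r - Q[a])                 # incremental sample average
    total_reward += r

print(f"total reward: {total_reward:.1f}, estimates: {[round(q, 2) for q in Q]}")
```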

Optimistic Initial Values

Another approach to encourage exploration is to use optimistic initial values. By initializing action-value estimates to high values, the agent is driven to explore actions to correct these overestimations. But this approach comes with limitations:

  • It encourages exploration mainly in the beginning.
  • It may not adapt well if action values change over time.
  • It is sensitive to the choice of initial values and requires knowledge of the reward range.
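
In code, the only change relative to the sketches above is the initialization of the estimates; the optimistic value 5.0 below is an illustrative guess chosen to sit well above the assumed reward range:

```python
import random

NUM_ARMS = 3
TRUE_VALUES = [0.2, 0.8, 0.5]   # illustrative q_*(a) values
ALPHA = 0.1                     # constant step size, so the optimism decays gradually
Q = [5.0] * NUM_ARMS            # optimistic start, well above the assumed reward range

for t in range(1000):
    a = max(range(NUM_ARMS), key=lambda i: Q[i])   # purely greedy selection
    r = random.gauss(TRUE_VALUES[a], 1.0)
    # Each pull typically returns far less than the optimistic estimate, so the
    # estimate drops and another arm becomes greedy: exploration without epsilon.
    Q[a] += ALPHA * (r - Q[a])

print([round(q, 2) for q in Q])
```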

Upper-Confidence Bound (UCB) Action Selection

The UCB method selects actions based on both the estimated value and the uncertainty in that estimate:

$$A_t \doteq \underset{a}{\operatorname{argmax}} \left[ \hat{Q}_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

Where:

  • $c$ is a parameter controlling the degree of exploration.
  • $N_t(a)$ is the number of times action $a$ has been selected.
  • Actions with higher uncertainty (less explored) get a higher bonus.

The advantages compared to epsilon-greedy include:

  • Exploration is focused on the actions whose estimates are most uncertain.
  • Exploration naturally decreases over time as uncertainty shrinks.
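
A sketch of UCB selection on the same illustrative bandit; untried arms are selected first so that $N_t(a) > 0$ before the bonus is computed, and $c = 2$ is an arbitrary choice for the example:

```python
import math
import random

NUM_ARMS = 3
TRUE_VALUES = [0.2, 0.8, 0.5]   # illustrative q_*(a) values
C = 2.0                         # exploration parameter c
Q = [0.0] * NUM_ARMS
N = [0] * NUM_ARMS

def ucb_action(t):
    # Any arm never tried yet is treated as maximally uncertain and picked first.
    for a in range(NUM_ARMS):
        if N[a] == 0:
            return a
    # Otherwise pick the arm maximizing estimate + exploration bonus.
    return max(range(NUM_ARMS),
               key=lambda a: Q[a] + C * math.sqrt(math.log(t) / N[a]))

for t in range(1, 1001):
    a = ucb_action(t)
    r = random.gauss(TRUE_VALUES[a], 1.0)
    N[a] += 1
    Q[a] += (1.0 / N[a]) * (r - Q[a])

print([round(q, 2) for q in Q], N)  # the best arm should dominate the pull counts
```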

Recap

1. What is the goal of a Reinforcement Learning (RL) agent?

2. What is the main challenge addressed by the exploration-exploitation trade-off?

3. In the k-armed bandit problem, what does the action value $q_*(a)$ represent?

4. How does the epsilon-greedy strategy balance exploration and exploitation?

5. What is the key advantage of Upper-Confidence Bound (UCB) action selection over epsilon-greedy?

WIP

Gradient bandit algorithms and contextual bandits will be included in a future iteration of this chapter.

What's Next

In the next chapter, we'll introduce Markov Decision Processes (MDPs), which provide the mathematical framework for modeling sequential decision-making problems in RL. We'll explore how actions influence future states and rewards, and learn about policies, value functions, and the fundamental Bellman Equations that define optimal decision-making strategies.