Introduction to Reinforcement Learning
In this chapter you'll be introduced to:
- Reinforcement Learning (RL): Understanding the fundamentals of RL and its significance.
- K-armed bandit problem: A simplified RL scenario used to explain more complex concepts.
- Action-value estimation: Methods for estimating the value of actions to make informed decisions.
- Exploration-exploitation trade-off: The dilemma between choosing the best-known action and exploring new actions that might yield higher rewards.
What is Reinforcement Learning?
Reinforcement Learning (RL) is a field of artificial intelligence that studies problems where an agent learns to make decisions by performing actions in an environment to achieve a goal. At its core, RL is about an agent learning a policy (a mapping from observations of the environment to actions) to maximize some notion of cumulative reward over time. Unlike supervised learning, where agents are given labeled examples, RL agents must discover the optimal actions through trial and error.
You can think of an RL agent as a student trying to get the best grade on the midterm: when it tries a study strategy and gets a good grade, it is more likely to repeat that strategy on the final, and when a strategy fails, it avoids it next time. More formally, agents learn by interacting with an environment and receiving feedback in the form of rewards.
The k-Armed Bandit Problem
A classic example to illustrate decision-making under uncertainty is the k-armed bandit problem. Imagine a slot machine (or "bandit") with multiple levers (arms). Each lever provides a reward drawn from a probability distribution unique to that lever. The goal is to maximize the total reward over time by choosing which lever to pull.
Try pulling each lever to see if you can find out which one is the best:
[Interactive widget: 3-Armed Bandit]
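If you'd rather experiment in code than with the widget above, here is a minimal sketch of a k-armed bandit, assuming each arm pays a reward drawn from a Gaussian with a hidden mean; the class name and reward distributions are illustrative choices, not part of the interactive demo.

```python
import random

class KArmedBandit:
    """A k-armed bandit where each arm's reward is drawn from a
    Gaussian with a hidden mean (illustrative assumption)."""

    def __init__(self, k=3, seed=0):
        self.rng = random.Random(seed)
        # Hidden true action values, one per arm.
        self.true_values = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, arm):
        """Return a noisy reward for the chosen arm."""
        return self.rng.gauss(self.true_values[arm], 1.0)

# Pull each lever a couple of times and track the total reward.
bandit = KArmedBandit(k=3)
total_reward = 0.0
for arm in [0, 1, 2, 0, 1, 2]:
    total_reward += bandit.pull(arm)
print(f"Total reward after 6 pulls: {total_reward:.2f}")
```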
The key components are:
- Actions ($A_t$): The levers to pull.
- Rewards ($R_t$): The payoff received after taking an action at time $t$.
- Action Value ($q_*(a)$): The expected reward of taking action $a$.
The action value is defined as:
$$q_*(a) = \mathbb{E}\left[R_t \mid A_t = a\right]$$
In practice, you don't know $q_*(a)$ upfront; you only learn it by pulling arm $a$ repeatedly and averaging the outcomes.
Estimating Action Values
To maximize rewards, the agent needs to estimate the value of each action. Since the true action values $q_*(a)$ are unknown, the agent must learn them from experience.
Sample-Average Method
One simple method to estimate action values is the sample-average method, where the estimated value $Q_n(a)$ is the average of the rewards received over the $n$ times action $a$ was taken:
$$Q_n(a) = \frac{R_1 + R_2 + \cdots + R_n}{n}$$
(where $R_i$ is the $i$-th reward observed when choosing action $a$).
Suppose arm 3 has been pulled twice and the two rewards sum to 8, so the current estimate is
$$Q_2(3) = \frac{R_1 + R_2}{2} = \frac{8}{2} = 4.$$
If you pull arm 3 a third time and observe $R_3 = 1$, the update becomes
$$Q_3(3) = \frac{R_1 + R_2 + R_3}{3} = \frac{8 + 1}{3} = 3.$$
Notice how the new estimate shifts downward because the third reward (1) is below the previous average (4).
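In code, the sample-average estimate is just the running mean of the rewards stored for each arm. The sketch below mirrors the worked example; the individual rewards (5 and 3) are assumed values chosen only so that they average to 4.

```python
# Rewards observed so far for arm 3 (assumed values; the example above
# only tells us that they average to 4).
rewards_for_arm_3 = [5.0, 3.0]

def sample_average(rewards):
    """Q_n(a): the plain average of the rewards observed for an arm."""
    return sum(rewards) / len(rewards) if rewards else 0.0

print(sample_average(rewards_for_arm_3))   # 4.0 after two pulls
rewards_for_arm_3.append(1.0)              # third pull yields reward 1
print(sample_average(rewards_for_arm_3))   # 3.0, the estimate shifts downward
```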
Try pulling the arms again and see how the estimates $Q_n(a)$ evolve as sample averages:
[Interactive widget: 3-Armed Bandit (Sample Average)]
Incremental Implementation
Storing all past rewards to compute the average can be impractical. Instead, we can update the estimates incrementally:
$$Q_n(a) = Q_{n-1}(a) + \frac{1}{n}\left[R_n - Q_{n-1}(a)\right]$$
This update rule adjusts the estimate based on the error between the received reward and the current estimate. The incremental update can be generalized using a step size $\alpha \in (0, 1]$:
$$Q_n(a) = Q_{n-1}(a) + \alpha\left[R_n - Q_{n-1}(a)\right]$$
Using a constant step size $\alpha$ is beneficial in non-stationary problems (i.e., those where the reward distributions can change over time), because it gives more weight to recent rewards.
The general rule follows a recurring pattern in RL:
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right]$$
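A minimal sketch of this pattern, using illustrative rewards: with step size $1/n$ it reproduces the sample average exactly, while a constant step size keeps weighting recent rewards more heavily.

```python
def incremental_update(estimate, reward, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return estimate + step_size * (reward - estimate)

# Sample-average behaviour: step size 1/n shrinks with the pull count n.
q, n = 0.0, 0
for reward in [5.0, 3.0, 1.0]:   # illustrative rewards
    n += 1
    q = incremental_update(q, reward, step_size=1.0 / n)
print(q)  # 3.0, identical to averaging the three rewards

# Non-stationary behaviour: a constant step size never stops adapting.
q = 0.0
for reward in [5.0, 3.0, 1.0]:
    q = incremental_update(q, reward, step_size=0.5)
print(q)  # 1.875, tilted toward the most recent rewards
```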
Notice how the incremental update rule changes the estimated values depending on your chosen step size:
[Interactive widget: 3-Armed Bandit (Incremental)]
Exploration vs. Exploitation Trade-off
A fundamental challenge in RL is balancing exploration (trying new actions to gain more information) and exploitation (choosing the best-known action to maximize reward). If an agent always exploits, it may miss out on actions that could yield higher rewards because it never tries them. On the other hand, if it always explores, it may forgo the opportunity to accumulate higher rewards from known good actions. The dilemma arises because the agent cannot simultaneously explore and exploit; it must choose between gathering more information or leveraging existing knowledge to maximize immediate reward.
Epsilon-Greedy
One simple strategy is the epsilon-greedy method:
- With probability $\epsilon$, explore by choosing an action uniformly at random.
- With probability $1 - \epsilon$, exploit by choosing the best action.
To choose the best action, called the greedy action, we select the action with the highest estimated value:
$$A_t = \arg\max_a Q_t(a)$$
Despite being simple, this strategy is still widely used in RL.
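Here is a minimal sketch of epsilon-greedy selection over a list of current estimates; the Q values and epsilon below are placeholder numbers.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random arm with probability epsilon, otherwise the greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))   # explore: any arm, uniformly
    return q_values.index(max(q_values))      # exploit: the greedy action

q_values = [0.2, 1.5, 0.7]                 # illustrative estimates Q_t(a)
action = epsilon_greedy(q_values, epsilon=0.1)
print(action)                              # usually arm 1, occasionally random
```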
Optimistic Initial Values
Another approach to encourage exploration is to use optimistic initial values. By initializing the action-value estimates to high values, the agent is driven to explore actions in order to correct these overestimations (a short sketch follows the list below). But this approach comes with limitations:
- It encourages exploration mainly in the beginning.
- It may not adapt well if action values change over time.
- It is sensitive to the initial values chosen and requires knowledge of the reward range.
- It doesn't scale well to the function approximation case, as we'll see in later chapters.
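A minimal sketch, assuming Bernoulli rewards in [0, 1], an optimistic initial estimate of 5 for every arm, and a constant step size of 0.1 (all illustrative choices): the inflated estimates force a purely greedy agent to keep trying every arm until the optimism decays.

```python
import random

rng = random.Random(0)
true_means = [0.3, 0.6, 0.5]     # hidden Bernoulli arm means (illustrative)
q = [5.0, 5.0, 5.0]              # optimistic: far above any achievable reward
step_size = 0.1                  # constant step size, so optimism decays gradually
counts = [0, 0, 0]

for _ in range(100):
    arm = q.index(max(q))                                # purely greedy selection
    reward = 1.0 if rng.random() < true_means[arm] else 0.0
    counts[arm] += 1
    # Each update pulls the inflated estimate down toward the observed rewards,
    # so under-tried arms keep looking attractive until they too are sampled.
    q[arm] += step_size * (reward - q[arm])

print(counts, [round(v, 2) for v in q])
```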
Upper-Confidence Bound (UCB) Action Selection
The UCB method selects actions based on both the estimated value and the uncertainty in that estimate:
$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$
Where:
- $c > 0$ is a parameter controlling the degree of exploration.
- $N_t(a)$ is the number of times action $a$ has been selected.
- $t$ is the total number of pulls so far.
- Actions with higher uncertainty (less explored) get a higher bonus.
Intuitively, the bonus term $c\sqrt{\ln t / N_t(a)}$ starts large when $N_t(a)$ is small (encouraging exploration), but shrinks as you sample action $a$ more frequently.
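A minimal sketch of this selection rule; the estimates, counts, and the value of $c$ are illustrative placeholders.

```python
import math

def ucb_action(q_values, counts, t, c=2.0):
    """Pick the arm maximizing Q_t(a) + c * sqrt(ln t / N_t(a)).

    Arms that have never been tried get an infinite bonus, so each arm
    is sampled at least once before the formula takes over.
    """
    scores = []
    for q, n in zip(q_values, counts):
        bonus = math.inf if n == 0 else c * math.sqrt(math.log(t) / n)
        scores.append(q + bonus)
    return scores.index(max(scores))

q_values = [0.8, 1.2, 0.9]   # illustrative estimates
counts = [10, 2, 5]          # arm 1 is under-explored, so it gets a large bonus
print(ucb_action(q_values, counts, t=17))   # prints 1
```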
The advantages compared to epsilon-greedy include:
- It focuses exploration on actions with higher uncertainty.
- It naturally reduces exploration over time as uncertainty decreases.
Recap
1. What is the goal of a Reinforcement Learning (RL) agent?
2. What is the main challenge addressed by the exploration-exploitation trade-off?
3. In the k-armed bandit problem, what does the action value represent?
4. How does the epsilon-greedy strategy balance exploration and exploitation?
5. What is the key advantage of Upper-Confidence Bound (UCB) action selection over epsilon-greedy?
Gradient bandit algorithms and contextual bandits will be included in a future iteration of this chapter.
What's Next
In the next chapter, we'll introduce Markov Decision Processes (MDPs), which provide the mathematical framework for modeling sequential decision-making problems in RL. We'll explore how actions influence future states and rewards, and learn about policies, value functions, and the fundamental Bellman Equations that define optimal decision-making strategies.