Introduction to Reinforcement Learning
In this chapter you'll be introduced to:
- Reinforcement Learning (RL): Understanding the fundamentals of RL and its significance.
- K-armed bandit problem: A simplified RL scenario used to explain more complex concepts.
- Action-value estimation: Methods for estimating the value of actions to make informed decisions.
- Exploration-exploitation trade-off: The dilemma between choosing the best-known action and exploring new actions that might yield higher rewards.
What is Reinforcement Learning?
Reinforcement Learning (RL) is a type of machine learning in which an agent learns to make decisions by performing actions in an environment to maximize cumulative reward. Unlike supervised learning, where the model is trained on a fixed dataset of labeled examples, RL involves learning through interaction with the environment, receiving feedback in the form of rewards or penalties.
At its core, RL is about an agent learning a policy—a mapping from states of the environment to actions the agent should take—to maximize some notion of cumulative reward over time. This learning process is inspired by behavioral psychology, where actions leading to positive outcomes are reinforced.
Sequential Decision Making with Evaluative Feedback
In RL, agents learn by interacting with an environment and receiving feedback in the form of rewards. Unlike supervised learning, where agents are given labeled examples, RL agents must discover the optimal actions through trial and error.
The k-Armed Bandit Problem
A classic example to illustrate decision-making under uncertainty is the k-armed bandit problem. Imagine a slot machine (or "one-armed bandit") with multiple levers (arms). Each lever provides a reward drawn from a probability distribution unique to that lever. The goal is to maximize the total reward over time by choosing which lever to pull.
[Interactive demo: 3-armed bandit]
The key components are:
- Actions ($A_t$): The levers to pull at time step $t$.
- Rewards ($R_t$): The payoff received after taking an action at time $t$.
- Action Value ($q_*(a)$): The expected reward of taking action $a$.
The action value is defined as:

$$q_*(a) = \mathbb{E}\left[R_t \mid A_t = a\right]$$
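To make these components concrete, here is a minimal sketch of a k-armed bandit environment in Python. The class name `KArmedBandit` and the choice of Gaussian reward distributions are illustrative assumptions rather than part of the problem definition:

```python
import numpy as np

class KArmedBandit:
    """A k-armed bandit whose arms pay Gaussian rewards (illustrative assumption)."""

    def __init__(self, k=3, seed=None):
        self.rng = np.random.default_rng(seed)
        # True action values q*(a); the agent never observes these directly.
        self.q_star = self.rng.normal(loc=0.0, scale=1.0, size=k)

    def pull(self, action):
        # The reward R_t is drawn around the true value of the chosen arm.
        return self.rng.normal(loc=self.q_star[action], scale=1.0)

bandit = KArmedBandit(k=3, seed=42)
print(bandit.pull(0))  # one sample reward from arm 0
```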
Estimating Action Values
To maximize rewards, the agent needs to estimate the value of each action. Since the true action values $q_*(a)$ are unknown, the agent must learn them from experience.
Sample-Average Method
One simple method to estimate action values is the sample-average method, where the estimate $Q_n(a)$ is the average of the rewards received over the $n$ times action $a$ has been taken so far:

$$Q_n(a) = \frac{R_1 + R_2 + \cdots + R_n}{n}$$
[Interactive demo: 3-armed bandit with sample-average estimates]
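A direct, if memory-hungry, way to compute these estimates is to store every reward observed for each arm and average them. The sketch below is illustrative; it assumes the `KArmedBandit` environment sketched earlier and picks arms uniformly at random just to exercise the estimator:

```python
import numpy as np

def run_sample_average(bandit, k=3, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    rewards_per_arm = [[] for _ in range(k)]   # every reward seen, per arm
    q_estimates = np.zeros(k)                  # Q(a) for each arm

    for _ in range(steps):
        action = int(rng.integers(k))          # uniform choice, for illustration only
        reward = bandit.pull(action)
        rewards_per_arm[action].append(reward)
        # Sample-average estimate: mean of all rewards observed for this arm.
        q_estimates[action] = np.mean(rewards_per_arm[action])

    return q_estimates

print(run_sample_average(KArmedBandit(k=3, seed=42)))
```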
Incremental Implementation
Storing all past rewards can be impractical. Instead, we can update the estimates incrementally:

$$Q_n = Q_{n-1} + \frac{1}{n}\left[R_n - Q_{n-1}\right]$$

This update rule adjusts the estimate based on the error between the received reward and the current estimate. The incremental update can be generalized using a step size $\alpha$:

$$Q_n = Q_{n-1} + \alpha\left[R_n - Q_{n-1}\right]$$

Using a constant step size (e.g., $\alpha = 0.1$) is beneficial in non-stationary problems, where the reward distributions change over time, because it gives more weight to recent rewards.

The general rule follows a recurring pattern in RL:

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right]$$
[Interactive demo: 3-armed bandit with incremental updates]
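The same estimates can be maintained without storing any reward history: each arm needs only a count and a running estimate. The `IncrementalEstimator` helper below is a sketch; passing `alpha=None` reproduces the 1/n sample-average step size, while a constant `alpha` gives the non-stationary variant described above:

```python
import numpy as np

class IncrementalEstimator:
    """Running action-value estimates updated as Q <- Q + step * (R - Q)."""

    def __init__(self, k, alpha=None, initial_value=0.0):
        self.q = np.full(k, initial_value, dtype=float)  # Q(a) estimates
        self.n = np.zeros(k, dtype=int)                  # N(a), selection counts
        self.alpha = alpha                               # None -> use the 1/n step size

    def update(self, action, reward):
        self.n[action] += 1
        step = self.alpha if self.alpha is not None else 1.0 / self.n[action]
        # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
        self.q[action] += step * (reward - self.q[action])

est = IncrementalEstimator(k=3)
est.update(action=1, reward=0.7)
print(est.q)  # [0.0, 0.7, 0.0]
```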
Exploration vs. Exploitation Trade-off
A fundamental challenge in RL is balancing exploration (trying new actions to gain more information) and exploitation (choosing the best-known action to maximize reward). If an agent always exploits, it may miss out on actions that could yield higher rewards because it never tries them. On the other hand, if it always explores, it forgoes the opportunity to accumulate higher rewards from known good actions. The dilemma arises because the agent cannot explore and exploit at the same time: at each step it must choose between gathering more information and leveraging existing knowledge to maximize immediate reward.
Epsilon-Greedy
One simple strategy is the epsilon-greedy method:
- With probability $\epsilon$, explore by choosing an action at random.
- With probability $1 - \epsilon$, exploit by choosing the best action.

To choose the best action, called the greedy action, we select the action with the highest estimated value:

$$A_t = \underset{a}{\arg\max}\; Q_t(a)$$
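A minimal epsilon-greedy selection rule over an array of estimates might look like the sketch below; the function name and the random tie-breaking among greedy actions are my own choices rather than a fixed recipe:

```python
import numpy as np

def epsilon_greedy(q_estimates, epsilon, rng=None):
    """With probability epsilon explore at random, otherwise pick a greedy action."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))            # explore
    best = np.flatnonzero(q_estimates == q_estimates.max())   # all greedy actions
    return int(rng.choice(best))                              # exploit, break ties randomly

action = epsilon_greedy(np.array([0.2, 0.5, 0.1]), epsilon=0.1)
```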
Optimistic Initial Values
Another approach to encourage exploration is to use optimistic initial values. By initializing action-value estimates to high values, the agent is driven to explore actions to correct these overestimations (see the sketch after this list). But this approach comes with limitations:
- It encourages exploration mainly in the beginning.
- It may not adapt well if action values change over time.
- It is sensitive to the choice of initial values and requires knowledge of the reward range.
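With the incremental estimator sketched earlier, optimistic initialization only changes how the estimates start. The snippet below reuses that class; the value 5.0 is an arbitrary assumption for rewards roughly in the [0, 1] range:

```python
# Start every estimate well above any reward the arms can actually pay, so even a
# purely greedy agent keeps switching arms early on as its optimism is corrected.
optimistic = IncrementalEstimator(k=3, alpha=0.1, initial_value=5.0)

# Greedy selection: each disappointing reward lowers that arm's estimate,
# nudging the agent toward the arms it has not tried yet.
greedy_action = int(np.argmax(optimistic.q))
```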
Upper-Confidence Bound (UCB) Action Selection
The UCB method selects actions based on both the estimated value and the uncertainty in that estimate:

$$A_t = \underset{a}{\arg\max}\left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

Where:
- $c$ is a parameter controlling the degree of exploration.
- $N_t(a)$ is the number of times action $a$ has been selected.
- Actions with higher uncertainty (less explored) get a higher bonus.
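A sketch of UCB selection over NumPy arrays of estimates and counts follows; selecting untried arms first is a common convention assumed here, since their bonus would otherwise divide by zero:

```python
import numpy as np

def ucb_action(q_estimates, counts, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln(t) / N(a)) ], trying unvisited arms first."""
    q_estimates = np.asarray(q_estimates, dtype=float)
    counts = np.asarray(counts, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])                       # N(a) = 0: maximal uncertainty
    bonus = c * np.sqrt(np.log(t) / counts)          # shrinks as an arm is selected more
    return int(np.argmax(q_estimates + bonus))

action = ucb_action(q_estimates=[0.2, 0.5, 0.1], counts=[3, 5, 2], t=10)
```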
The advantages compared to epsilon-greedy include:
- It focuses exploration on actions with higher uncertainty rather than exploring uniformly at random.
- It naturally reduces exploration over time as uncertainty decreases.
Recap
1. What is the goal of a Reinforcement Learning (RL) agent?
2. What is the main challenge addressed by the exploration-exploitation trade-off?
3. In the k-armed bandit problem, what does the action value represent?
4. How does the epsilon-greedy strategy balance exploration and exploitation?
5. What is the key advantage of Upper-Confidence Bound (UCB) action selection over epsilon-greedy?
Gradient bandit algorithms and contextual bandits will be added in a future iteration of this chapter.
What's Next
In the next chapter, we'll introduce Markov Decision Processes (MDPs), which provide the mathematical framework for modeling sequential decision-making problems in RL. We'll explore how actions influence future states and rewards, and learn about policies, value functions, and the fundamental Bellman Equations that define optimal decision-making strategies.