
Foundations of Reinforcement Learning

This first post covers Markov Decision Processes (MDPs), exact solution methods and maximum entropy formulation. We'll define the main components of MDPs—states, actions, transition probabilities, and rewards—and look at exact solution methods like dynamic programming. While these methods work well for small-scale problems, they don't scale effectively to larger ones, pointing to the need for advanced techniques that we'll discuss in later posts.

Motivation

The examples below illustrate the potential of RL. It's not just about the impressive final results but the fact that these agents acquire skills through their own trial-and-error learning, demonstrating robust learning capabilities that can adapt to new tasks. This progress in deep RL is both exciting and promising for the future of artificial intelligence and robotics.


Markov Decision Processes (MDPs)

RL operates within the framework of Markov Decision Processes (MDPs). This framework provides a structured environment where an agent interacts by choosing actions that influence the state of the environment, which in turn provides feedback in the form of rewards.

In an MDP, the agent's interaction cycle involves observing the current state, taking an action, and receiving a reward based on the new state. This reward assesses how favorable the new state is. For example, in a video game, the score might serve as the reward, while for a running robot, the distance covered could determine the reward. The agent's objective is to maximize its cumulative reward by learning the optimal action for each situation through repeated interactions.


The key components that define an MDP are:

  1. Set of States ($S$): The different situations the agent can encounter.
  2. Set of Actions ($A$): The possible actions the agent can take.
  3. Transition Function ($P(s'|s,a)$): This defines the probability of transitioning to a new state $s'$ given the current state $s$ and action $a$.
  4. Reward Function ($R(s,a,s')$): This assigns a reward based on the transition from state $s$ to state $s'$ via action $a$.
  5. Start State ($s_0$): The initial state or distribution of states where the agent begins.
  6. Discount Factor ($\gamma$): This factor accounts for the preference of immediate rewards over future rewards. For example, if rewards are monetary, money available now can be invested to grow, making future money less valuable in comparison. Thus, future rewards are discounted.
  7. Horizon ($H$): The length of time over which the agent will operate and make decisions.
Goal

The goal is for the agent to learn a policy that maximizes its expected discounted reward over time:

$$\max_{\pi}\ \mathbb{E} \bigg[\sum_{t=0}^H \gamma^t \cdot R(S_t, A_t, S_{t+1}) \,\Big|\, \pi \bigg]$$
Examples
  1. Consider a cleaning robot where the states include the robot's position and the layout of the house. Actions might include moving or picking up objects. The transition model describes the outcome probabilities of these actions, and the reward function could be based on cleanliness or organization.

  2. For a running robot, states include joint angles and velocities, with motor torques as actions. The reward might be for staying still or moving forward.

    In both cases, the discount factor and horizon are chosen based on the desired timeframe for evaluating performance.

  3. Other examples include balancing a pole, playing games like Tetris or Go, managing server requests, or navigating self-driving cars. Each of these problems can be mapped onto the MDP framework, allowing the use of reinforcement learning algorithms to find optimal policies.

Grid World

Throughout the course, a canonical example called grid world will be used to build intuition about MDPs and reinforcement learning algorithms. In grid world, an agent navigates a grid to reach a goal while avoiding obstacles and pitfalls, with rewards assigned to reaching specific squares.

In our grid world, the agent can move in four directions. If it tries to move out of the grid or into an obstacle, it stays in the same cell. If it lands on the minus-one square, it receives a reward of -1 and the episode ends; if it lands on the plus-one square, it receives a reward of +1 and the episode ends.
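To make this concrete, here is a minimal Python sketch of such a grid world. The 4x3 layout, coordinate convention, and reward placement are assumptions chosen to match the example used below, not the exact environment from the original interactive demo.

```python
# A minimal 4x3 grid world sketch (assumed layout: +1 at (4,3), -1 at (4,2),
# an obstacle at (2,2)); columns are x in 1..4, rows are y in 1..3.
WIDTH, HEIGHT = 4, 3
OBSTACLES = {(2, 2)}
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Deterministic transition: bumping into walls or obstacles leaves the agent in place."""
    if state in TERMINALS:
        return state, 0.0, True          # episode has already ended
    dx, dy = ACTIONS[action]
    next_state = (state[0] + dx, state[1] + dy)
    if not (1 <= next_state[0] <= WIDTH and 1 <= next_state[1] <= HEIGHT) or next_state in OBSTACLES:
        next_state = state               # blocked: stay in the same cell
    reward = TERMINALS.get(next_state, 0.0)
    return next_state, reward, next_state in TERMINALS

# Example: starting at (1,1) and moving right twice.
s = (1, 1)
for a in ("right", "right"):
    s, r, done = step(s, a)
    print(s, r, done)
```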


Exact Solution Methods

Value Iteration

Optimal Value Function $V^*$

The core concept of value iteration revolves around the optimal value function $V^*(s)$. This function represents the maximum expected discounted sum of rewards that can be achieved from state $s$ by following the optimal policy. Essentially, it quantifies the best possible outcome starting from any given state.

$$V^*(s) = \max_{\pi}\ \mathbb{E}\bigg[\sum_{t=0}^H \gamma^t \cdot R(s_t, a_t, s_{t+1}) \,\Big|\, \pi, s_0=s\bigg]$$
Grid World

Let's imagine three different scenarios:

  1. In a deterministic environment with no discounting ($\gamma = 1$) and a horizon $H$ of 100 steps, we can calculate the value function as follows:

    • $V^*(4,3)$: The agent immediately collects the +1 reward.
    • $V^*(3,3)$: The agent moves right to collect the +1 reward in one step.
    • $V^*(2,3)$: The agent moves right twice to collect the reward, still yielding a value of +1 due to no discounting.
    • $V^*(1,1)$: It takes five steps to reach the reward, but the value is still +1.
    • $V^*(4,2)$: The agent is stuck in the bomb and receives a reward of -1.

    | 1.00 | 1.00 | 1.00 |  1.00 |
    | 1.00 | 0.00 | 1.00 | -1.00 |
    | 1.00 | 1.00 | 1.00 |  1.00 |


  2. Introducing a discount factor $\gamma$ of 0.9, where actions are still deterministic:

    • $V^*(4,3)$: The value remains 1 because the reward is immediate.
    • $V^*(3,3)$: The reward is delayed by one step, so the value is $0.9^1 \times 1 = 0.9$.
    • $V^*(2,3)$: The reward is delayed by two steps, so the value is $0.9^2 \times 1 = 0.81$.
    • $V^*(1,1)$: The value is $0.9^5 \times 1 \approx 0.59$.
    • $V^*(4,2)$: The value remains -1 due to the bomb.

    | 0.73 | 0.81 | 0.90 |  1.00 |
    | 0.66 | 0.00 | 0.81 | -1.00 |
    | 0.59 | 0.66 | 0.73 |  0.66 |


  3. Adding further complexity, if the action success probability drops to 0.8, the agent faces uncertainty:

    • $V^*(4,3)$: The value remains 1 since grabbing the prize is certain.
    • $V^*(3,3)$: Moving right has an 80% success rate. If successful, the value is $0.9 \times 1$. With a 20% chance, the agent may stay in place (when trying to move up), move down, or move left, affecting the value recursively based on neighboring squares.

This recursive dependency on neighboring values is the essence of value iteration. The value of each state is updated based on expected values from possible actions and their outcomes, iterating until convergence.

Algorithm

Start with $V^*_0(s)=0$ for all $s$.
For $k=1, \cdots, H$ steps:
        For all states $s$ in $S$:

$$\begin{gather} V_k^*(s) \leftarrow \max_a \sum_{s'}P(s'|s,a)\big(R(s,a,s')+\gamma V^*_{k-1}(s')\big) \\ \pi_k^*(s) \leftarrow \underset{a}{\operatorname{argmax}} \sum_{s'}P(s'|s,a)\big(R(s,a,s')+\gamma V^*_{k-1}(s')\big) \end{gather}$$
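Below is a minimal tabular value-iteration sketch in Python. It assumes the MDP is given as transition and reward arrays `P[s, a, s']` and `R[s, a, s']`; the tiny two-state MDP at the bottom is a made-up example used only to exercise the code.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, horizon=100):
    """Tabular value iteration.

    P[s, a, s'] : transition probabilities
    R[s, a, s'] : rewards
    Returns the value function V and a greedy policy (one action per state).
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                       # V_0(s) = 0 for all s
    for _ in range(horizon):
        # Q_k(s, a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V_{k-1}(s'))
        Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
        V = Q.max(axis=1)                        # V_k(s)  = max_a Q_k(s, a)
        policy = Q.argmax(axis=1)                # pi_k(s) = argmax_a Q_k(s, a)
    return V, policy

# Tiny made-up 2-state, 2-action MDP to exercise the code.
P = np.array([[[1.0, 0.0], [0.2, 0.8]],
              [[0.0, 1.0], [0.9, 0.1]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0        # reward for landing in state 1
V, pi = value_iteration(P, R)
print("V* ~", V.round(3), "  greedy policy:", pi)
```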

In the simulation below, we visualize the process of value iteration in a grid world environment. The agent attempts to learn the optimal policy by iteratively updating the values of each state. The noise parameter represents the probability of the agent taking a suboptimal action instead of the best one.


Value Iteration Convergence

Value iteration demonstrates a key property: convergence. As iterations progress, the value function gradually stops changing, indicating convergence. This means that running the algorithm long enough will yield a stationary optimal policy and the infinite horizon value, even without needing to run infinite iterations. The speed of convergence is influenced by the discount factor $\gamma$. A lower discount factor (closer to 0) accelerates convergence, while a higher discount factor (closer to 1) slows it down.

Theorem

Value iteration converges. At convergence, we have found the optimal value function $V^*$ for the discounted infinite horizon problem, which satisfies the Bellman equations.

$$\forall s \in S: \quad V^*(s)=\max_a \sum_{s'} T(s,a,s')\big[R(s,a,s')+\gamma V^*(s')\big]$$

Running value iteration until convergence produces $V^*$, from which we can extract the optimal action using the Bellman equation or by storing the optimal action during the algorithm's execution. Importantly, the infinite horizon policy is stationary: the optimal action for any state $s$ remains the same over time. This efficiency in storage means we only need to store an action for each state, not for each time step, creating a convenient policy $\pi^*$ that prescribes the optimal action for each state.

$$\pi^*(s)=\underset{a\,\in\,A}{\operatorname{argmax}} \sum_{s'}T(s,a,s')\big[R(s,a,s')+\gamma V^*(s')\big]$$
Why does value iteration converge?

$V^*(s)$ represents the expected sum of rewards starting from state $s$ and acting optimally over infinite steps. Similarly, $V^*_H(s)$ is the expected sum of rewards starting from state $s$ and acting optimally over $H$ steps. The additional rewards collected beyond $H$ steps can be expressed as a geometric series, discounted by $\gamma$:

$$\begin{gather*} \gamma^{H+1}R(s_{H+1})+\gamma^{H+2}R(s_{H+2})+\cdots \le \gamma^{H+1}R_{\max} + \gamma^{H+2}R_{\max} + \cdots \\ = \frac{\gamma^{H+1}}{1-\gamma}R_{\max} \end{gather*}$$

where $R_{\max}$ is the maximum possible reward. As the horizon $H$ increases, $\gamma^{H+1}$ shrinks because $\gamma$ is less than 1. Consequently, the difference between the optimal value $V^*$ for the infinite horizon and the optimal value $V^*_H$ for the finite horizon $H$ decreases, becoming negligible as $H$ grows larger. This guarantees that value iteration converges, as the difference between $V^*$ and $V^*_H$ approaches zero.
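As a quick sanity check on this bound, the snippet below evaluates $\gamma^{H+1} R_{\max} / (1-\gamma)$ for a few horizons; the values of $\gamma$ and $R_{\max}$ are illustrative choices, not taken from the text.

```python
# Numeric check of the tail bound gamma^(H+1) * R_max / (1 - gamma).
gamma, R_max = 0.9, 1.0          # illustrative values
for H in (10, 50, 100, 200):
    bound = gamma ** (H + 1) / (1 - gamma) * R_max
    print(f"H = {H:3d}  ->  reward missed beyond the horizon is at most {bound:.6f}")
```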

Convergence through Contractions

Another way to understand the convergence of value iteration is through contractions. The key idea involves defining a norm, such as the max norm:

$$||U|| = \max_s |U(s)|$$

Where the max norm of a vector is the maximum absolute value among all its entries.

An update operation is a $\gamma$-contraction in the max norm if, for any two vectors $U_i$ and $V_i$, the updated vectors $U_{i+1}$ and $V_{i+1}$ are closer together by a factor of $\gamma$ compared to the original vectors $U_i$ and $V_i$.

$$||U_{i+1}-V_{i+1}|| \le \gamma\, ||U_i-V_i||$$

A theorem states that a contraction converges to a unique fixed point regardless of initialization. Intuitively, this means that no matter where you start, the vectors get pulled closer together over time, eventually converging to the same point.

Value iteration is a $\gamma$-contraction in the max norm. When applying a Bellman update to two different starting points, the resulting value functions are brought closer together. As a contraction, running the updates long enough ensures convergence to the optimal value function $V^*$.

Additionally, you can bound the error during the iteration process. The size of the update indicates how close you are to convergence. If the update is small, it means future changes will also be small, allowing you to estimate the error even with a finite number of iterations. This means you can stop value iteration once the updates are sufficiently small, indicating that you're close enough to the optimal value function.

$$||V_{i+1}-V_i|| \le \epsilon \ \Rightarrow\ ||V_{i+1}-V^*|| \lt \frac{2\epsilon\gamma}{1-\gamma}$$

Q-Value Iteration

Having explored the concept of value iteration for states, we now turn to another important concept in reinforcement learning: Q-values. This will be a foundation of many advanced algorithms we'll explore in the next posts.

A Q-value, $Q^*(s, a)$, represents the expected discounted sum of rewards starting from state $s$, taking action $a$, and thereafter acting optimally. Essentially, it's similar to a value function, but it considers the value of committing to a specific action in a given state.

The Bellman equation can also be applied to Q-values. The optimal Q-value for being in state $s$ and taking action $a$ can be expressed as:

$$Q^*(s, a) = \sum_{s'} P(s'| s,a) \big(R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\big)$$

Q-value iteration:

$$Q_{k+1}^*(s, a) \leftarrow \sum_{s'} P(s'| s,a) \big(R(s, a, s') + \gamma \max_{a'} Q_{k}^*(s', a')\big)$$
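This update translates almost directly into code. The sketch below mirrors the value-iteration example above; the two-state MDP is again a made-up example.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, n_iters=100):
    """Tabular Q-value iteration:
    Q_{k+1}(s,a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * max_a' Q_k(s',a'))."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # target[s, a, s'] = R(s,a,s') + gamma * max_a' Q_k(s', a')
        target = R + gamma * Q.max(axis=1)[None, None, :]
        Q = np.einsum("sap,sap->sa", P, target)
    return Q

# Same made-up 2-state MDP as in the value-iteration sketch.
P = np.array([[[1.0, 0.0], [0.2, 0.8]],
              [[0.0, 1.0], [0.9, 0.1]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0
Q = q_value_iteration(P, R)
print("Q* ~", Q.round(3), "  greedy policy:", Q.argmax(axis=1))
```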

Policy Iteration

After covering value iteration, both based on state values (V-values) and action values (Q-values), we turn to another method in reinforcement learning: policy iteration. While value iteration is effective for solving small MDPs by looping through all states and actions, policy iteration offers an alternative approach that is particularly useful in different situations. Some advanced methods we'll explore later will be built on value iteration, while others will rely on policy iteration. Understanding both methods provides a solid foundation for tackling various types of problems.

Policy Evaluation

The first step in policy iteration is policy evaluation. Unlike value iteration, where we seek to maximize the expected reward by evaluating all possible actions, policy evaluation fixes the policy $\pi$. In this step, we evaluate the fixed policy by running value iteration-like updates without taking the maximum over actions. This process gradually gives us the value function $V^\pi(s)$ for the given policy $\pi$ across all states, assuming an infinite horizon.

The policy evaluation formula is:

$$V_{k+1}^\pi (s) \leftarrow \sum_{s'}\sum_a \pi(a|s)\, P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi_k(s') \right]$$

This equation calculates the value of each state by considering the probability of taking an action $a$ under the policy $\pi$, transitioning to a future state $s'$, and receiving a reward $R(s,a,s')$, while also accounting for the discounted future value $V^\pi_k(s')$. At convergence, this process gives us the value function for the specific policy $\pi$.
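Here is a minimal sketch of iterative policy evaluation for a fixed stochastic policy $\pi(a|s)$. The array layout follows the earlier sketches, and the uniform-random policy is just an illustrative choice.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, n_iters=200):
    """Evaluate a fixed stochastic policy pi[s, a] = pi(a|s):
    V_{k+1}(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma * V_k(s')]."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
        V = (pi * Q).sum(axis=1)          # average over actions under pi
    return V

# Made-up 2-state MDP and a uniform-random policy.
P = np.array([[[1.0, 0.0], [0.2, 0.8]],
              [[0.0, 1.0], [0.9, 0.1]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0
pi_uniform = np.full((2, 2), 0.5)
print("V^pi ~", policy_evaluation(P, R, pi_uniform).round(3))
```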

Policy Iteration Process

Policy iteration alternates between policy evaluation and policy improvement. Here's how it works:

  1. Policy Evaluation: Start with an initial policy $\pi_k$ and evaluate it using the formula below. This gives us the value function for the current policy across all states.
$$V_{i+1}^{\pi_k} (s) \leftarrow \sum_{s'} P(s'|s,\pi_k(s)) \left[ R(s,\pi_k(s),s') + \gamma V_i^{\pi_k}(s') \right]$$
  2. Policy Improvement: After evaluating the policy, use the value function to improve the policy. The improvement step involves looking one step ahead to find the action that maximizes the expected reward plus the discounted future value. This new action becomes the updated policy $\pi_{k+1}$ for each state $s$.
$$\pi_{k+1}(s) = \underset{a}{\operatorname{argmax}} \sum_{s'} P(s'|s,a) \left[R(s,a,s')+\gamma V^{\pi_k}(s')\right]$$
  3. Iteration: Repeat the process: evaluate the new policy $\pi_{k+1}$, then improve it again. This cycle continues until the policy converges to the optimal policy $\pi^*$.

The policy improvement step essentially performs a value iteration update but focuses on the best action at each state based on the current policy's evaluation. The updated policy for each state $s$ is determined by the action that maximizes the expected immediate reward and the future discounted value.
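Putting evaluation and improvement together gives a compact policy-iteration sketch. As before, the two-state MDP is a made-up example, and the fixed number of evaluation sweeps is an implementation choice rather than part of the algorithm's definition.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_iters=100):
    """Alternate policy evaluation (for a deterministic policy) and greedy improvement."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        # Policy evaluation: V^pi via repeated backups under the fixed policy.
        V = np.zeros(n_states)
        for _ in range(eval_iters):
            P_pi = P[np.arange(n_states), policy]   # P(s'|s, pi(s)), shape (S, S')
            R_pi = R[np.arange(n_states), policy]   # R(s, pi(s), s')
            V = (P_pi * (R_pi + gamma * V[None, :])).sum(axis=1)
        # Policy improvement: one-step lookahead with the evaluated V^pi.
        Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):      # converged: policy is stable
            return policy, V
        policy = new_policy

# Same made-up 2-state MDP as in the earlier sketches.
P = np.array([[[1.0, 0.0], [0.2, 0.8]],
              [[0.0, 1.0], [0.9, 0.1]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0
pi_star, V_pi = policy_iteration(P, R)
print("pi* =", pi_star, "  V^pi* ~", V_pi.round(3))
```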

Efficiency trade-offs

Policy iteration converges to the optimal policy $\pi^*$ through this iterative process.

Interestingly, it often converges in fewer outer iterations than value iteration. However, the trade-off is that within each outer iteration, multiple inner iterations (similar to value iteration) are performed during the policy evaluation phase. This means that although policy iteration may require fewer overall iterations, it involves more computational work within each iteration.

Theorem

Policy iteration is guaranteed to converge, and at convergence, the current policy and its value function are the optimal policy and the optimal value function.

Intuition behind convergence

To understand why policy iteration converges and guarantees an optimal policy, let's break down the process and build some intuition.

Convergence Guarantee

Policy iteration consists of two main steps: policy evaluation and policy improvement. The key to understanding why policy iteration converges lies in the fact that each iteration improves the policy. Here's why:

  1. Policy Improvement: At every step, policy iteration selects a new policy that is strictly better than the previous one. This improvement is achieved through a "one-step lookahead," where the algorithm evaluates all possible actions from the current state and chooses the one that maximizes the expected reward. This ensures that the new policy is at least as good as, and typically better than, the current one.

  2. Finite Number of Policies: In any given finite environment, there are only a limited number of possible policies. Since each iteration of policy improvement leads to a strictly better policy, the process cannot revisit a previous policy. This means that after a certain number of iterations—specifically, no more than the total number of possible policies—the algorithm must converge, as there are no more new policies to improve upon.

Why Convergence Leads to Optimality

Once policy iteration converges, the resulting policy is optimal. Here's the intuition:

  1. Optimality at Convergence: At convergence, the policy $\pi_{k+1}$ is identical to $\pi_k$ across all states. This means that the action prescribed by $\pi_k$ is already the one that maximizes the expected reward, considering both the immediate reward and the discounted future rewards. This condition satisfies the Bellman optimality equation.

  2. Satisfying the Bellman Equation: The Bellman equation is a fundamental principle in dynamic programming, which, when satisfied, indicates that the value function is optimal, denoted as $V^*$. In policy iteration, when the policy no longer changes, it implies that the Bellman equation holds true, confirming that the current policy is indeed optimal.

  3. Beyond One-Step Improvement: The one-step lookahead in policy iteration isn't just a single improvement step. When applied iteratively across all states, it results in a cumulative effect that optimizes the policy across the entire state space. The algorithm improves not only the immediate action but also how the policy performs in the long run.

Maximum Entropy Formulation

So far, we've covered two fundamental methods for solving Markov Decision Processes (MDPs): value iteration and policy iteration. These methods serve as the foundation for many of the algorithms we'll explore in future posts, particularly those designed to tackle larger-scale problems. However, it's important to introduce another formulation for solving or framing MDPs: the maximum entropy formulation.

What is Entropy?

Entropy, in the context of information theory, is a measure of uncertainty or randomness in a random variable $X$. It's defined as the expected amount of information (measured in bits) required to encode the variable. The mathematical expression for the entropy $\mathcal{H}(X)$ of a random variable $X$ with a probability distribution $P$ is:

$$\mathcal{H}(X) = \sum_i p(x_i) \log_2\frac{1}{p(x_i)} = - \sum_i p(x_i) \log_2 p(x_i)$$

This equation sums over all possible values $x_i$ that the random variable $X$ can take, weighted by the probability $p(x_i)$ of each value. The result gives us a measure of the average uncertainty associated with $X$.

To break it down further:

  • If a particular value of $X$ is very likely, it can be encoded with fewer bits because it occurs frequently.
  • Conversely, less likely values require more bits to encode.

The Concept of Maximum Entropy

The maximum entropy approach asks a different question: Instead of finding a single optimal policy, what if we could find a distribution over near-optimal solutions? This means rather than pinpointing just one "best" behavior, we aim to identify a distribution that assigns higher probabilities to good behaviors and lower probabilities to less effective ones. This approach provides several advantages:

  1. More Robust Policies: A distribution over possible behaviors leads to a more robust policy. If the optimal path becomes blocked or unavailable due to changes in the environment, the distribution allows the agent to fall back on other near-optimal behaviors, maintaining high performance despite the disruption.

  2. More Robust Learning: In larger-scale problems, where the solution process involves iterative learning with data collection, a maximum entropy approach can improve exploration. If a policy is too deterministic and highly optimized, it may collect less diverse data, potentially missing out on discovering better solutions. By introducing more variation in behavior, the maximum entropy approach encourages better exploration, which can lead to finding better optima as the policy improves over time.

Examples

To better understand the concept of entropy in the context of random variables, let's look at some examples.

Binary Random Variable

Consider a binary random variable $X$ that can take on two values, 0 or 1. The probability $P(X = 1)$ is a number between 0 and 1. The entropy of this distribution is highest when the probability is 0.5, which corresponds to the maximum uncertainty about what the value of $X$ will be. At this point, we have no preference for either outcome, meaning that both 0 and 1 are equally likely.


Histograms

Now, let's compare two histograms to understand how entropy works in practice. Imagine two different distributions represented by these histograms. The question is: which histogram has a higher entropy?

Explanation
  • Histogram on the Left: This histogram shows a more evenly distributed set of outcomes, meaning there's significant uncertainty about which value the random variable might take. This uncertainty results in higher entropy.

  • Histogram on the Right: In contrast, the histogram on the right shows a distribution where one value is overwhelmingly likely, leading to lower uncertainty and, therefore, lower entropy.

Show me the math
$$\begin{align*} \text{Graph 1:}\quad p(S) &= \{0.25, 0.25, 0.25, 0.125, 0.125\} \\ H &= 3 \times 0.25 \times \log_2 4 + 2 \times 0.125 \times \log_2 8 \\ &= 1.5 + 0.75 = 2.25 \\[6pt] \text{Graph 2:}\quad p(s) &= \{0.75, 0.0625, 0.0625, 0.0625, 0.0625\} \\ H &= 0.75 \times \log_2 \left(\tfrac{4}{3}\right) + 4 \times 0.0625 \times \log_2 16 \\ &\approx 0.3 + 1 = 1.3 \end{align*}$$
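The same numbers can be verified with a few lines of Python; the two probability vectors are simply the histogram values used in the calculation above.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits: H = -sum_i p_i * log2(p_i)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken to be 0
    return float(-(p * np.log2(p)).sum())

left  = [0.25, 0.25, 0.25, 0.125, 0.125]        # the more uniform histogram
right = [0.75, 0.0625, 0.0625, 0.0625, 0.0625]  # the peaked histogram
print(entropy(left))    # 2.25 bits
print(entropy(right))   # ~1.31 bits
```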

Maximum Entropy MDP

Now that we have a basic understanding of entropy, which measures the amount of uncertainty or variability in the outcome of a distribution, we can apply this concept to Markov Decision Processes (MDPs).

In a traditional MDP, the goal is to maximize the expected sum of rewards over time, which can be expressed as:

$$\max_\pi \mathbb{E}\bigg[ \sum_{t=0}^H r_t \bigg]$$

This objective focuses solely on accumulating as much reward as possible, possibly including a discount factor (not shown here for simplicity).

In contrast, a Maximum Entropy MDP adds an additional term to the objective function. Instead of just maximizing the sum of rewards, it also considers the entropy of the policy. The objective function becomes:

$$\max_\pi \mathbb{E}\bigg[ \sum_{t=0}^H r_t + \beta\, \mathcal{H}(\pi(\cdot|s_t)) \bigg]$$

Here, $\mathcal{H}(\pi(\cdot|s_t))$ represents the entropy of the policy at state $s_t$, and $\beta$ is a trade-off factor that balances the importance of maximizing reward against maintaining high entropy in the policy. This term encourages the policy to remain stochastic rather than deterministic. In a purely deterministic policy, the entropy would be zero, which is not desirable in this context.

This has several implications:

  1. Trade-Off Between Reward and Entropy: The factor $\beta$ determines how much emphasis is placed on maximizing entropy relative to maximizing reward. A higher $\beta$ prioritizes a more diverse range of actions, potentially at the cost of collecting less reward. Conversely, a lower $\beta$ (or $\beta = 0$) focuses solely on reward maximization, disregarding entropy.

  2. Performance vs. Robustness: At first glance, introducing a trade-off that potentially reduces reward might seem counterproductive. However, as discussed earlier, policies that consider entropy can lead to more robust learning and more robust policies. By keeping the policy stochastic for a longer period during training, the agent is likely to explore more effectively, leading to better overall performance in the long run. This exploration can prevent the policy from getting stuck in suboptimal solutions and improve trial-and-error learning.

  3. Adaptability: A policy that maximizes entropy is less rigid and more adaptable to changes in the environment. If the environment shifts, the policy's inherent flexibility allows it to switch to alternative actions more readily, rather than being locked into a single deterministic choice for each state.

Maximum Entropy for 1-Step Problem

Consider a one-step problem where we want to maximize the expected reward while also maintaining high entropy in the policy. The objective function for this problem is:

$$\max_\pi \mathbb{E}\left[ r(a) + \beta\, \mathcal{H}(\pi(\cdot|s)) \right]$$

Here, $\pi(a)$ is the policy, represented as a distribution over possible actions, and $\mathcal{H}(\pi(\cdot|s))$ is the entropy of the policy. The goal is to balance maximizing the reward $r(a)$ with maintaining high entropy, controlled by the trade-off parameter $\beta$.

Since $\pi(a)$ represents a probability distribution, its entries must sum to one. This constraint leads us to form a Lagrangian for this problem, which we then optimize by setting the derivatives with respect to $\pi(a)$ and $\lambda$ (the Lagrange multiplier) to zero.

Constrained Optimization and Lagrangian Recap

Constrained optimization involves maximizing an objective function subject to one or more constraints. In simple terms, you want to find the best possible solution while adhering to certain restrictions.

The general form of a constrained optimization problem is:

$$\begin{aligned} \text{max} \quad & f(x) \\ \text{s.t.} \quad & g(x) = 0 \end{aligned}$$

where $f(x)$ is the objective function we aim to maximize, and $g(x) = 0$ represents the constraints that limit our choices for $x$.

The Lagrange Multiplier

To solve such problems, we often use the method of Lagrange multipliers. This method introduces a new variable, $\lambda$, known as the Lagrange multiplier, and reformulates the problem into a max-min problem:

$$\max_{x} \min_{\lambda}\ \mathcal{L}(x, \lambda) = f(x) + \lambda g(x)$$

In this equation, $\mathcal{L}(x, \lambda)$ is the Lagrangian, which combines the original objective function $f(x)$ with the constraint $g(x)$ weighted by the Lagrange multiplier $\lambda$. This approach transforms the constrained optimization problem into a problem that can be solved by finding the optimal values of both $x$ and $\lambda$.

Gradients in Optimization

One of the key results in constrained optimization is that at the optimum, the gradient (or derivative) of the Lagrangian with respect to both $x$ and $\lambda$ must be zero. Mathematically, this is expressed as:

$$\begin{aligned} \frac{\partial \mathcal{L}(x, \lambda)}{\partial x} &= 0 \\ \frac{\partial \mathcal{L}(x, \lambda)}{\partial \lambda} &= 0 \end{aligned}$$

These conditions ensure that we have found a point where the objective function is maximized, given the constraints. In other words, the solution lies at a point where the trade-offs between maximizing the objective and satisfying the constraints are perfectly balanced.

Solving this optimization problem yields the optimal Max-Ent policy:

$$\pi(a) = \frac{1}{Z} \exp\left(\frac{1}{\beta} r(a)\right)$$

where $Z$ is a normalization constant ensuring that the probabilities sum to one:

$$Z = \sum_{a} \exp\left(\frac{1}{\beta} r(a)\right)$$
Policy Derivation

$$\begin{align*} \max_{\pi(a)} \quad& \mathbb{E}[r(a) + \beta \mathcal{H}(\pi(a))] \\ \max_{\pi(a)} \quad& \sum_a\pi(a)r(a) - \beta \sum_a\pi(a)\log\pi(a) \\ \max_{\pi(a)}\min_\lambda\ \mathcal{L}(\pi(a),\lambda) &= \sum_a \pi(a)r(a) - \beta \sum_a \pi(a) \log \pi(a) + \lambda \bigg(\sum_a\pi(a)-1 \bigg) \end{align*}$$

Setting the derivative with respect to $\pi(a)$ to zero:

$$\begin{gather*} \frac{\partial}{\partial\pi(a)} \mathcal{L}(\pi(a), \lambda) = 0 \\ \frac{\partial}{\partial\pi(a)} \bigg[\sum_a \pi(a)r(a) - \beta \sum_a \pi(a) \log \pi(a) + \lambda \bigg(\sum_a\pi(a)-1 \bigg)\bigg] = 0\\ r(a) - \beta \log \pi(a) - \beta + \lambda = 0 \\ \beta \log \pi(a) = r(a) - \beta + \lambda \\ \pi(a) = \exp\bigg[ \frac{1}{\beta}\big(r(a)-\beta+\lambda\big) \bigg] \end{gather*}$$

Setting the derivative with respect to $\lambda$ to zero recovers the normalization constraint:

$$\begin{gather*} \frac{\partial}{\partial\lambda} \mathcal{L}(\pi(a), \lambda) = 0 \\ \sum_a \pi(a) - 1 = 0 \end{gather*}$$

This solution has a few key features:

  • Exponential Reward Weighting: The probability of taking an action is proportional to the exponentiated reward associated with that action. Actions with higher rewards have higher probabilities, while those with lower or negative rewards have lower probabilities.

  • Role of Beta: The parameter $\beta$ controls the trade-off between reward maximization and entropy. As $\beta$ increases, the policy favors higher entropy, leading to a more uniform distribution over actions. Conversely, as $\beta$ decreases (approaching zero), the policy becomes more deterministic, heavily favoring actions with the highest rewards.

The optimal value under this Max-Ent formulation is given by the softmax of the rewards:

$$V = \beta \log \sum_{a} \exp\left( \frac{1}{\beta} r(a) \right)$$

This equation represents a softmax operation on the rewards, where the sharpness of the softmax depends on $\beta$:

  • High Beta: Leads to a soft softmax, with more uniform probabilities across actions.
  • Low Beta: Leads to a sharper softmax, closely approximating the standard maximum operation.
Value Derivation

We substitute $\pi(a)$ in the value equation:

$$\begin{align*} V &= \sum_a \frac{1}{Z} \exp\left(\tfrac{1}{\beta} r(a)\right) r(a) - \beta \sum_a \frac{1}{Z} \exp\left(\tfrac{1}{\beta} r(a)\right) \underbrace{\log \bigg( \frac{1}{Z} \exp\left(\tfrac{1}{\beta} r(a)\right) \bigg)}_{\log\left(\exp\left(\tfrac{1}{\beta} r(a)\right)\right) + \log\frac{1}{Z}} \\ &= \sum_a \frac{1}{Z} \exp\left(\tfrac{1}{\beta} r(a)\right) \bigg(r(a) - \underbrace{\beta \log \bigg( \exp\left(\tfrac{1}{\beta} r(a)\right) \bigg)}_{r(a)} \bigg) - \beta \sum_a \frac{1}{Z} \exp\left(\tfrac{1}{\beta} r(a)\right) \log \frac{1}{Z}\\ &= 0 - \beta \log\bigg( \frac{1}{Z}\bigg) \underbrace{\sum_a \frac{1}{Z} \exp \bigg( \tfrac{1}{\beta} r(a)\bigg)}_{1} \\ &= -\beta \log \frac{1}{Z} \\ &= \beta \log \sum_a \exp\left(\tfrac{1}{\beta}r(a)\right) \Rightarrow \text{softmax} \end{align*}$$
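Below is a short numerical sketch of this closed-form solution; the reward vector and the two values of $\beta$ are arbitrary choices meant only to show how $\beta$ controls the sharpness of the softmax.

```python
import numpy as np

def maxent_policy_and_value(r, beta):
    """Closed-form 1-step Max-Ent solution:
    pi(a) = exp(r(a)/beta) / Z,   V = beta * log(Z),   Z = sum_a exp(r(a)/beta)."""
    z = r / beta
    m = z.max()                             # subtract the max for numerical stability
    exp_z = np.exp(z - m)
    pi = exp_z / exp_z.sum()
    V = beta * (m + np.log(exp_z.sum()))    # equals beta * log sum_a exp(r(a)/beta)
    return pi, V

r = np.array([1.0, 0.5, 0.0])               # arbitrary per-action rewards
for beta in (0.1, 2.0):
    pi, V = maxent_policy_and_value(r, beta)
    print(f"beta={beta}: pi={pi.round(3)}, V={V:.3f}")
```

With the small $\beta$ the policy is nearly deterministic on the highest-reward action; with the large $\beta$ it spreads probability mass much more evenly, as described above.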

Maximum Entropy Value Iteration

Now, let's consider how this approach can be applied to multi-step MDPs. In multi-step settings, our goal is to maximize not just the immediate reward and entropy but also the cumulative future rewards. This leads us to the Max-Ent Bellman equation, which is a modification of the traditional Bellman equation.

Given the Max-Ent formulation and general objective:

$$\max_\pi \mathbb{E} \bigg[ \sum_{t=0}^H r_t + \beta\, \mathcal{H}(\pi(\cdot|s_t))\bigg]$$

The value with $k$ steps to go is:

$$V_k(s) = \max_\pi \mathbb{E} \bigg[ \sum^H_{t=H-k} r(s_t, a_t) + \beta\, \mathcal{H}(\pi(a_t|s_t)) \bigg]$$

The Max-Ent Bellman equation is formulated as follows:

$$\begin{align*} V_k(s) &= \max_\pi \mathbb{E} \big[ r(s,a) + \beta\, \mathcal{H}(\pi(a|s)) + V_{k-1}(s') \big] \\ &= \max_\pi \mathbb{E} \big[ Q_k(s,a) + \beta\, \mathcal{H}(\pi(a|s)) \big] \end{align*}$$

where the Q-values are defined by:

$$Q_k(s, a) = \mathbb{E}\big[r(s, a) + V_{k-1}(s')\big]$$

Therefore, we can write the new Bellman update in the same form as the 1-step problem, with the Q-value taking the place of the reward:

$$\begin{gather*} V_k(s) = \beta \log \sum_a \exp \Big( \frac{1}{\beta} Q_k(s,a) \Big) \\ \pi_k(a|s) = \frac{1}{Z} \exp \Big( \frac{1}{\beta} Q_k(s,a) \Big) \end{gather*}$$

Here:

  • $V_k(s)$ represents the value function at step $k$.
  • $Q_k(s, a)$ is the expected reward for taking action $a$ in state $s$, plus the discounted future value.
  • $\pi_k(a|s)$, the policy resulting from this value iteration process, is also a softmax distribution over the Q-values.
  • The softmax operation is applied to the Q-values, where the sharpness of the softmax is controlled by the temperature parameter $\beta$ (a code sketch follows below).

Recap

In this first post, we covered several foundational topics:

  • Motivation: We explored why deep reinforcement learning (deep RL) is an exciting and impactful area of study.
  • Markov Decision Processes (MDPs): We introduced MDPs as the formal framework underlying RL methods.
  • Exact Solution Methods: We delved into two exact solution methods for traditional MDPs: value iteration and policy iteration.
  • Maximum Entropy Formulation: We introduced the Max-Ent formulation, which modifies traditional value iteration and policy iteration by incorporating entropy, leading to stochastic policies.

The methods we've discussed so far are well-suited for small MDPs where it's feasible to loop over all states and actions and perform Bellman updates, whether through hard maxes or Softmaxes. However, most practical MDPs involve a vast number of states, making it impractical to iterate over all possible states and actions directly. The subsequent posts in this series will focus on how to address this challenge, building on the foundations we've already established.