
Regularization


In this chapter, you'll learn about:

  • Overfitting and Hypothesis Class Complexity: Understanding how the size of the hypothesis class affects bias and variance.
  • Regularization Techniques: Introducing methods to restrict the hypothesis class to improve generalization.
  • Norms and Their Role in Regularization: Exploring different norms (L1 and L2) and their geometric interpretations.
  • L1 vs. L2 Regularization: Comparing the effects of L1 and L2 regularization on model weights and sparsity.
  • Constraint Optimization and Duality: Understanding how constrained optimization problems relate to regularization.
  • Practical Implementation: Learning how to incorporate regularization into optimization problems.

In the previous chapter, we discussed the bias-variance tradeoff in the context of linear regression. We observed that:

  • High model complexity (a large hypothesis class) can lead to low bias but high variance.
  • Low model complexity (a restricted hypothesis class) can lead to high bias but low variance.

Our goal is to find a balance between bias and variance to achieve good generalization performance. One way to control model complexity is through regularization, which involves adding constraints or penalties to the optimization problem to restrict the hypothesis class.

The Need for Regularization

Overfitting and Hypothesis Class Size

When the hypothesis class is too large (e.g., high-degree polynomials or models with many features), the model can:

  • Fit the training data very closely, including noise.
  • Exhibit high variance and poor generalization to unseen data.

Conversely, if the hypothesis class is too small, the model may:

  • Be unable to capture the underlying patterns in the data.
  • Exhibit high bias and underfit the data.

Balancing Bias and Variance

The bias-variance tradeoff can be visualized as:

Figure: Bias-variance tradeoff.

  • X-axis: Model complexity (size of the hypothesis class).
  • Y-axis: Error components (bias and variance).
  • Optimal Point: The sweet spot where the total error (bias plus variance) is minimized.

Regularization helps us navigate this tradeoff by controlling model complexity.

Regularization Techniques

Restricting the Hypothesis Class

Consider the following attempts to control model complexity:

Attempt 1: Include All Features (Unrestricted Model)

  • Hypothesis: $h(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$
  • Characteristics:
    • Uses all available features.
    • High model complexity.
    • Prone to overfitting when $d$ is large relative to the number of samples.

Attempt 2: Manual Feature Selection

  • Hypothesis: $h(\mathbf{x}) = w_0 + w_{i_1} x_{i_1} + w_{i_2} x_{i_2} + \dots + w_{i_n} x_{i_n}$
  • Characteristics:
    • Includes only a subset of features believed to be relevant.
    • Reduces model complexity.
    • May exclude useful information and still overfit if $n$ is large.

Attempt 3: Include All Features with Regularization

  • Hypothesis: $h(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$
  • Constraint: $\| \mathbf{w} \| \leq c$
  • Characteristics:
    • Includes all features.
    • Restricts the magnitude of the weights.
    • Controls model complexity without excluding features.

Regularization as a Constraint

We aim to minimize the loss function subject to a constraint on the weights:

General Form

  • Objective: $\min_{\mathbf{w}} \; J(\mathbf{w}) \quad \text{subject to} \quad \| \mathbf{w} \| \leq c$
  • $J(\mathbf{w})$: Loss function (e.g., mean squared error).
  • $\| \mathbf{w} \|$: Norm of the weight vector.

Common Norms

  • L2 Norm (Euclidean Norm): $\| \mathbf{w} \|_2 = \left( \sum_{i=1}^{d} w_i^2 \right)^{1/2}$
  • L1 Norm (Manhattan Norm): $\| \mathbf{w} \|_1 = \sum_{i=1}^{d} |w_i|$
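As a quick illustration, both norms are one-liners in NumPy (assumed available); the example vector is arbitrary:

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0, 1.0])

# L2 norm: square root of the sum of squared entries.
l2 = np.sqrt(np.sum(w ** 2))   # equivalently np.linalg.norm(w, ord=2)

# L1 norm: sum of the absolute values of the entries.
l1 = np.sum(np.abs(w))         # equivalently np.linalg.norm(w, ord=1)

print(l2)  # sqrt(9 + 16 + 0 + 1) = sqrt(26)
print(l1)  # 3 + 4 + 0 + 1 = 8.0
```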

Understanding Norms and Their Geometric Interpretation

Unit Balls in Different Norms

  • L2 Norm Unit Ball (Circle in 2D):
    • All points $\mathbf{w}$ satisfying $\| \mathbf{w} \|_2 = 1$ form a circle in 2D.
    • Smooth boundary.
  • L1 Norm Unit Ball (Diamond in 2D):
    • All points $\mathbf{w}$ satisfying $\| \mathbf{w} \|_1 = 1$ form a diamond shape.
    • Corners align with the axes.

Figure: Norm unit balls.

  • Blue Circle: L2 norm unit ball.
  • Red Diamond: L1 norm unit ball.

Effect on Optimization

When we impose a norm constraint on the weights, the feasible set of solutions is restricted to within the unit ball (scaled by $c$) of the chosen norm.

  • L2 Regularization:
    • Encourages small weights overall.
    • Does not inherently promote sparsity.
  • L1 Regularization:
    • Encourages sparsity in the weights.
    • Tends to produce models where many weights are exactly zero.
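This difference is easy to observe empirically. The sketch below, assuming scikit-learn is available, fits ridge (L2) and lasso (L1) models to synthetic data in which only three of twenty features matter; the constants and variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 samples, 20 features, only the first 3 relevant.
rng = np.random.default_rng(0)
M, d = 100, 20
X = rng.normal(size=(M, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -3.0, 1.5]
t = X @ true_w + 0.1 * rng.normal(size=M)

# scikit-learn calls the regularization strength `alpha` rather than lambda.
ridge = Ridge(alpha=1.0).fit(X, t)
lasso = Lasso(alpha=0.1).fit(X, t)

ridge_zeros = int(np.sum(np.abs(ridge.coef_) < 1e-8))
lasso_zeros = int(np.sum(np.abs(lasso.coef_) < 1e-8))
print(f"weights at exactly zero - ridge: {ridge_zeros}, lasso: {lasso_zeros}")
```

Ridge shrinks all weights but leaves them non-zero, while lasso drives most of the irrelevant weights exactly to zero.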

L1 vs. L2 Regularization

Graphical Illustration

Consider contour lines of the loss function (ellipses) and the constraint region (unit ball).

L2 Regularization

  • Constraint Region: Circle (in 2D).
  • Optimal Solution: Lies on the smooth boundary of the circle when the constraint is active, generally not on an axis.
  • Weights: Generally small but non-zero.

Figure: L2 regularization.

L1 Regularization

  • Constraint Region: Diamond (in 2D) with corners on the axes.
  • Optimal Solution: Often occurs at a corner of the diamond.
  • Weights: Many are exactly zero (sparse solution).

Figure: L1 regularization.

Why L1 Regularization Leads to Sparsity

  • Geometric Explanation:
    • The corners of the L1 norm unit ball align with the axes, where one or more weights are zero.
    • The optimal solution is likely to occur at these corners due to the shape of the constraint region intersecting with the loss contours.
  • Analogy:
    • Smooth Object (L2): Touching a smooth surface (contour lines) at arbitrary points.
    • Sharp Object (L1): Touching a surface most likely at the sharp edges or corners (axes), leading to zeros in weights.

Sparsity and Feature Selection

  • Sparse Solution: Many weights are exactly zero.
  • Advantages:
    • Implicit feature selection.
    • Reduces model complexity.
    • Can handle high-dimensional data with limited samples.

Incorporating Regularization into Optimization

From Constrained to Unconstrained Optimization

Using Lagrange multipliers, we can convert the constrained optimization problem into an unconstrained one with a penalty term.

Original Constrained Problem

  • Objective: $\min_{\mathbf{w}} \; J(\mathbf{w}) \quad \text{subject to} \quad \| \mathbf{w} \| \leq c$

Unconstrained Problem with Penalty

  • Objective: $\min_{\mathbf{w}} \; J(\mathbf{w}) + \lambda \| \mathbf{w} \|$
  • $\lambda$: Regularization parameter (Lagrange multiplier).
  • The norm $\| \mathbf{w} \|$ can be either L1 or L2.

Relationship Between $c$ and $\lambda$

  • Tradeoff Parameter:
    • A larger $\lambda$ corresponds to a stricter penalty on large weights.
    • The value of $\lambda$ controls the strength of the regularization.
  • Equivalence:
    • For certain values of $\lambda$, the unconstrained problem is equivalent to the original constrained problem.
    • The exact relationship between $c$ and $\lambda$ depends on the specifics of the problem.

Example: Regularized Mean Squared Error

L2 Regularization (Ridge Regression)

  • Objective: $\min_{\mathbf{w}} \; \frac{1}{2M} \sum_{m=1}^{M} \left( y^{(m)} - t^{(m)} \right)^2 + \lambda \| \mathbf{w} \|_2^2$
  • Characteristics:
    • Penalizes large weights.
    • Does not encourage sparsity.

L1 Regularization (Lasso Regression)

  • Objective: $\min_{\mathbf{w}} \; \frac{1}{2M} \sum_{m=1}^{M} \left( y^{(m)} - t^{(m)} \right)^2 + \lambda \| \mathbf{w} \|_1$
  • Characteristics:
    • Penalizes the absolute value of weights.
    • Encourages sparsity.
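Both objectives translate directly into code. A minimal NumPy sketch (the function names are illustrative):

```python
import numpy as np

def ridge_loss(w, X, t, lam):
    """Mean squared error plus an L2 penalty: (1/2M) Σ (y - t)² + λ ||w||₂²."""
    M = len(t)
    residuals = X @ w - t
    return (residuals @ residuals) / (2 * M) + lam * np.sum(w ** 2)

def lasso_loss(w, X, t, lam):
    """Mean squared error plus an L1 penalty: (1/2M) Σ (y - t)² + λ ||w||₁."""
    M = len(t)
    residuals = X @ w - t
    return (residuals @ residuals) / (2 * M) + lam * np.sum(np.abs(w))

X = np.array([[1.0, 0.0], [0.0, 1.0]])
t = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])  # fits the data exactly, so only the penalty remains

print(ridge_loss(w, X, t, 0.1))  # 0.1 * (1 + 4) = 0.5
print(lasso_loss(w, X, t, 0.1))  # 0.1 * (1 + 2) ≈ 0.3
```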

Constraint Optimization and Duality

Lagrangian and Dual Problem

To solve the constrained optimization problem, we introduce the Lagrangian:

Lagrangian Function

  • Definition: $\mathcal{L}(\mathbf{w}, \lambda) = J(\mathbf{w}) + \lambda \left( \| \mathbf{w} \| - c \right)$
  • $\lambda \geq 0$: Lagrange multiplier.

Dual Function

  • Definition: $g(\lambda) = \inf_{\mathbf{w}} \mathcal{L}(\mathbf{w}, \lambda)$
  • Dual Problem: $\sup_{\lambda \geq 0} g(\lambda)$

Weak and Strong Duality

  • Weak Duality:
    • The optimal value of the dual problem $d^*$ is a lower bound on the optimal value of the primal problem $p^*$: $d^* \leq p^*$.
  • Strong Duality:
    • Under certain conditions (e.g., convexity and Slater's condition), the optimal values are equal: $d^* = p^*$.

Slater's Condition

  • Definition:
    • If the optimization problem is convex and there exists a strictly feasible point (satisfying $\| \mathbf{w} \| < c$), then strong duality holds.
  • Implication:
    • Allows us to solve the primal problem by solving the dual problem.

Practical Implementation

Choosing the Regularization Parameter $\lambda$

  • Hyperparameter Tuning:
    • $\lambda$ is a hyperparameter that needs to be selected carefully.
    • Common methods include cross-validation or validation on a hold-out dataset.
  • Effect of $\lambda$:
    • Large $\lambda$: Strong regularization; may lead to underfitting.
    • Small $\lambda$: Weak regularization; may not sufficiently control overfitting.
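A simple way to tune $\lambda$ is to score several candidate values by cross-validation and keep the best, as sketched below with scikit-learn (assumed available; the candidate grid and synthetic data are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data with mostly-zero true weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 30))
w_true = rng.normal(size=30) * (rng.random(30) < 0.2)
t = X @ w_true + 0.5 * rng.normal(size=60)

# Score each candidate regularization strength with 5-fold cross-validation.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cross_val_score(Ridge(alpha=lam), X, t, cv=5,
                               scoring="neg_mean_squared_error").mean()
          for lam in candidates}
best = max(scores, key=scores.get)
print("best lambda:", best)
```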

Solving the Regularized Optimization Problem

  • Closed-form Solutions:
    • For L2 regularization (ridge regression), a closed-form solution exists: $\hat{\mathbf{w}} = \left( \mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I} \right)^{-1} \mathbf{X}^\top \mathbf{t}$
  • Gradient-based Methods:
    • For L1 regularization (lasso regression), we typically use iterative optimization algorithms (e.g., coordinate descent, subgradient methods).
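The closed-form ridge solution is a few lines of NumPy; the helper name below is illustrative, and `np.linalg.solve` is used instead of forming the inverse explicitly:

```python
import numpy as np

def ridge_closed_form(X, t, lam):
    """Solve (XᵀX + λI) w = Xᵀt for the ridge weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
t = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.normal(size=50)

w_small = ridge_closed_form(X, t, 1e-6)  # nearly ordinary least squares
w_large = ridge_closed_form(X, t, 1e6)   # shrinks the weights toward zero

print(np.linalg.norm(w_small) > np.linalg.norm(w_large))  # larger λ → smaller weights
```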

Conclusion

Regularization is a powerful technique to control model complexity and improve generalization performance by:

  • Restricting the hypothesis class through constraints on the weights.
  • Incorporating penalties into the loss function to discourage large weights.
  • Encouraging sparsity in the model parameters, particularly with L1 regularization.

By understanding the geometric interpretations of different norms and the principles of constrained optimization, we can make informed choices about regularization techniques and effectively implement them in machine learning models.

Recap

In this chapter, we've covered:

  • The Need for Regularization: Recognized the importance of controlling model complexity to balance bias and variance.
  • Regularization Techniques: Explored methods to restrict the hypothesis class by imposing constraints on the weights.
  • Norms and Geometric Interpretation: Understood how L1 and L2 norms shape the feasible set of solutions and their impact on the optimization problem.
  • L1 vs. L2 Regularization: Compared the effects of different regularization terms on model sparsity and generalization.
  • Constraint Optimization and Duality: Learned how to convert constrained optimization problems into unconstrained ones using Lagrange multipliers and the concept of duality.
  • Practical Implementation: Discussed how to choose regularization parameters and solve regularized optimization problems.