Regularization
In this chapter, you'll learn about:
- Overfitting and Hypothesis Class Complexity: Understanding how the size of the hypothesis class affects bias and variance.
- Regularization Techniques: Introducing methods to restrict the hypothesis class to improve generalization.
- Norms and Their Role in Regularization: Exploring different norms (L1 and L2) and their geometric interpretations.
- L1 vs. L2 Regularization: Comparing the effects of L1 and L2 regularization on model weights and sparsity.
- Constraint Optimization and Duality: Understanding how constrained optimization problems relate to regularization.
- Practical Implementation: Learning how to incorporate regularization into optimization problems.
In the previous chapter, we discussed the bias-variance tradeoff in the context of linear regression. We observed that:
- High model complexity (a large hypothesis class) can lead to low bias but high variance.
- Low model complexity (a restricted hypothesis class) can lead to high bias but low variance.
Our goal is to find a balance between bias and variance to achieve good generalization performance. One way to control model complexity is through regularization, which involves adding constraints or penalties to the optimization problem to restrict the hypothesis class.
The Need for Regularization
Overfitting and Hypothesis Class Size
When the hypothesis class is too large (e.g., high-degree polynomials or models with many features), the model can:
- Fit the training data very closely, including noise.
- Exhibit high variance and poor generalization to unseen data.
Conversely, if the hypothesis class is too small, the model may:
- Be unable to capture the underlying patterns in the data.
- Exhibit high bias and underfit the data.
Balancing Bias and Variance
The bias-variance tradeoff can be visualized as:
- X-axis: Model complexity (size of the hypothesis class).
- Y-axis: Error components (bias and variance).
- Optimal Point: The sweet spot where the total error (the combined contribution of bias and variance) is minimized.
Regularization helps us navigate this tradeoff by controlling model complexity.
Regularization Techniques
Restricting the Hypothesis Class
Consider the following attempts to control model complexity:
Attempt 1: Include All Features (Unrestricted Model)
- Hypothesis: $f_w(x) = w^\top x = \sum_{j=1}^{d} w_j x_j$, a linear model over all $d$ features.
- Characteristics:
- Uses all available features.
- High model complexity.
- Prone to overfitting when $d$ is large relative to the number of samples $n$.
Attempt 2: Manual Feature Selection
- Hypothesis: $f_w(x) = \sum_{j \in S} w_j x_j$, where $S \subseteq \{1, \dots, d\}$ is a hand-picked subset of features.
- Characteristics:
- Includes only a subset of features believed to be relevant.
- Reduces model complexity.
- May exclude useful information, and may still overfit if the selected subset $S$ is large.
Attempt 3: Include All Features with Regularization
- Hypothesis: $f_w(x) = w^\top x$, using all $d$ features.
- Constraint: $\|w\| \le B$ for some budget $B > 0$.
- Characteristics:
- Includes all features.
- Restricts the magnitude of the weights.
- Controls model complexity without excluding features.
Regularization as a Constraint
We aim to minimize the loss function subject to a constraint on the weights:
General Form
- Objective: $\min_w \; L(w) \quad \text{subject to} \quad \|w\| \le B$
- $L(w)$: Loss function (e.g., mean squared error).
- $\|w\|$: Norm of the weight vector.
Common Norms
- L2 Norm (Euclidean Norm): $\|w\|_2 = \sqrt{\sum_{j=1}^{d} w_j^2}$
- L1 Norm (Manhattan Norm): $\|w\|_1 = \sum_{j=1}^{d} |w_j|$
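As a quick numerical check, both norms can be computed directly; the following sketch (NumPy assumed, with an arbitrary example vector) verifies the formulas against NumPy's built-in `norm`:

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0, 1.0])  # arbitrary example weight vector

l2 = np.sqrt(np.sum(w ** 2))   # L2 norm: square root of the sum of squares
l1 = np.sum(np.abs(w))         # L1 norm: sum of absolute values

# Cross-check against NumPy's built-in implementation
assert np.isclose(l2, np.linalg.norm(w, ord=2))
assert np.isclose(l1, np.linalg.norm(w, ord=1))

print(l2, l1)  # approximately 5.10, and exactly 8.0
```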
Understanding Norms and Their Geometric Interpretation
Unit Balls in Different Norms
- L2 Norm Unit Ball (Circle in 2D):
- All points satisfying $\|w\|_2 \le 1$ form a disk whose boundary is a circle in 2D.
- Smooth boundary.
- L1 Norm Unit Ball (Diamond in 2D):
- All points satisfying $\|w\|_1 \le 1$ form a diamond shape.
- Corners align with the axes.
- In the standard illustration of these two sets, a blue circle marks the L2 norm unit ball and a red diamond marks the L1 norm unit ball, as in the plotting sketch below.
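That illustration can be reproduced with a short plotting sketch (matplotlib assumed; the colors follow the description above):

```python
import numpy as np
import matplotlib.pyplot as plt

# L2 unit ball boundary: all w with ||w||_2 = 1 (a circle)
theta = np.linspace(0, 2 * np.pi, 200)
plt.plot(np.cos(theta), np.sin(theta), color="blue", label="L2 unit ball")

# L1 unit ball boundary: all w with ||w||_1 = 1 (a diamond with corners on the axes)
diamond = np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [1, 0]])
plt.plot(diamond[:, 0], diamond[:, 1], color="red", label="L1 unit ball")

plt.gca().set_aspect("equal")
plt.axhline(0, color="gray", lw=0.5)
plt.axvline(0, color="gray", lw=0.5)
plt.legend()
plt.show()
```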
Effect on Optimization
When we impose a norm constraint on the weights, the feasible set of solutions is restricted to the unit ball of the chosen norm scaled by the budget $B$.
- L2 Regularization:
- Encourages small weights overall.
- Does not inherently promote sparsity.
- L1 Regularization:
- Encourages sparsity in the weights.
- Tends to produce models where many weights are exactly zero.
L1 vs. L2 Regularization
Graphical Illustration
Consider contour lines of the loss function (ellipses) and the constraint region (unit ball).
L2 Regularization
- Constraint Region: Circle (in 2D).
- Optimal Solution: Typically lies where the loss contours first touch the smooth boundary of the circle; this point is generally not on an axis.
- Weights: Generally small but non-zero.
L1 Regularization
- Constraint Region: Diamond (in 2D) with corners on the axes.
- Optimal Solution: Often occurs at a corner of the diamond.
- Weights: Many are exactly zero (sparse solution).
Why L1 Regularization Leads to Sparsity
- Geometric Explanation:
- The corners of the L1 norm unit ball align with the axes, where one or more weights are zero.
- The optimal solution is likely to occur at these corners due to the shape of the constraint region intersecting with the loss contours.
- Analogy:
- Smooth Object (L2): Touching a smooth surface (contour lines) at arbitrary points.
- Sharp Object (L1): Touching a surface most likely at the sharp edges or corners (axes), leading to zeros in weights.
Sparsity and Feature Selection
- Sparse Solution: Many weights are exactly zero.
- Advantages:
- Implicit feature selection.
- Reduces model complexity.
- Can handle high-dimensional data with limited samples.
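To see the sparsity effect numerically, one possible sketch (scikit-learn on synthetic data; the regularization strengths `alpha` are arbitrary choices) is:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -3.0, 1.5]               # only the first 3 features matter
y = X @ true_w + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)          # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)          # L1 penalty

# Ridge weights are shrunk but (almost) never exactly zero;
# Lasso typically zeroes out most of the 17 irrelevant weights.
print("Ridge zero weights:", np.sum(ridge.coef_ == 0))
print("Lasso zero weights:", np.sum(lasso.coef_ == 0))
```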
Incorporating Regularization into Optimization
From Constrained to Unconstrained Optimization
Using Lagrange multipliers, we can convert the constrained optimization problem into an unconstrained one with a penalty term.
Original Constrained Problem
- Objective: $\min_w \; L(w) \quad \text{subject to} \quad \|w\| \le B$
Unconstrained Problem with Penalty
- Objective: $\min_w \; L(w) + \lambda \|w\|$
- $\lambda \ge 0$: Regularization parameter (Lagrange multiplier).
- The norm can be either L1 or L2.
Relationship Between $\lambda$ and $B$
- Tradeoff Parameter:
- A larger $\lambda$ corresponds to a stricter penalty on large weights, i.e., a smaller effective budget $B$.
- The value of $\lambda$ controls the strength of the regularization.
- Equivalence:
- For certain values of $\lambda$, the unconstrained problem is equivalent to the original constrained problem.
- The exact relationship between $\lambda$ and $B$ depends on the specifics of the problem.
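A small numerical illustration of this correspondence (a sketch using the closed-form ridge solution derived later in this chapter, on synthetic data): as $\lambda$ grows, the norm of the minimizer shrinks, which is exactly what a tighter budget $B$ would enforce.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    # Minimizer of the averaged squared loss plus lam * ||w||_2^2
    w = np.linalg.solve(X.T @ X + lam * len(y) * np.eye(X.shape[1]), X.T @ y)
    print(f"lambda = {lam:6.1f}   ||w||_2 = {np.linalg.norm(w):.3f}")
```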
Example: Regularized Mean Squared Error
L2 Regularization (Ridge Regression)
- Objective: $\min_w \; \frac{1}{n} \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2 + \lambda \|w\|_2^2$
- Characteristics:
- Penalizes large weights.
- Does not encourage sparsity.
L1 Regularization (Lasso Regression)
- Objective: $\min_w \; \frac{1}{n} \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2 + \lambda \|w\|_1$
- Characteristics:
- Penalizes the absolute value of weights.
- Encourages sparsity.
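Written out as code, the two objectives above might look like the following sketch (NumPy assumed; `lam` plays the role of $\lambda$):

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model w on data (X, y)."""
    residuals = y - X @ w
    return np.mean(residuals ** 2)

def ridge_objective(w, X, y, lam):
    # MSE plus an L2 (squared-norm) penalty on the weights
    return mse(w, X, y) + lam * np.sum(w ** 2)

def lasso_objective(w, X, y, lam):
    # MSE plus an L1 (absolute-value) penalty on the weights
    return mse(w, X, y) + lam * np.sum(np.abs(w))
```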
Constraint Optimization and Duality
Lagrangian and Dual Problem
To solve the constrained optimization problem, we introduce the Lagrangian:
Lagrangian Function
- Definition: $\mathcal{L}(w, \lambda) = L(w) + \lambda \big(\|w\| - B\big)$
- $\lambda \ge 0$: Lagrange multiplier.
Dual Function
- Definition: $g(\lambda) = \min_w \; \mathcal{L}(w, \lambda)$
- Dual Problem: $\max_{\lambda \ge 0} \; g(\lambda)$
Weak and Strong Duality
- Weak Duality:
- The optimal value $d^*$ of the dual problem is a lower bound on the optimal value $p^*$ of the primal problem: $d^* \le p^*$.
- Strong Duality:
- Under certain conditions (e.g., convexity and Slater's condition), the optimal values are equal: $d^* = p^*$.
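Why weak duality holds follows in one line from the definitions (a sketch of the argument): for any feasible $w$ (so $\|w\| \le B$) and any $\lambda \ge 0$,

$$
g(\lambda) \;=\; \min_{w'} \Big[ L(w') + \lambda\big(\|w'\| - B\big) \Big]
\;\le\; L(w) + \lambda\big(\|w\| - B\big)
\;\le\; L(w).
$$

Taking the maximum over $\lambda \ge 0$ on the left and the minimum over feasible $w$ on the right gives $d^* \le p^*$.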
Slater's Condition
- Definition:
- If the optimization problem is convex and there exists a strictly feasible point (one satisfying $\|w\| < B$), then strong duality holds.
- Implication:
- Allows us to solve the primal problem by solving the dual problem.
Practical Implementation
Choosing the Regularization Parameter
- Hyperparameter Tuning:
- $\lambda$ is a hyperparameter that needs to be selected carefully.
- Common methods include cross-validation or validation on a hold-out dataset.
- Effect of $\lambda$:
- Large $\lambda$: Strong regularization; may lead to underfitting.
- Small $\lambda$: Weak regularization; may not sufficiently control overfitting.
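In practice, a common way to select $\lambda$ is a grid search with cross-validation; here is a minimal sketch (scikit-learn assumed; note that scikit-learn calls the regularization parameter `alpha`, and the data below are synthetic):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)

# Try a logarithmic grid of regularization strengths, scored by 5-fold CV
grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-4, 2, 13)},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print("Selected regularization strength:", grid.best_params_["alpha"])
```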
Solving the Regularized Optimization Problem
- Closed-form Solutions:
- For L2 regularization (ridge regression), a closed-form solution exists: setting the gradient of the averaged objective to zero gives $w^* = \big(X^\top X + n\lambda I\big)^{-1} X^\top y$, where $X$ is the design matrix and $y$ the vector of targets.
- Gradient-based Methods:
- For L1 regularization (lasso regression), we typically use iterative optimization algorithms (e.g., coordinate descent, subgradient methods).
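For reference, the ridge closed-form solution can be implemented directly; a minimal sketch with NumPy (no intercept term, for simplicity):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/n) * ||y - Xw||^2 + lam * ||w||^2 in closed form."""
    n, d = X.shape
    # Normal equations; the factor n appears because the squared errors
    # are averaged over the n samples in the objective above.
    A = X.T @ X + lam * n * np.eye(d)
    return np.linalg.solve(A, X.T @ y)
```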
Conclusion
Regularization is a powerful technique to control model complexity and improve generalization performance by:
- Restricting the hypothesis class through constraints on the weights.
- Incorporating penalties into the loss function to discourage large weights.
- Encouraging sparsity in the model parameters, particularly with L1 regularization.
By understanding the geometric interpretations of different norms and the principles of constrained optimization, we can make informed choices about regularization techniques and effectively implement them in machine learning models.
Recap
In this chapter, we've covered:
- The Need for Regularization: Recognized the importance of controlling model complexity to balance bias and variance.
- Regularization Techniques: Explored methods to restrict the hypothesis class by imposing constraints on the weights.
- Norms and Geometric Interpretation: Understood how L1 and L2 norms shape the feasible set of solutions and their impact on the optimization problem.
- L1 vs. L2 Regularization: Compared the effects of different regularization terms on model sparsity and generalization.
- Constraint Optimization and Duality: Learned how to convert constrained optimization problems into unconstrained ones using Lagrange multipliers and the concept of duality.
- Practical Implementation: Discussed how to choose regularization parameters and solve regularized optimization problems.