Gradient-Based Optimization


In this chapter, you'll be introduced to:

  • Gradient Descent Algorithm: Understanding how to use gradient descent for optimizing the mean squared error loss in linear regression.
  • Gradient Computation: How to compute gradients for the loss function efficiently.
  • Learning Rate Selection: Strategies for choosing appropriate learning rates and their impact on optimization.
  • Initialization and Termination Criteria: Best practices for starting and stopping the optimization process.
  • Second-Order Methods: An introduction to Newton's method and its comparison with gradient descent.
  • Mini-Batch and Stochastic Gradient Descent: Exploring variations of gradient descent suitable for large datasets.
  • Practical Considerations: Implementing adaptive learning rates, momentum, and addressing challenges in optimization.

In previous chapters, we discussed linear regression and derived a closed-form solution for minimizing the mean squared error (MSE) loss function. While closed-form solutions are precise and computationally straightforward for small problems, they become impractical for larger, more complex models due to computational inefficiency and numerical instability. Additionally, many advanced models, such as neural networks, do not have closed-form solutions.

This chapter focuses on gradient-based optimization methods, particularly gradient descent, which iteratively adjust model parameters to minimize the loss function. These methods are widely used in machine learning because they scale well with data size and model complexity.

Gradient Descent Algorithm

Basic Idea

Gradient descent is an optimization algorithm that finds the minimum of a function by iteratively moving in the direction of the steepest descent, as defined by the negative of the gradient.

  • Gradient: Represents the direction and rate of the steepest increase of the function.
  • Steepest Descent: Moving opposite to the gradient decreases the function's value most rapidly.

Mathematical Formulation

Given a differentiable loss function $J(\mathbf{w})$, the update rule for gradient descent is:

$$\mathbf{w}_{t} = \mathbf{w}_{t-1} - \alpha \nabla J(\mathbf{w}_{t-1})$$

  • $\mathbf{w}_{t}$: Parameter vector at iteration $t$.
  • $\alpha$: Learning rate (step size).
  • $\nabla J(\mathbf{w}_{t-1})$: Gradient of the loss function at $\mathbf{w}_{t-1}$.

Algorithm Steps

  1. Initialization:
    • Start with an initial guess $\mathbf{w}_0$, which can be zeros or small random values.
  2. Iteration:
    • For each iteration $t$:
      • Compute the gradient $\nabla J(\mathbf{w}_{t-1})$.
      • Update the parameters using the update rule.
  3. Termination:
    • Stop when a stopping criterion is met (e.g., a maximum number of iterations, minimal change in loss function, or gradient norm below a threshold).
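To make these steps concrete, here is a minimal NumPy sketch of batch gradient descent for the linear-regression MSE loss. The function names and the synthetic data are illustrative, not from a particular library:

```python
import numpy as np

def mse_loss(X, y, w):
    """J(w) = (1/n) * ||Xw - y||^2."""
    r = X @ w - y
    return (r @ r) / len(y)

def mse_gradient(X, y, w):
    """Gradient of the MSE loss: (2/n) * X^T (Xw - y)."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def gradient_descent(X, y, alpha=0.1, max_iters=1000, tol=1e-10):
    w = np.zeros(X.shape[1])                    # zero initialization (fine for convex MSE)
    prev_loss = mse_loss(X, y, w)
    for _ in range(max_iters):
        w = w - alpha * mse_gradient(X, y, w)   # update: w <- w - alpha * grad J(w)
        loss = mse_loss(X, y, w)
        if abs(prev_loss - loss) < tol:         # terminate when the loss plateaus
            break
        prev_loss = loss
    return w

# Tiny synthetic check: data generated as y ≈ 3*x + 1.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, 100), np.ones(100)])  # feature + bias column
y = X @ np.array([3.0, 1.0]) + 0.1 * rng.normal(size=100)
print(gradient_descent(X, y))  # ≈ [3.0, 1.0]
```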

Visual Intuition

  • In a one-dimensional case, the gradient is the slope of the tangent at a point on the function curve.
  • In higher dimensions, the gradient is a vector pointing in the direction of the steepest ascent.
  • By moving opposite to the gradient, we descend towards the function's minimum.

Choosing the Learning Rate

  • Too Small ($\alpha$ is small):
    • Slow convergence.
    • May get stuck in local minima or stall on plateaus.
  • Too Large ($\alpha$ is large):
    • May overshoot the minimum.
    • Can cause divergence or oscillations.
  • Adaptive Methods:
    • Adjust the learning rate during training.
    • Use techniques like learning rate decay or adaptive learning rates per parameter.
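As a quick, hypothetical illustration of these regimes, consider $J(w) = w^2$ (gradient $2w$), where the iterates obey $w_{t+1} = (1 - 2\alpha)\, w_t$:

```python
# Effect of the learning rate on J(w) = w^2, whose gradient is 2w.
for alpha in (0.01, 0.1, 1.1):        # too small, reasonable, too large
    w = 5.0
    for _ in range(20):
        w -= alpha * 2 * w            # gradient descent step
    print(f"alpha={alpha}: w after 20 steps = {w:.4f}")
# alpha=0.01 crawls toward 0, alpha=0.1 converges quickly,
# and alpha=1.1 diverges because |1 - 2*alpha| > 1.
```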

Initialization and Termination Criteria

Initialization Strategies

  • Zero Initialization:
    • Simple but may not break symmetry in certain models.
    • Suitable for linear regression since it's a convex problem.
  • Random Initialization:
    • Small random values (e.g., drawn from a uniform or normal distribution).
    • Helps in non-convex problems to avoid symmetry and provide diverse starting points.

Termination Conditions

  • Fixed Number of Iterations:
    • Set a predefined number of iterations or epochs.
  • Convergence of Loss Function:
    • Stop when the change in the loss function between iterations falls below a threshold.
  • Gradient Norm:
    • Terminate when the magnitude of the gradient is below a certain level, indicating a flat region.
  • Validation Performance:
    • Use a separate validation set to monitor performance.
    • Stop when the validation loss stops decreasing or starts increasing (early stopping).
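Here is a sketch of how several of these criteria might be combined in a training loop; the helper is illustrative, and `grad` is assumed to come from your loss:

```python
import numpy as np

def should_stop(t, loss, prev_loss, grad,
                max_iters=1000, loss_tol=1e-8, grad_tol=1e-6):
    """Return True when any common termination criterion fires."""
    if t >= max_iters:                       # fixed iteration budget
        return True
    if abs(prev_loss - loss) < loss_tol:     # loss change has flattened out
        return True
    if np.linalg.norm(grad) < grad_tol:      # gradient nearly zero: flat region
        return True
    return False
```

Validation-based early stopping would additionally track the best validation loss seen so far and stop after it fails to improve for a set number of checks.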

Newton's Method

Overview

Newton's method is an optimization algorithm that uses second-order derivatives (the Hessian) to find the minimum of a function more rapidly than gradient descent.

Mathematical Formulation

The update rule for Newton's method is:

$$\mathbf{w}_{t} = \mathbf{w}_{t-1} - [\nabla^2 J(\mathbf{w}_{t-1})]^{-1} \nabla J(\mathbf{w}_{t-1})$$

  • $\nabla^2 J(\mathbf{w}_{t-1})$: Hessian matrix of second-order partial derivatives.
  • Multiplying by the inverse Hessian rescales the gradient, adjusting both the step size and direction.
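For the linear-regression MSE loss the Hessian is constant, $\nabla^2 J = \frac{2}{n} X^\top X$, so a single Newton step lands exactly on the least-squares solution. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def newton_step(X, y, w):
    """One Newton update for MSE: w - H^{-1} grad, with H = (2/n) X^T X."""
    n = len(y)
    grad = 2.0 * X.T @ (X @ w - y) / n
    hessian = 2.0 * X.T @ X / n
    # Solve H p = grad rather than forming H^{-1} explicitly: cheaper and more stable.
    return w - np.linalg.solve(hessian, grad)
```

For general (non-quadratic) losses the Hessian changes at every iterate and must be recomputed and re-solved each step, which is where the costs discussed below come from.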

Advantages

  • Faster Convergence:
    • Quadratic convergence near the minimum.
    • Requires fewer iterations than gradient descent.

Disadvantages

  • Computationally Intensive:
    • Calculating and inverting the Hessian matrix is expensive, especially for high-dimensional data.
  • Memory Usage:
    • Storing the Hessian can be impractical for large models.
  • Not Suitable for Large-Scale Problems:
    • Gradient descent is often preferred for large datasets and models.

Practical Considerations in Gradient Descent

Learning Rate Scheduling

Adjusting the learning rate during training can improve convergence.

  • Decay Schedules:
    • Exponential Decay: $\alpha_t = \alpha_0 \times e^{-\lambda t}$
    • Step Decay: Reduce $\alpha$ by a factor every few epochs.
    • Adaptive Methods: Adjust $\alpha$ based on the behavior of the loss function or gradients.
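The first two schedules as simple Python helpers (the parameter names are illustrative):

```python
import math

def exponential_decay(alpha0, lam, t):
    """alpha_t = alpha0 * exp(-lam * t)."""
    return alpha0 * math.exp(-lam * t)

def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (epoch // epochs_per_drop)
```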

Warm-Up and Cool-Down Phases

  • Warm-Up:
    • Start with a smaller learning rate and gradually increase it.
    • Helps in stabilizing the training in the initial phases.
  • Cool-Down:
    • Gradually decrease the learning rate towards the end of training.
    • Allows fine-tuning and settling into a minimum.
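One common way to realize both phases is a linear warm-up followed by a cosine cool-down; the specific shapes below are an assumption, not something the chapter prescribes:

```python
import math

def warmup_cosine(alpha_max, t, warmup_steps, total_steps):
    """Linear warm-up to alpha_max, then cosine decay toward zero."""
    if t < warmup_steps:
        return alpha_max * (t + 1) / warmup_steps              # ramp up
    progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    return alpha_max * 0.5 * (1.0 + math.cos(math.pi * progress))  # cool down
```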

Momentum

Momentum incorporates past gradients to smooth out updates and accelerate convergence.

  • Update Rule with Momentum:

    $$\begin{align*}
    \mathbf{v}_t &= \beta \mathbf{v}_{t-1} + (1 - \beta) \nabla J(\mathbf{w}_{t-1}) \\
    \mathbf{w}_t &= \mathbf{w}_{t-1} - \alpha \mathbf{v}_t
    \end{align*}$$
  • Benefits:

    • Reduces oscillations.
    • Helps navigate ravines and narrow valleys in the loss surface.
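The update rule above as a sketch; `grad_fn` is a placeholder for the gradient of your loss:

```python
def momentum_step(w, v, grad_fn, alpha=0.1, beta=0.9):
    """One momentum update: v <- beta*v + (1-beta)*grad, then w <- w - alpha*v."""
    grad = grad_fn(w)
    v = beta * v + (1 - beta) * grad   # exponential moving average of gradients
    w = w - alpha * v
    return w, v
```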

Adaptive Learning Rates

Adjust the learning rate for each parameter individually.

  • Algorithms:
    • AdaGrad: Adapts learning rate based on the cumulative sum of squared gradients.
    • RMSProp: Uses a moving average of squared gradients.
    • Adam: Combines momentum and adaptive learning rates.
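As one member of this family, here is a minimal Adam update in the standard published form; the defaults are the usual ones, and `grad` is assumed to come from your loss:

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus per-parameter scaling (v); t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: smoothed gradient
    v = beta2 * v + (1 - beta2) * grad**2     # second moment: smoothed squared gradient
    m_hat = m / (1 - beta1**t)                # bias corrections for zero initialization
    v_hat = v / (1 - beta2**t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```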

Mini-Batch and Stochastic Gradient Descent

Full-Batch Gradient Descent

  • Definition:
    • Uses the entire training dataset to compute the gradient at each iteration.
  • Pros:
    • Accurate gradient estimation.
  • Cons:
    • Computationally expensive for large datasets.
    • Updates are infrequent.

Stochastic Gradient Descent (SGD)

  • Definition:
    • Uses one randomly selected training example to compute the gradient at each iteration.
  • Pros:
    • Faster, more frequent updates.
    • Can escape local minima due to noise in the updates.
  • Cons:
    • High variance in gradient estimates.
    • May not converge smoothly.

Mini-Batch Gradient Descent

  • Definition:
    • Uses a small subset (mini-batch) of training data to compute the gradient.
  • Batch Size:
    • Typically ranges from 16 to 512 examples.
  • Pros:
    • Balances between speed of SGD and stability of full-batch gradient descent.
    • Efficient computation using vectorization and parallelism.
  • Cons:
    • Requires choosing an appropriate batch size.

Algorithm Steps for Mini-Batch Gradient Descent

  1. Shuffle Data:
    • Randomly shuffle the training data to ensure batches are representative.
  2. Divide into Mini-Batches:
    • Split data into batches of size $B$.
  3. Iterate Over Batches:
    • For each mini-batch $b$:
      • Compute the gradient $\nabla J_b(\mathbf{w}_{t-1})$.
      • Update parameters: $\mathbf{w}_{t} = \mathbf{w}_{t-1} - \alpha \nabla J_b(\mathbf{w}_{t-1})$
  4. Repeat for Multiple Epochs:
    • Continue iterating over the dataset multiple times.
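These four steps for one epoch, sketched in NumPy (`grad_fn` plays the role of $\nabla J_b$; names and defaults are illustrative):

```python
import numpy as np

def minibatch_epoch(X, y, w, grad_fn, alpha=0.1, B=32, rng=None):
    """One pass over the data in shuffled mini-batches of size B."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(len(y))                    # 1. shuffle
    for start in range(0, len(y), B):                 # 2. split into mini-batches
        idx = perm[start:start + B]
        w = w - alpha * grad_fn(X[idx], y[idx], w)    # 3. per-batch gradient step
    return w                                          # 4. caller repeats over epochs
```

Setting `B=1` recovers SGD, and `B=len(y)` recovers full-batch gradient descent.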

Advantages

  • Computational Efficiency:
    • Takes advantage of matrix operations and parallel computing.
  • Generalization:
    • Noise in gradient estimates can help avoid overfitting.
  • Scalability:
    • Suitable for large datasets.

Challenges in Gradient-Based Optimization

Choosing the Right Learning Rate

  • Experiment with different learning rates.
  • Use techniques like cross-validation to select the best value.
  • Monitor training loss for signs of divergence or slow convergence.

Dealing with Local Minima and Saddle Points

  • Non-convex loss surfaces can have multiple local minima and saddle points.
  • Strategies:
    • Use random restarts.
    • Apply momentum or adaptive learning rates.
    • Increase batch size for smoother gradients.

Handling Ill-Conditioned Problems

  • Loss surfaces with steep and flat regions can cause slow convergence.
  • Adaptive methods and second-order information can help.
  • Preprocessing data (e.g., normalization) may improve conditioning.
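For example, a standardization helper that uses training-set statistics only (a hypothetical preprocessing step, not from the chapter):

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale features to zero mean, unit variance using training statistics only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-12   # guard against zero-variance features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```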

Practical Tips

  • Monitor Training Progress:
    • Plot training and validation loss over time.
    • Look for signs of overfitting or underfitting.
  • Parameter Initialization:
    • Use heuristics or initialization methods suitable for the specific model.
  • Regularization:
    • Apply techniques like L1 or L2 regularization to prevent overfitting.
  • Data Preprocessing:
    • Normalize or standardize input features to improve convergence.
  • Debugging:
    • Start with a small dataset to ensure the implementation is correct.
    • Gradually scale up to larger datasets.

Conclusion

Gradient-based optimization is a cornerstone of training machine learning models. Understanding how to implement and adjust gradient descent and its variants is essential for effective model training.

  • Gradient Descent:
    • Simple and widely applicable.
    • Requires careful tuning of the learning rate.
  • Newton's Method:
    • Offers faster convergence but is computationally intensive.
  • Mini-Batch and Stochastic Methods:
    • Provide a balance between computational efficiency and convergence stability.
  • Adaptive Techniques:
    • Address challenges like learning rate selection and ill-conditioned problems.

In practice, a combination of these methods, along with domain-specific adjustments, leads to effective optimization strategies in machine learning.

Recap

In this chapter, we've covered:

  • Gradient Descent Algorithm: The fundamental method for optimizing differentiable loss functions.
  • Learning Rate and Initialization: How to choose learning rates and initialize parameters effectively.
  • Newton's Method: An introduction to a second-order optimization technique and its practical limitations.
  • Mini-Batch and Stochastic Gradient Descent: Variations of gradient descent suited for large datasets.
  • Practical Considerations: Techniques like momentum, adaptive learning rates, and addressing optimization challenges.