MAP Estimation and Hyperparameter Tuning

In this chapter, you'll learn about:

Maximum A Posteriori (MAP) Estimation: Understanding the Bayesian approach to parameter estimation.
Connection Between MAP and Regularization: Interpreting regularization as a form of MAP estimation with specific priors.
Hyperparameter Tuning: Strategies for selecting hyperparameters like regularization coefficients.
Cross-Validation: Techniques to assess model performance and avoid overfitting.
Practical Considerations: Best practices in splitting data and tuning hyperparameters.

In previous chapters, we explored regularization techniques like L2 (ridge regression) and L1 (lasso regression) to prevent overfitting by penalizing large weights. We also discussed constrained optimization and how regularization can be incorporated into the loss function.

In this chapter, we delve into the Bayesian interpretation of regularization through Maximum A Posteriori (MAP) estimation. We will see how MAP estimation provides a probabilistic framework for incorporating prior beliefs about the parameters. Additionally, we'll discuss strategies for hyperparameter tuning, including cross-validation methods, to optimize model performance.

Maximum A Posteriori (MAP) Estimation

Recap of Regularized Loss Function

Consider the L2-regularized loss function for linear regression:

J(\mathbf{w}) = \frac{1}{2M} \sum_{m=1}^{M} \left( y^{(m)} - t^{(m)} \right)^2 + \frac{\lambda}{2} \| \mathbf{w} \|_2^2

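$y^{(m)} = \mathbf{w}^\top \mathbf{x}^{(m)}$ : Model prediction for the $m$-th sample.
$t^{(m)}$ : Target for the $m$-th sample.
$\mathbf{w}$ : Weight vector.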
$\lambda$ : Regularization parameter.
$M$ : Number of training samples.
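
As a minimal sketch, the loss above can be evaluated directly in plain Python. The function name `ridge_loss` and the toy data are illustrative assumptions, and predictions are taken to be $y^{(m)} = \mathbf{w}^\top \mathbf{x}^{(m)}$, matching the linear model used in this chapter.

```python
def ridge_loss(w, X, t, lam):
    """L2-regularized squared-error loss J(w) for a linear model.

    w: list of weights, X: list of input vectors, t: list of targets,
    lam: regularization parameter (lambda).
    """
    M = len(X)
    # Predictions y^(m) = w^T x^(m)
    y = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
    data_term = sum((ym - tm) ** 2 for ym, tm in zip(y, t)) / (2 * M)
    penalty = (lam / 2) * sum(wi ** 2 for wi in w)
    return data_term + penalty


# Toy data (made up for illustration): t = 2 * x exactly.
X = [[1.0], [2.0], [3.0]]
t = [2.0, 4.0, 6.0]
print(ridge_loss([2.0], X, t, lam=0.0))  # perfect fit, no penalty
print(ridge_loss([2.0], X, t, lam=0.1))  # same fit, plus (0.1 / 2) * 2.0**2
```

With the fit held fixed, the only change between the two calls is the penalty term, which grows with both $\lambda$ and the squared norm of the weights.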

Maximum Likelihood Estimation (MLE)

Under the assumption that the target variable $t$ is generated as:

t = \mathbf{w}^\top \mathbf{x} + \varepsilon

$\varepsilon$ : Gaussian noise with zero mean and variance $\sigma_\varepsilon^2$ .

The MLE aims to find the parameter $\mathbf{w}$ that maximizes the likelihood of the observed data $D$ :

\hat{\mathbf{w}}_{\text{MLE}} = \arg \max_{\mathbf{w}} \; p(D \mid \mathbf{w})
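
Under the Gaussian noise model, and assuming the $M$ samples are independent (an assumption the derivation needs), the log-likelihood can be written out as:

\log p(D \mid \mathbf{w}) = \sum_{m=1}^{M} \log \mathcal{N}\left( t^{(m)} \mid \mathbf{w}^\top \mathbf{x}^{(m)}, \sigma_\varepsilon^2 \right) = -\frac{M}{2} \log\left( 2\pi \sigma_\varepsilon^2 \right) - \frac{1}{2 \sigma_\varepsilon^2} \sum_{m=1}^{M} \left( t^{(m)} - \mathbf{w}^\top \mathbf{x}^{(m)} \right)^2

The first term does not depend on $\mathbf{w}$, so maximizing the likelihood is the same as minimizing the sum of squared errors. Under this noise model, MLE is ordinary least squares.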

Bayesian Interpretation and MAP Estimation

In the Bayesian framework, we treat $\mathbf{w}$ as a random variable with a prior distribution $p(\mathbf{w})$ . MAP estimation seeks the value of $\mathbf{w}$ that maximizes the posterior distribution given the data:

\hat{\mathbf{w}}_{\text{MAP}} = \arg \max_{\mathbf{w}} \; p(\mathbf{w} \mid D)

Using Bayes' theorem:

p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w}) \, p(\mathbf{w})}{p(D)}

Since $p(D)$ is constant with respect to $\mathbf{w}$ , we can focus on maximizing $p(D \mid \mathbf{w}) \, p(\mathbf{w})$ .

Incorporating the Prior

Assume a Gaussian prior over $\mathbf{w}$ :

p(\mathbf{w}) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi \sigma_w^2}} \exp\left( -\frac{w_i^2}{2 \sigma_w^2} \right)

$\sigma_w^2$ : Variance of the prior distribution.
Zero mean prior ( $\mu = 0$ ).

Derivation of MAP Estimator

Log-Posterior

Compute the log-posterior, collecting every term that does not depend on $\mathbf{w}$ into a constant:

\begin{align*} \log p(\mathbf{w} \mid D) &= \log p(D \mid \mathbf{w}) + \log p(\mathbf{w}) - \log p(D) \\ &= -\frac{1}{2 \sigma_\varepsilon^2} \sum_{m=1}^{M} \left( t^{(m)} - \mathbf{w}^\top \mathbf{x}^{(m)} \right)^2 - \frac{1}{2 \sigma_w^2} \sum_{i=1}^{d} w_i^2 + \text{const} \end{align*}

MAP Objective Function

Maximizing the log-posterior is equivalent to minimizing:

J_{\text{MAP}}(\mathbf{w}) = \frac{1}{2 \sigma_\varepsilon^2} \sum_{m=1}^{M} \left( t^{(m)} - \mathbf{w}^\top \mathbf{x}^{(m)} \right)^2 + \frac{1}{2 \sigma_w^2} \| \mathbf{w} \|_2^2

Connection to Regularization

Let $\lambda = \frac{\sigma_\varepsilon^2}{\sigma_w^2}$ . Then:

J_{\text{MAP}}(\mathbf{w}) = \frac{1}{2 \sigma_\varepsilon^2} \left[ \sum_{m=1}^{M} \left( t^{(m)} - \mathbf{w}^\top \mathbf{x}^{(m)} \right)^2 + \lambda \| \mathbf{w} \|_2^2 \right]

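Since $\sigma_\varepsilon^2$ does not depend on $\mathbf{w}$ , the leading factor $\frac{1}{2 \sigma_\varepsilon^2}$ does not change the minimizer, so minimizing $J_{\text{MAP}}$ is equivalent to minimizing a squared-error loss with L2 regularization. Note that the recap loss $J(\mathbf{w})$ averages the squared errors over $M$ ; matching that normalization exactly requires $\lambda = \frac{\sigma_\varepsilon^2}{M \sigma_w^2}$ .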
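
Setting the gradient of the bracketed objective to zero gives a closed-form solution. Here $\mathbf{X}$ (the $M \times d$ matrix whose rows are the inputs $\mathbf{x}^{(m)}$) and $\mathbf{t}$ (the vector of targets) are notation introduced for this sketch:

-2 \mathbf{X}^\top \left( \mathbf{t} - \mathbf{X} \mathbf{w} \right) + 2 \lambda \mathbf{w} = 0 \quad \Longrightarrow \quad \hat{\mathbf{w}}_{\text{MAP}} = \left( \mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I} \right)^{-1} \mathbf{X}^\top \mathbf{t}

This is the ridge regression solution. As $\lambda \to 0$ (an infinitely wide prior) it approaches the least-squares estimate, provided $\mathbf{X}^\top \mathbf{X}$ is invertible.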

Interpretation

Prior Variance ( $\sigma_w^2$ ):
- Large $\sigma_w^2$ : Weak prior (less regularization), allowing weights to vary more freely.
- Small $\sigma_w^2$ : Strong prior (more regularization), encouraging weights to be small.
Noise Variance ( $\sigma_\varepsilon^2$ ):
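- Enters $\lambda = \frac{\sigma_\varepsilon^2}{\sigma_w^2}$ as the numerator, so it does affect the balance: the noisier the data (larger $\sigma_\varepsilon^2$ ), the more weight the prior receives relative to the data-fit term.
- The overall factor $\frac{1}{2 \sigma_\varepsilon^2}$ only rescales the objective and does not change the minimizer.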
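
The effect of the prior variance can be seen in a one-dimensional sketch. For a scalar weight, minimizing $\sum_m (t^{(m)} - w x^{(m)})^2 + \lambda w^2$ gives $w = \sum_m x^{(m)} t^{(m)} / (\sum_m (x^{(m)})^2 + \lambda)$. The function name `map_weight_1d` and the data below are assumptions made for illustration.

```python
def map_weight_1d(xs, ts, noise_var, prior_var):
    """MAP estimate for t = w * x + noise with prior w ~ N(0, prior_var)."""
    lam = noise_var / prior_var  # lambda = sigma_eps^2 / sigma_w^2
    sxt = sum(x * t for x, t in zip(xs, ts))
    sxx = sum(x * x for x in xs)
    return sxt / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ts = [2.1, 3.9, 6.2]  # made-up targets, roughly t = 2x

# Shrinking the prior variance pulls the estimate toward zero.
for prior_var in [100.0, 1.0, 0.01]:
    w = map_weight_1d(xs, ts, noise_var=1.0, prior_var=prior_var)
    print(f"prior_var={prior_var:>6}: w_MAP={w:.4f}")
```

Only the ratio `noise_var / prior_var` enters the estimate, so the printed weights move from near the least-squares value toward zero as the prior tightens.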

Conjugate Prior

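When the prior and the likelihood are both Gaussian, the posterior is also Gaussian.
A prior with this property, one that yields a posterior in the same family as the prior, is called a conjugate prior for that likelihood. It simplifies the derivations because the posterior is available in closed form.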
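
For the linear-Gaussian model in this chapter, the Gaussian posterior can be written explicitly. Here $\mathbf{X}$ (the $M \times d$ matrix whose rows are the inputs $\mathbf{x}^{(m)}$) and $\mathbf{t}$ (the vector of targets) are notation introduced for this sketch:

p(\mathbf{w} \mid D) = \mathcal{N}\left( \mathbf{w} \mid \mathbf{m}, \mathbf{S} \right), \qquad \mathbf{S}^{-1} = \frac{1}{\sigma_w^2} \mathbf{I} + \frac{1}{\sigma_\varepsilon^2} \mathbf{X}^\top \mathbf{X}, \qquad \mathbf{m} = \frac{1}{\sigma_\varepsilon^2} \mathbf{S} \mathbf{X}^\top \mathbf{t}

Because the mode of a Gaussian equals its mean, $\mathbf{m}$ coincides with $\hat{\mathbf{w}}_{\text{MAP}}$. Multiplying through by $\sigma_\varepsilon^2$ gives $\mathbf{m} = \left( \mathbf{X}^\top \mathbf{X} + \frac{\sigma_\varepsilon^2}{\sigma_w^2} \mathbf{I} \right)^{-1} \mathbf{X}^\top \mathbf{t}$.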

Hyperparameter Tuning

Importance of Hyperparameters

Hyperparameters (e.g., $\lambda$ , learning rate) are not learned during training but significantly affect model performance.
Selecting appropriate hyperparameters is crucial for balancing bias and variance.

Strategies for Hyperparameter Tuning

Grid Search

Define a discrete set of values for each hyperparameter.
Train and evaluate the model for every combination.
Computationally intensive, especially with many hyperparameters.
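
A minimal grid search can be sketched with `itertools.product`. The helper `grid_search` and the stand-in scoring function `toy_validation_loss` are hypothetical; in practice the scoring function would train a model with the given configuration and return its validation loss.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Evaluate score_fn on every combination; return (best_config, best_score).

    grid: dict mapping hyperparameter name -> list of candidate values.
    score_fn: callable taking a config dict and returning a validation loss
              (lower is better).
    """
    names = list(grid)
    best_config, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = score_fn(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Stand-in for "train the model, then measure validation loss".
# This toy function is an assumption so the sketch runs on its own.
def toy_validation_loss(config):
    return (config["lam"] - 0.1) ** 2 + (config["lr"] - 0.01) ** 2


grid = {"lam": [0.001, 0.01, 0.1, 1.0], "lr": [0.001, 0.01, 0.1]}
best_config, best_score = grid_search(grid, toy_validation_loss)
print(best_config, best_score)
```

This grid already requires 4 × 3 = 12 evaluations; each additional hyperparameter multiplies the count by its number of candidate values.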

Random Search

Randomly sample hyperparameter values from predefined distributions.
More efficient than grid search when dealing with high-dimensional hyperparameter spaces.
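
The following sketch samples configurations at random instead of enumerating a grid. Sampling on a logarithmic scale, the ranges chosen, and the names `sample_log_uniform`, `random_search`, and `toy_validation_loss` are all assumptions for illustration.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def random_search(score_fn, n_trials, seed=0):
    """Try n_trials random configurations; return (best_config, best_score)."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = {
            "lam": sample_log_uniform(rng, 1e-4, 1e1),
            "lr": sample_log_uniform(rng, 1e-4, 1e-1),
        }
        score = score_fn(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Stand-in for a real train-and-validate step (lower is better).
def toy_validation_loss(config):
    return (math.log10(config["lam"]) + 1) ** 2 + (math.log10(config["lr"]) + 2) ** 2


best_config, best_score = random_search(toy_validation_loss, n_trials=50)
print(best_config, best_score)
```

The budget (`n_trials`) is fixed up front and does not grow with the number of hyperparameters, which is the main practical difference from a grid.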

Bayesian Optimization

Fit a probabilistic surrogate model (commonly a Gaussian process) to the validation scores observed so far.
Iteratively select the next configuration to try by trading off predicted performance against the surrogate's uncertainty.

Cross-Validation

Need for Validation

Estimate model performance on unseen data.
Detect overfitting: a model that fits the training data well but performs poorly on held-out data has not generalized.

Splitting the Data

Training Set: Used to train the model.
Validation Set: Used to tune hyperparameters and assess model performance during development.
Test Set: Used once to evaluate the final model's performance.
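
A three-way split can be sketched as a shuffle followed by two cuts. The function name `train_val_test_split`, the default fractions, and the fixed seed are assumptions; a random shuffle is only appropriate when samples are independent, and ordered data such as time series would need a chronological split instead.

```python
import random


def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of data and cut it into train/validation/test lists."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

The three lists are disjoint and together cover the whole dataset, so no sample can appear in both training and evaluation.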

K-Fold Cross-Validation

Split the training data into $K$ folds.
For each fold:
- Train on $K-1$ folds.
- Validate on the remaining fold.
Average the performance across folds.
Especially useful when the dataset is too small to set aside a separate validation set.
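
The fold bookkeeping can be sketched as follows. The names `k_fold_indices` and `cross_val_score` are hypothetical, folds are contiguous blocks (so the data should be shuffled beforehand), and `fit_and_score` stands for any user-supplied function that trains on the given training indices and returns a score on the validation indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_val_score(n_samples, k, fit_and_score):
    """Average the validation score returned by fit_and_score over k folds."""
    scores = [fit_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


for train_idx, val_idx in k_fold_indices(n_samples=7, k=3):
    print(train_idx, val_idx)
```

Every sample is used for validation exactly once and for training in the other folds. Calling `k_fold_indices(n, n)` yields one sample per validation fold.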

Leave-One-Out Cross-Validation

A special case of K-fold with $K = M$ (the number of training samples), so each fold holds a single sample.
Computationally expensive: the model is trained $M$ times.

Practical Considerations

Data Splitting Ratios

Large Datasets:
- Training: Majority of the data.
- Validation: A small fraction is enough; a fixed size (e.g., 10,000 samples) often suffices.
- Test: Similar size to validation.
Medium Datasets:
- Training: ~60%
- Validation: ~20%
- Test: ~20%
Small Datasets:
- Use K-fold cross-validation to maximize data usage.

Avoiding Data Leakage

Ensure that the test set remains untouched until the final evaluation.
Do not use test data for hyperparameter tuning.

Hyperparameter Optimization Algorithm

Initialize: Define a range of values for each hyperparameter.
For each hyperparameter configuration:
- Train the model on the training set.
- Evaluate on the validation set.
Select: Choose the hyperparameters that yield the best validation performance.
Retrain: Train the final model on the combined training and validation set using the selected hyperparameters.
Test: Evaluate the final model on the test set.
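
The steps above can be sketched end to end for a scalar-weight ridge model. Everything here is an assumption chosen to keep the example self-contained: the synthetic data, the 60/20/20 split, the candidate values, and the helper names `fit_ridge_1d` and `mse`.

```python
import random


def fit_ridge_1d(xs, ts, lam):
    """Closed-form minimizer of sum (t - w*x)^2 + lam * w^2 for scalar w."""
    return sum(x * t for x, t in zip(xs, ts)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ts):
    """Mean squared error of the prediction w * x."""
    return sum((t - w * x) ** 2 for x, t in zip(xs, ts)) / len(xs)


# Synthetic data (an assumption for illustration): t = 3x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ts = [3 * x + rng.gauss(0, 0.5) for x in xs]

# 60/20/20 split; the samples were generated in random order already.
x_train, t_train = xs[:60], ts[:60]
x_val, t_val = xs[60:80], ts[60:80]
x_test, t_test = xs[80:], ts[80:]

# Initialize: candidate values on a logarithmic scale.
candidates = [0.001, 0.01, 0.1, 1.0, 10.0]

# Train on the training set, evaluate on the validation set.
val_losses = {lam: mse(fit_ridge_1d(x_train, t_train, lam), x_val, t_val)
              for lam in candidates}

# Select: lowest validation loss wins.
best_lam = min(val_losses, key=val_losses.get)

# Retrain on training + validation data with the selected value.
w_final = fit_ridge_1d(x_train + x_val, t_train + t_val, best_lam)

# Test: the test set is touched exactly once, at the very end.
test_loss = mse(w_final, x_test, t_test)
print(f"best lambda={best_lam}, w={w_final:.3f}, test MSE={test_loss:.3f}")
```

The test loss is computed only after `best_lam` is fixed, so it plays no part in the selection.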

Practical Implementation Tips

Tuning Regularization Parameter ( $\lambda$ )

Start with a wide range: Use logarithmic scales (e.g., $\lambda \in \{0.001, 0.01, 0.1, 1, 10\}$ ).
Observe Training and Validation Loss:
- Overfitting: Low training loss but high validation loss. Increase $\lambda$ .
- Underfitting: High training and validation loss. Decrease $\lambda$ .

Learning Rate and Other Hyperparameters

Learning Rate ( $\eta$ ):
- Too small: Slow convergence.
- Too large: May overshoot minima or cause divergence.
Learning Rate Schedules:
- Decay: Reduce learning rate over time.
- Adaptive Methods: Use algorithms like Adam or RMSProp that adjust learning rates per parameter.
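
The effect of the learning rate can be seen on the simplest possible objective. This sketch runs fixed-step gradient descent on $f(w) = w^2$, chosen here purely for illustration; each update multiplies $w$ by $(1 - 2\eta)$.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w^2 (gradient 2w) from w0 with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w
    return w


# Too small, reasonable, and too large for this objective.
for lr in [0.001, 0.1, 1.1]:
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.4g}")
```

With `lr=0.001` the iterate barely moves from its starting point, with `lr=0.1` it approaches the minimum at zero, and with `lr=1.1` the factor $(1 - 2\eta) = -1.2$ has magnitude above one, so the iterate oscillates in sign and grows.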

Handling Multiple Hyperparameters

Grid Search Limitations:
- Becomes impractical with more than a few hyperparameters.
Alternative Methods:
- Random Search: Often more efficient than grid search in high dimensions.
- Sequential Model-Based Optimization (SMBO): Use models to predict performance and guide the search.

Monitoring Overfitting and Underfitting

Plot training and validation loss over epochs.
Signs of Overfitting:
- Training loss continues to decrease.
- Validation loss starts increasing.
Signs of Underfitting:
- Both training and validation loss are high.
- Model is not capturing the underlying patterns.

Early Stopping

Stop training when validation loss stops improving.
Helps prevent overfitting by not over-training the model.
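
A common way to implement this rule is with a patience counter: training stops once the validation loss has gone a set number of epochs without improving on its best value. The `patience` parameter, the function name `early_stopping_epoch`, and the loss values below are assumptions for this sketch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for per-epoch validation losses.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation curve: improves, then starts to rise.
losses = [1.00, 0.80, 0.65, 0.60, 0.62, 0.61, 0.66, 0.70]
print(early_stopping_epoch(losses, patience=3))  # (6, 3)
```

In a real training loop the weights from `best_epoch` would be saved and restored, since the model at the stopping epoch is already past its best validation loss.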
