Bias-Variance Tradeoff
In this chapter, you'll be introduced to:
- Bias and Variance Concepts: Understanding the sources of error in model predictions.
- Unbiased Estimators: Exploring conditions under which an estimator is unbiased.
- Gauss-Markov Theorem: Learning about the properties of the ordinary least squares estimator.
- Bias-Variance Decomposition: Decomposing the expected prediction error into bias and variance components.
- Tradeoff Between Bias and Variance: Understanding how model complexity affects bias and variance.
- Practical Implications: Recognizing the impact of the bias-variance tradeoff on model selection and generalization.
In previous chapters, we discussed linear regression, mean squared error (MSE) loss, and the probabilistic interpretation of the linear regression model using maximum likelihood estimation (MLE). We showed that minimizing the MSE loss is equivalent to maximizing the likelihood under a Gaussian noise assumption.
In this chapter, we delve deeper into the statistical properties of estimators used in linear regression, focusing on the bias-variance tradeoff. This fundamental concept in machine learning and statistics helps us understand how different sources of error contribute to a model's performance, particularly in terms of generalization to unseen data.
Recap of Linear Regression and MLE
Linear Regression Model
The linear regression model predicts a continuous target variable $y$ using a linear function of the input features $\mathbf{x} \in \mathbb{R}^d$:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
- $\mathbf{w}$: Weight vector (parameters).
- $b$: Bias term (intercept).
By including an additional feature with a constant value of 1, we can incorporate the bias term into the weight vector:
$$\hat{y} = \mathbf{w}^\top \mathbf{x}, \quad \mathbf{x} = [1, x_1, \dots, x_d]^\top$$
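As a minimal illustration of this trick (a sketch assuming NumPy; the array shapes are arbitrary choices):
```python
import numpy as np

X = np.random.randn(100, 3)                        # 100 samples, 3 raw features
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a constant-1 column
# A single weight vector of length 4 now absorbs the intercept:
# y_hat = X_aug @ w, where w[0] plays the role of the bias term b.
```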
Mean Squared Error Loss
The MSE loss function measures the average squared difference between the predicted and true target values:
$$\mathcal{L}_{\text{MSE}}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2$$
Maximum Likelihood Estimation
Assuming the target variable is generated by:
$$y = \mathbf{w}^{*\top} \mathbf{x} + \epsilon$$
- $\mathbf{w}^*$: True (but unknown) parameter vector.
- $\epsilon \sim \mathcal{N}(0, \sigma^2)$: Independent Gaussian noise with zero mean and variance $\sigma^2$.
The MLE of $\mathbf{w}$ is obtained by maximizing the likelihood of the observed data, which leads to minimizing the MSE loss.
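For completeness, a short restatement of why this equivalence holds: under the Gaussian noise assumption, the log-likelihood of the training data is
$$\log p(\mathbf{y} \mid X, \mathbf{w}) = \sum_{i=1}^{n} \log \mathcal{N}\left(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2\right) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2.$$
The first term does not depend on $\mathbf{w}$, so maximizing the log-likelihood over $\mathbf{w}$ is equivalent to minimizing $\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2$, i.e., the MSE loss.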
Unbiased Estimators and the Gauss-Markov Theorem
Unbiased Estimators
An estimator $\hat{\theta}$ of a parameter $\theta$ is unbiased if:
$$\mathbb{E}[\hat{\theta}] = \theta$$
This means that, on average, the estimator correctly estimates the true parameter value.
Ordinary Least Squares Estimator
In linear regression, the ordinary least squares (OLS) estimator is given by:
$$\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}$$
- $X \in \mathbb{R}^{n \times d}$: Design matrix containing the input features.
- $\mathbf{y} \in \mathbb{R}^n$: Vector of target values.
Theorem: OLS Estimator is Unbiased
Assumptions:
- Linear Relationship: The true relationship between inputs and targets is linear: $\mathbf{y} = X\mathbf{w}^* + \boldsymbol{\epsilon}$
  - $\mathbf{w}^*$: True parameter vector.
  - $\boldsymbol{\epsilon}$: Vector of independent noise terms with $\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}$.
- Full Rank: $X^\top X$ is invertible.
Proof:
Treating $X$ as fixed and substituting $\mathbf{y} = X\mathbf{w}^* + \boldsymbol{\epsilon}$, the expectation of the OLS estimator is:
$$\mathbb{E}[\hat{\mathbf{w}}] = \mathbb{E}\left[(X^\top X)^{-1} X^\top (X\mathbf{w}^* + \boldsymbol{\epsilon})\right] = \mathbf{w}^* + (X^\top X)^{-1} X^\top \mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{w}^*$$
Therefore, the OLS estimator is unbiased.
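A quick way to see this empirically is a simulation sketch like the following (assuming NumPy; the values of $n$, $d$, $\sigma$, and $\mathbf{w}^*$ are arbitrary illustrative choices). Averaging the OLS estimate over many independent noise draws should recover the true parameters:
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5              # sample size, feature count, noise std (arbitrary)
w_true = np.array([1.0, -2.0, 0.5])    # assumed "true" parameter vector
X = rng.normal(size=(n, d))            # fixed design matrix

estimates = []
for _ in range(2000):                  # many independent noise draws
    y = X @ w_true + rng.normal(0.0, sigma, size=n)
    # OLS via the normal equations: w_hat = (X^T X)^{-1} X^T y
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print("average OLS estimate:", np.mean(estimates, axis=0))  # ~ w_true (unbiasedness)
print("true parameters:     ", w_true)
```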
Gauss-Markov Theorem
The Gauss-Markov theorem states that under the linear model assumptions with zero-mean, uncorrelated errors of equal variance (homoscedasticity), the OLS estimator is the Best Linear Unbiased Estimator (BLUE). This means:
- Best: Has the smallest variance among all unbiased linear estimators.
- Linear: The estimator is a linear function of the observed data.
- Unbiased: The expected value of the estimator equals the true parameter.
Limitations of the Linear Model Assumption
While the OLS estimator is unbiased under the linear model with zero-mean noise, these assumptions may not hold in practice:
- Nonlinear Relationships: The true relationship between inputs and targets may be nonlinear.
- Heteroscedasticity: The variance of the errors may not be constant.
- Model Misspecification: Incorrect model assumptions can lead to biased or inefficient estimators.
Therefore, it's important to consider more general settings and understand how errors decompose in such cases.
Bias-Variance Decomposition
General Setting
Assume the target variable is generated by:
$$y = f(\mathbf{x}) + \epsilon$$
- $f(\mathbf{x})$: True (but unknown) target function.
- $\epsilon$: Noise term with $\mathbb{E}[\epsilon] = 0$ and variance $\sigma^2$.
- $\epsilon$ is independent of $\mathbf{x}$.
Our goal is to analyze the expected prediction error of a model $\hat{f}_{\mathcal{D}}(\mathbf{x})$ trained on a dataset $\mathcal{D}$, when making predictions on unseen data.
Expected Prediction Error
The expected mean squared error (MSE), taken over the test point $(\mathbf{x}, y)$ and the training dataset $\mathcal{D}$, is:
$$\mathbb{E}\left[\left(y - \hat{f}_{\mathcal{D}}(\mathbf{x})\right)^2\right]$$
Decomposing the Error
We can decompose the expected error into bias, variance, and irreducible error components.
Step 1: Expand the Error
Substitute $y = f(\mathbf{x}) + \epsilon$:
$$\mathbb{E}\left[\left(y - \hat{f}_{\mathcal{D}}(\mathbf{x})\right)^2\right] = \mathbb{E}\left[\left(f(\mathbf{x}) + \epsilon - \hat{f}_{\mathcal{D}}(\mathbf{x})\right)^2\right]$$
Step 2: Expand the Square
Since $\mathbb{E}[\epsilon] = 0$ and $\epsilon$ is independent of $\hat{f}_{\mathcal{D}}(\mathbf{x})$, the cross term vanishes:
$$= \mathbb{E}\left[\left(f(\mathbf{x}) - \hat{f}_{\mathcal{D}}(\mathbf{x})\right)^2\right] + \sigma^2$$
Step 3: Decompose the First Term
Introduce the expected model prediction $\bar{f}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\left[\hat{f}_{\mathcal{D}}(\mathbf{x})\right]$. Adding and subtracting $\bar{f}(\mathbf{x})$ inside the square, the cross term again vanishes:
$$\mathbb{E}_{\mathcal{D}}\left[\left(f(\mathbf{x}) - \hat{f}_{\mathcal{D}}(\mathbf{x})\right)^2\right] = \left(f(\mathbf{x}) - \bar{f}(\mathbf{x})\right)^2 + \mathbb{E}_{\mathcal{D}}\left[\left(\hat{f}_{\mathcal{D}}(\mathbf{x}) - \bar{f}(\mathbf{x})\right)^2\right]$$
Step 4: Final Bias-Variance Decomposition
The expected error at a point $\mathbf{x}$ becomes:
$$\mathbb{E}\left[\left(y - \hat{f}_{\mathcal{D}}(\mathbf{x})\right)^2\right] = \underbrace{\left(f(\mathbf{x}) - \bar{f}(\mathbf{x})\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\left[\left(\hat{f}_{\mathcal{D}}(\mathbf{x}) - \bar{f}(\mathbf{x})\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}$$
- Bias at $\mathbf{x}$: $\text{Bias}(\mathbf{x}) = f(\mathbf{x}) - \bar{f}(\mathbf{x})$
- Variance at $\mathbf{x}$: $\text{Var}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\left[\left(\hat{f}_{\mathcal{D}}(\mathbf{x}) - \bar{f}(\mathbf{x})\right)^2\right]$
- Irreducible Error: $\sigma^2$
Interpretation
- Bias measures the error introduced by approximating the true function $f(\mathbf{x})$ with the average model $\bar{f}(\mathbf{x})$.
- Variance measures the variability of the model predictions around their mean due to different training datasets.
- Irreducible Error ($\sigma^2$) is the inherent noise in the data that cannot be reduced by any model.
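To make these three components concrete, here is a minimal Monte Carlo sketch (an illustration only; it assumes NumPy, a true function $f(x) = \sin(2\pi x)$, noise level $\sigma = 0.3$, and a degree-1 polynomial fit, all arbitrary choices). It estimates bias², variance, and irreducible error at a single test point and checks that their sum matches a direct estimate of the expected squared error:
```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                                       # assumed true target function
    return np.sin(2 * np.pi * x)

sigma, n_train, degree, x0 = 0.3, 30, 1, 0.8    # noise std, training size, model degree, test point

def fit_and_predict():
    """Draw a fresh training set D, fit a degree-1 polynomial, predict at x0."""
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    return np.polyval(np.polyfit(x, y, degree), x0)

preds = np.array([fit_and_predict() for _ in range(3000)])

bias_sq = (f(x0) - preds.mean()) ** 2       # (f(x0) - f_bar(x0))^2
variance = preds.var()                      # E_D[(f_hat(x0) - f_bar(x0))^2]
noise = sigma ** 2                          # irreducible error

# Direct Monte Carlo estimate of E[(y - f_hat(x0))^2], with y drawn independently of training.
y0 = f(x0) + rng.normal(0, sigma, preds.size)
direct = np.mean((y0 - preds) ** 2)

print(f"bias^2={bias_sq:.3f}  variance={variance:.3f}  noise={noise:.3f}")
print(f"sum={bias_sq + variance + noise:.3f}  direct estimate={direct:.3f}")
```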
The Bias-Variance Tradeoff
Model Complexity and Bias-Variance
- Simple Models (e.g., linear models):
  - High Bias: Unable to capture complex patterns (underfitting).
  - Low Variance: Stable predictions across different datasets.
- Complex Models (e.g., high-degree polynomials, deep neural networks):
  - Low Bias: Can fit complex patterns.
  - High Variance: Sensitive to training data variations (overfitting).
Tradeoff
- Goal: Find a balance between bias and variance to minimize the total expected error.
- Underfitting: High bias dominates the error.
- Overfitting: High variance dominates the error.
Visual Illustration
[Figure: bias², variance, and total error plotted against model complexity. Source: Wikipedia]
- The total error curve shows the optimal model complexity that minimizes the expected error.
Examples
Polynomial Regression
Suppose the true relationship is quadratic:
$$y = w_0 + w_1 x + w_2 x^2 + \epsilon$$
We consider fitting models of varying degrees (a simulation comparing them follows this list):
- Degree 1 (Linear Model):
  - High Bias: Cannot capture the curvature.
  - Low Variance: Predictions are consistent across datasets.
  - Error: Dominated by bias.
- Degree 2 (Quadratic Model):
  - Low Bias: Correct model specification.
  - Low Variance: Predictions are accurate and consistent.
  - Error: Minimal.
- Degree 10 (High-Degree Polynomial):
  - Low Bias: Can fit complex patterns.
  - High Variance: Predictions vary greatly with different datasets due to overfitting.
  - Error: Dominated by variance.
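The following sketch makes the comparison quantitative (assuming NumPy; the quadratic $f(x) = 1 + 2x - 3x^2$, noise level, and sample sizes are arbitrary illustrative choices). It refits each degree on many freshly drawn training sets and reports the bias² and variance averaged over a test grid; degree 1 should show the largest bias and degree 10 the largest variance:
```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                                    # assumed quadratic ground truth
    return 1.0 + 2.0 * x - 3.0 * x ** 2

sigma, n_train, n_runs = 0.5, 30, 500
x_test = np.linspace(-1, 1, 50)              # evaluation grid

for degree in (1, 2, 10):
    preds = np.empty((n_runs, x_test.size))
    for r in range(n_runs):
        # Fresh training set each run, to expose the variability of the fit.
        x = rng.uniform(-1, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x_test)
    f_bar = preds.mean(axis=0)                     # average prediction over datasets
    bias_sq = np.mean((f(x_test) - f_bar) ** 2)    # bias^2 averaged over the grid
    variance = np.mean(preds.var(axis=0))          # variance averaged over the grid
    print(f"degree {degree:2d}: bias^2={bias_sq:.4f}  variance={variance:.4f}")
```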
Graphical Illustration
- High Bias (Underfitting)
- Optimal Fit
- High Variance (Overfitting)
Note: Images are illustrative examples of underfitting and overfitting.
Practical Implications
Model Selection
- Regularization: Techniques like L1 (Lasso) and L2 (Ridge) regularization add penalties to the loss function to prevent overfitting, effectively controlling variance.
- Cross-Validation: Use methods like k-fold cross-validation to estimate the expected prediction error and select models that generalize well (a minimal example follows this list).
- Ensemble Methods: Techniques like bagging and boosting can reduce variance without significantly increasing bias.
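As a small example combining the first two points (a sketch assuming scikit-learn and NumPy are available; the dataset, polynomial degree, and candidate penalties are arbitrary illustrative choices), the following uses 5-fold cross-validation to compare ridge penalties $\alpha$ for a deliberately over-parameterized polynomial model; larger penalties trade a little bias for a reduction in variance:
```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)

# Hypothetical 1-D dataset with a quadratic trend plus noise.
x = rng.uniform(-1, 1, (200, 1))
y = 1.0 + 2.0 * x[:, 0] - 3.0 * x[:, 0] ** 2 + rng.normal(0, 0.5, 200)

# Compare regularization strengths with 5-fold cross-validation.
for alpha in (1e-3, 0.1, 1.0, 10.0):
    model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=alpha))
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    print(f"alpha={alpha:6.3f}  CV MSE={-scores.mean():.4f}")
```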
Data Considerations
- More Data: Increasing the size of the training data can help reduce variance.
- Feature Engineering: Selecting relevant features can reduce bias by providing the model with more predictive information.
Algorithmic Choices
- Simpler Models: Prefer simpler models when data is limited to avoid high variance.
- Complex Models: Use more complex models when ample data is available and the underlying relationship is complex.
Conclusion
The bias-variance tradeoff is a fundamental concept that explains the relationship between model complexity, generalization performance, and the sources of prediction error. Understanding this tradeoff helps practitioners make informed decisions about model selection, regularization, and hyperparameter tuning to achieve optimal performance.
By balancing bias and variance, we aim to develop models that generalize well to unseen data, which is the ultimate goal of machine learning.
Recap
In this chapter, we've covered:
- Unbiased Estimators: Defined and proved that the OLS estimator is unbiased under certain assumptions.
- Gauss-Markov Theorem: Learned that the OLS estimator is the best linear unbiased estimator.
- Bias-Variance Decomposition: Decomposed the expected prediction error into bias squared, variance, and irreducible error.
- Tradeoff Between Bias and Variance: Explored how model complexity affects bias and variance.
- Practical Implications: Discussed strategies for managing the bias-variance tradeoff in model selection and training.