Bias-Variance Tradeoff
In this chapter, you'll be introduced to:
- Bias and Variance Concepts: Understanding the sources of error in model predictions.
- Unbiased Estimators: Exploring conditions under which an estimator is unbiased.
- Gauss-Markov Theorem: Learning about the properties of the ordinary least squares estimator.
- Bias-Variance Decomposition: Decomposing the expected prediction error into bias and variance components.
- Tradeoff Between Bias and Variance: Understanding how model complexity affects bias and variance.
- Practical Implications: Recognizing the impact of the bias-variance tradeoff on model selection and generalization.
In previous chapters, we discussed linear regression, mean squared error (MSE) loss, and the probabilistic interpretation of the linear regression model using maximum likelihood estimation (MLE). We showed that minimizing the MSE loss is equivalent to maximizing the likelihood under a Gaussian noise assumption.
In this chapter, we delve deeper into the statistical properties of estimators used in linear regression, focusing on the bias-variance tradeoff. This fundamental concept in machine learning and statistics helps us understand how different sources of error contribute to a model's performance, particularly in terms of generalization to unseen data.
Recap of Linear Regression and MLE
Linear Regression Model
The linear regression model predicts a continuous target variable $y$ using a linear function of the input features $\mathbf{x} \in \mathbb{R}^d$:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
- $\mathbf{w}$: Weight vector (parameters).
- $b$: Bias term (intercept).
By including an additional feature with a constant value of 1, we can incorporate the bias term into the weight vector:
$$\hat{y} = \mathbf{w}^\top \mathbf{x}, \quad \mathbf{x} = [1, x_1, \dots, x_d]^\top$$
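As a minimal illustration of this trick (a sketch assuming NumPy; the array shapes are arbitrary choices):
```python
import numpy as np

X = np.random.randn(100, 3)                        # 100 samples, 3 raw features
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a constant-1 column
# A single weight vector of length 4 now absorbs the intercept:
# y_hat = X_aug @ w, where w[0] plays the role of the bias term b.
```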
Mean Squared Error Loss
The MSE loss function measures the average squared difference between the predicted and true target values:
$$\mathcal{L}_{\text{MSE}}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2$$
Maximum Likelihood Estimation
Assuming the target variable is generated by:
$$y = \mathbf{w}^{*\top} \mathbf{x} + \epsilon$$
- $\mathbf{w}^*$: True (but unknown) parameter vector.
- $\epsilon \sim \mathcal{N}(0, \sigma^2)$: Independent Gaussian noise with zero mean and variance $\sigma^2$.
The MLE of $\mathbf{w}$ is obtained by maximizing the likelihood of the observed data, which leads to minimizing the MSE loss.
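For completeness, a short restatement of why this equivalence holds: under the Gaussian noise assumption, the log-likelihood of the training data is
$$\log p(\mathbf{y} \mid X, \mathbf{w}) = \sum_{i=1}^{n} \log \mathcal{N}\left(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2\right) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2.$$
The first term does not depend on $\mathbf{w}$, so maximizing the log-likelihood over $\mathbf{w}$ is equivalent to minimizing $\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2$, i.e., the MSE loss.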
Unbiased Estimators and the Gauss-Markov Theorem
Unbiased Estimators
An estimator $\hat{\theta}$ of a parameter $\theta$ is unbiased if:
$$\mathbb{E}[\hat{\theta}] = \theta$$
This means that, on average, the estimator correctly estimates the true parameter value.
Ordinary Least Squares Estimator
In linear regression, the ordinary least squares (OLS) estimator is given by:
$$\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}$$
- $X \in \mathbb{R}^{n \times d}$: Design matrix containing the input features.
- $\mathbf{y} \in \mathbb{R}^n$: Vector of target values.
Theorem: OLS Estimator is Unbiased
Assumptions:
- Linear Relationship: The true relationship between inputs and targets is linear: $\mathbf{y} = X\mathbf{w}^* + \boldsymbol{\epsilon}$
  - $\mathbf{w}^*$: True parameter vector.
  - $\boldsymbol{\epsilon}$: Vector of independent noise terms with $\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}$.
- Full Rank: $X^\top X$ is invertible.
Proof:
Treating $X$ as fixed and substituting $\mathbf{y} = X\mathbf{w}^* + \boldsymbol{\epsilon}$, the expectation of the OLS estimator is:
$$\mathbb{E}[\hat{\mathbf{w}}] = \mathbb{E}\left[(X^\top X)^{-1} X^\top (X\mathbf{w}^* + \boldsymbol{\epsilon})\right] = \mathbf{w}^* + (X^\top X)^{-1} X^\top \mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{w}^*$$
Therefore, the OLS estimator is unbiased.
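A quick way to see this empirically is a simulation sketch like the following (assuming NumPy; the values of $n$, $d$, $\sigma$, and $\mathbf{w}^*$ are arbitrary illustrative choices). Averaging the OLS estimate over many independent noise draws should recover the true parameters:
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5              # sample size, feature count, noise std (arbitrary)
w_true = np.array([1.0, -2.0, 0.5])    # assumed "true" parameter vector
X = rng.normal(size=(n, d))            # fixed design matrix

estimates = []
for _ in range(2000):                  # many independent noise draws
    y = X @ w_true + rng.normal(0.0, sigma, size=n)
    # OLS via the normal equations: w_hat = (X^T X)^{-1} X^T y
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print("average OLS estimate:", np.mean(estimates, axis=0))  # ~ w_true (unbiasedness)
print("true parameters:     ", w_true)
```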
Gauss-Markov Theorem
The Gauss-Markov theorem states that under the linear model assumptions with zero-mean, uncorrelated errors of equal variance (homoscedasticity), the OLS estimator is the Best Linear Unbiased Estimator (BLUE). This means:
- Best: Has the smallest variance among all unbiased linear estimators.
- Linear: The estimator is a linear function of the observed data.
- Unbiased: The expected value of the estimator equals the true parameter.
Limitations of the Linear Model Assumption
While the OLS estimator is unbiased under the linear model with zero-mean noise, these assumptions may not hold in practice:
- Nonlinear Relationships: The true relationship between inputs and targets may be nonlinear.
- Heteroscedasticity: The variance of the errors may not be constant.
- Model Misspecification: Incorrect model assumptions can lead to biased or inefficient estimators.
Therefore, it's important to consider more general settings and understand how errors decompose in such cases.
Bias-Variance Decomposition
General Setting
Assume the target variable is generated by:
$$y = f(\mathbf{x}) + \epsilon$$
- $f(\mathbf{x})$: True (but unknown) target function.
- $\epsilon$: Noise term with $\mathbb{E}[\epsilon] = 0$ and variance $\sigma^2$.
- $\epsilon$ is independent of $\mathbf{x}$.
Our goal is to analyze the expected prediction error of a model $\hat{f}_{\mathcal{D}}(\mathbf{x})$ trained on a dataset $\mathcal{D}$, when making predictions on unseen data.
Expected Prediction Error
The expected mean squared error (MSE), taken over the test point $(\mathbf{x}, y)$ and the training dataset $\mathcal{D}$, is:
$$\mathbb{E}\left[\left(y - \hat{f}_{\mathcal{D}}(\mathbf{x})\right)^2\right]$$
Decomposing the Error
We can decompose the expected error into bias, variance, and irreducible error components.
Step 1: Expand the Error
Substitute $y = f(\mathbf{x}) + \epsilon$:
$$\mathbb{E}\left[\left(y - \hat{f}_{\mathcal{D}}(\mathbf{x})\right)^2\right] = \mathbb{E}\left[\left(f(\mathbf{x}) + \epsilon - \hat{f}_{\mathcal{D}}(\mathbf{x})\right)^2\right]$$
Step 2: Expand the Square
Since $\mathbb{E}[\epsilon] = 0$ and $\epsilon$ is independent of $\hat{f}_{\mathcal{D}}(\mathbf{x})$, the cross term vanishes:
$$= \mathbb{E}\left[\left(f(\mathbf{x}) - \hat{f}_{\mathcal{D}}(\mathbf{x})\right)^2\right] + \sigma^2$$
Step 3: Decompose the First Term
Introduce the expected model prediction $\bar{f}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\left[\hat{f}_{\mathcal{D}}(\mathbf{x})\right]$. Adding and subtracting $\bar{f}(\mathbf{x})$ inside the square, the cross term again vanishes:
$$\mathbb{E}_{\mathcal{D}}\left[\left(f(\mathbf{x}) - \hat{f}_{\mathcal{D}}(\mathbf{x})\right)^2\right] = \left(f(\mathbf{x}) - \bar{f}(\mathbf{x})\right)^2 + \mathbb{E}_{\mathcal{D}}\left[\left(\hat{f}_{\mathcal{D}}(\mathbf{x}) - \bar{f}(\mathbf{x})\right)^2\right]$$
Step 4: Final Bias-Variance Decomposition
The expected error at a point $\mathbf{x}$ becomes:
$$\mathbb{E}\left[\left(y - \hat{f}_{\mathcal{D}}(\mathbf{x})\right)^2\right] = \underbrace{\left(f(\mathbf{x}) - \bar{f}(\mathbf{x})\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\left[\left(\hat{f}_{\mathcal{D}}(\mathbf{x}) - \bar{f}(\mathbf{x})\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}$$
- Bias at $\mathbf{x}$: $\text{Bias}(\mathbf{x}) = f(\mathbf{x}) - \bar{f}(\mathbf{x})$
- Variance at $\mathbf{x}$: $\text{Var}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\left[\left(\hat{f}_{\mathcal{D}}(\mathbf{x}) - \bar{f}(\mathbf{x})\right)^2\right]$
- Irreducible Error: $\sigma^2$
Interpretation
- Bias measures the error introduced by approximating the true function $f(\mathbf{x})$ with the average model $\bar{f}(\mathbf{x})$.
- Variance measures the variability of the model predictions around their mean due to different training datasets.
- Irreducible Error ($\sigma^2$) is the inherent noise in the data that cannot be reduced by any model.
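To make these three components concrete, here is a minimal Monte Carlo sketch (an illustration only; it assumes NumPy, a true function $f(x) = \sin(2\pi x)$, noise level $\sigma = 0.3$, and a degree-1 polynomial fit, all arbitrary choices). It estimates bias², variance, and irreducible error at a single test point and checks that their sum matches a direct estimate of the expected squared error:
```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                                       # assumed true target function
    return np.sin(2 * np.pi * x)

sigma, n_train, degree, x0 = 0.3, 30, 1, 0.8    # noise std, training size, model degree, test point

def fit_and_predict():
    """Draw a fresh training set D, fit a degree-1 polynomial, predict at x0."""
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    return np.polyval(np.polyfit(x, y, degree), x0)

preds = np.array([fit_and_predict() for _ in range(3000)])

bias_sq = (f(x0) - preds.mean()) ** 2       # (f(x0) - f_bar(x0))^2
variance = preds.var()                      # E_D[(f_hat(x0) - f_bar(x0))^2]
noise = sigma ** 2                          # irreducible error

# Direct Monte Carlo estimate of E[(y - f_hat(x0))^2], with y drawn independently of training.
y0 = f(x0) + rng.normal(0, sigma, preds.size)
direct = np.mean((y0 - preds) ** 2)

print(f"bias^2={bias_sq:.3f}  variance={variance:.3f}  noise={noise:.3f}")
print(f"sum={bias_sq + variance + noise:.3f}  direct estimate={direct:.3f}")
```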
The Bias-Variance Tradeoff
Model Complexity and Bias-Variance
- Simple Models (e.g., linear models):
  - High Bias: Unable to capture complex patterns (underfitting).
  - Low Variance: Stable predictions across different datasets.
- Complex Models (e.g., high-degree polynomials, deep neural networks):
  - Low Bias: Can fit complex patterns.
  - High Variance: Sensitive to training data variations (overfitting).
Tradeoff
- Goal: Find a balance between bias and variance to minimize the total expected error.
- Underfitting: High bias dominates the error.
- Overfitting: High variance dominates the error.
Visual Illustration
[Figure: bias², variance, and total error plotted against model complexity. Source: Wikipedia]
- The total error curve shows the optimal model complexity that minimizes the expected error.
Examples
Polynomial Regression
Suppose the true relationship is quadratic:
$$y = w_0 + w_1 x + w_2 x^2 + \epsilon$$
We consider fitting models of varying degrees (a simulation comparing them follows this list):
- Degree 1 (Linear Model):
  - High Bias: Cannot capture the curvature.
  - Low Variance: Predictions are consistent across datasets.
  - Error: Dominated by bias.
- Degree 2 (Quadratic Model):
  - Low Bias: Correct model specification.
  - Low Variance: Predictions are accurate and consistent.
  - Error: Minimal.
- Degree 10 (High-Degree Polynomial):
  - Low Bias: Can fit complex patterns.
  - High Variance: Predictions vary greatly with different datasets due to overfitting.
  - Error: Dominated by variance.
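The following sketch makes the comparison quantitative (assuming NumPy; the quadratic $f(x) = 1 + 2x - 3x^2$, noise level, and sample sizes are arbitrary illustrative choices). It refits each degree on many freshly drawn training sets and reports the bias² and variance averaged over a test grid; degree 1 should show the largest bias and degree 10 the largest variance:
```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                                    # assumed quadratic ground truth
    return 1.0 + 2.0 * x - 3.0 * x ** 2

sigma, n_train, n_runs = 0.5, 30, 500
x_test = np.linspace(-1, 1, 50)              # evaluation grid

for degree in (1, 2, 10):
    preds = np.empty((n_runs, x_test.size))
    for r in range(n_runs):
        # Fresh training set each run, to expose the variability of the fit.
        x = rng.uniform(-1, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x_test)
    f_bar = preds.mean(axis=0)                     # average prediction over datasets
    bias_sq = np.mean((f(x_test) - f_bar) ** 2)    # bias^2 averaged over the grid
    variance = np.mean(preds.var(axis=0))          # variance averaged over the grid
    print(f"degree {degree:2d}: bias^2={bias_sq:.4f}  variance={variance:.4f}")
```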
Graphical Illustration
- High Bias (Underfitting)
- Optimal Fit
- High Variance (Overfitting)
Note: Images are illustrative examples of underfitting and overfitting.
Practical Implications
Model Selection
- Regularization: Techniques like L1 (Lasso) and L2 (Ridge) regularization add penalties to the loss function to prevent overfitting, effectively controlling variance.
- Cross-Validation: Use methods like k-fold cross-validation to estimate the expected prediction error and select models that generalize well (a minimal example follows this list).
- Ensemble Methods: Techniques like bagging and boosting can reduce variance without significantly increasing bias.
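As a small example combining the first two points (a sketch assuming scikit-learn and NumPy are available; the dataset, polynomial degree, and candidate penalties are arbitrary illustrative choices), the following uses 5-fold cross-validation to compare ridge penalties $\alpha$ for a deliberately over-parameterized polynomial model; larger penalties trade a little bias for a reduction in variance:
```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)

# Hypothetical 1-D dataset with a quadratic trend plus noise.
x = rng.uniform(-1, 1, (200, 1))
y = 1.0 + 2.0 * x[:, 0] - 3.0 * x[:, 0] ** 2 + rng.normal(0, 0.5, 200)

# Compare regularization strengths with 5-fold cross-validation.
for alpha in (1e-3, 0.1, 1.0, 10.0):
    model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=alpha))
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    print(f"alpha={alpha:6.3f}  CV MSE={-scores.mean():.4f}")
```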
Data Considerations
- More Data: Increasing the size of the training data can help reduce variance.
- Feature Engineering: Selecting relevant features can reduce bias by providing the model with more predictive information.
Algorithmic Choices
- Simpler Models: Prefer simpler models when data is limited to avoid high variance.
- Complex Models: Use more complex models when ample data is available and the underlying relationship is complex.
Conclusion
The bias-variance tradeoff is a fundamental concept that explains the relationship between model complexity, generalization performance, and the sources of prediction error. Understanding this tradeoff helps practitioners make informed decisions about model selection, regularization, and hyperparameter tuning to achieve optimal performance.
By balancing bias and variance, we aim to develop models that generalize well to unseen data, which is the ultimate goal of machine learning.
Recap
In this chapter, we've covered:
- Unbiased Estimators: Defined and proved that the OLS estimator is unbiased under certain assumptions.
- Gauss-Markov Theorem: Learned that the OLS estimator is the best linear unbiased estimator.
- Bias-Variance Decomposition: Decomposed the expected prediction error into bias squared, variance, and irreducible error.
- Tradeoff Between Bias and Variance: Explored how model complexity affects bias and variance.
- Practical Implications: Discussed strategies for managing the bias-variance tradeoff in model selection and training.