Closed-Form Solution for Mean Squared Error
In this chapter you'll be introduced to:
- Mean Squared Error (MSE) Loss Function: Revisiting the MSE loss in linear regression.
- Convexity of the MSE Loss Function: Proving that the MSE loss is convex in the model parameters.
- Gradient Computation: Deriving the gradient of the MSE loss with respect to the weights.
- Closed-Form Solution: Finding the optimal weights by setting the gradient to zero.
- Matrix Inversion and Pseudo-Inverse: Discussing invertibility of matrices and solutions when matrices are not invertible.
- Limitations of Closed-Form Solutions: Understanding when closed-form solutions are practical and when they may not be suitable.
- Advantages and Drawbacks: Evaluating the benefits and potential issues with using closed-form solutions in regression problems.
In the previous chapters, we explored the fundamentals of linear regression and introduced the mean squared error (MSE) as a loss function for training our models. We also discussed the concept of convexity and its importance in optimization problems, particularly in ensuring that any local minimum is also a global minimum.
In this chapter, we focus on finding the closed-form solution for the linear regression problem by minimizing the MSE loss function. This approach provides an exact solution without the need for iterative optimization algorithms. We'll delve into the mathematical derivation of the closed-form solution, discuss conditions under which it exists, and examine its advantages and limitations.
Revisiting the Mean Squared Error Loss Function
Recall that in linear regression, our goal is to find the weight vector (including the bias term) that minimizes the MSE loss over the training data.
Formulation
Given:
- Training Data: $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.
- Extended Feature Vector: To include the bias term, we augment the input vector with a 1, leading to $\tilde{\mathbf{x}}_i = [1, x_{i,1}, \dots, x_{i,d}]^\top \in \mathbb{R}^{d+1}$.
- Weight Vector: $\mathbf{w} = [w_0, w_1, \dots, w_d]^\top \in \mathbb{R}^{d+1}$, where $w_0$ is the bias term.
The prediction for the $i$-th example is:
$$\hat{y}_i = \mathbf{w}^\top \tilde{\mathbf{x}}_i$$
The MSE loss function is:
$$L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \mathbf{w}^\top \tilde{\mathbf{x}}_i \right)^2$$
Alternatively, in matrix-vector notation:
- Design Matrix: $\mathbf{X} \in \mathbb{R}^{N \times (d+1)}$, where each row is an extended input vector $\tilde{\mathbf{x}}_i^\top$.
- Target Vector: $\mathbf{y} = [y_1, \dots, y_N]^\top \in \mathbb{R}^{N}$.
- Predictions: $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$.
The MSE loss function becomes:
$$L(\mathbf{w}) = \frac{1}{N} \left\| \mathbf{y} - \mathbf{X}\mathbf{w} \right\|^2$$
- $\|\cdot\|$: Denotes the Euclidean (L2) norm.
- $\mathbf{y} - \mathbf{X}\mathbf{w}$: Vector of residuals.
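To make the matrix form concrete, here is a minimal NumPy sketch; the synthetic data and the names `X_raw`, `y`, and `mse_loss` are assumptions of this example, not part of the chapter's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3                      # number of samples and raw features

X_raw = rng.normal(size=(N, d))    # raw feature matrix
y = rng.normal(size=N)             # synthetic targets (illustrative only)

# Design matrix: prepend a column of ones so the first weight acts as the bias.
X = np.hstack([np.ones((N, 1)), X_raw])   # shape (N, d + 1)

def mse_loss(w, X, y):
    """MSE loss L(w) = (1/N) * ||y - X w||^2."""
    residuals = y - X @ w
    return (residuals @ residuals) / len(y)

w = np.zeros(d + 1)                # an arbitrary candidate weight vector
print(mse_loss(w, X, y))
```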
Convexity of the MSE Loss Function
To ensure that we can find the global minimum of the MSE loss function, it's important to establish that the loss function is convex with respect to the weight vector $\mathbf{w}$.
Proving Convexity
We can prove convexity by showing that the second derivative (Hessian) of the loss function with respect to $\mathbf{w}$ is positive semidefinite.
Deriving the Gradient
First, compute the gradient of the loss function:
$$\nabla_{\mathbf{w}} L(\mathbf{w}) = -\frac{2}{N}\,\mathbf{X}^\top \left( \mathbf{y} - \mathbf{X}\mathbf{w} \right)$$
Express the loss function:
$$L(\mathbf{w}) = \frac{1}{N}\left( \mathbf{y} - \mathbf{X}\mathbf{w} \right)^\top \left( \mathbf{y} - \mathbf{X}\mathbf{w} \right) = \frac{1}{N}\left( \mathbf{y}^\top\mathbf{y} - 2\,\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\,\mathbf{w} \right)$$
Compute the gradient with respect to $\mathbf{w}$:
- Derivation:
- Use the facts that the gradient of $\mathbf{a}^\top\mathbf{w}$ with respect to $\mathbf{w}$ is $\mathbf{a}$, and that the gradient of $\mathbf{w}^\top\mathbf{A}\,\mathbf{w}$ with respect to $\mathbf{w}$ is $2\mathbf{A}\mathbf{w}$ for symmetric $\mathbf{A}$.
- Apply the chain rule and properties of matrix calculus:
$$\nabla_{\mathbf{w}} L(\mathbf{w}) = \frac{1}{N}\left( -2\,\mathbf{X}^\top\mathbf{y} + 2\,\mathbf{X}^\top\mathbf{X}\,\mathbf{w} \right) = -\frac{2}{N}\,\mathbf{X}^\top\left( \mathbf{y} - \mathbf{X}\mathbf{w} \right)$$
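As a sanity check on the derivation, the following sketch (again with assumed synthetic data) compares the analytic gradient $-\frac{2}{N}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})$ against a central finite-difference approximation of the loss:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # design matrix with bias column
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

def mse_loss(w):
    r = y - X @ w
    return (r @ r) / N

# Analytic gradient: -(2/N) * X^T (y - X w)
grad_analytic = -(2.0 / N) * X.T @ (y - X @ w)

# Central finite differences as a numerical check of the derivation.
eps = 1e-6
grad_numeric = np.zeros_like(w)
for j in range(w.size):
    e = np.zeros_like(w)
    e[j] = eps
    grad_numeric[j] = (mse_loss(w + e) - mse_loss(w - e)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # expected: True
```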
Computing the Hessian
Compute the Hessian (second derivative) of the loss function:
$$\nabla^2_{\mathbf{w}} L(\mathbf{w}) = \frac{2}{N}\,\mathbf{X}^\top\mathbf{X}$$
$\mathbf{X}^\top\mathbf{X}$ is a symmetric and positive semidefinite matrix:
- Symmetric: $\left( \mathbf{X}^\top\mathbf{X} \right)^\top = \mathbf{X}^\top\mathbf{X}$.
- Positive Semidefinite: For any vector $\mathbf{v}$, we have:
$$\mathbf{v}^\top \mathbf{X}^\top\mathbf{X}\,\mathbf{v} = \left\| \mathbf{X}\mathbf{v} \right\|^2 \geq 0$$
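The positive semidefiniteness of the Hessian can also be checked numerically; in this small sketch (random data, assumed setup), every eigenvalue of $\frac{2}{N}\mathbf{X}^\top\mathbf{X}$ is nonnegative up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])

H = (2.0 / N) * (X.T @ X)          # Hessian of the MSE loss
eigvals = np.linalg.eigvalsh(H)    # eigenvalues of a symmetric matrix

print(eigvals)                     # all nonnegative (up to floating-point error)
print(np.all(eigvals >= -1e-10))   # expected: True
```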
Conclusion
Since the Hessian is positive semidefinite, the MSE loss function is convex with respect to $\mathbf{w}$. This means:
- Any local minimum is a global minimum.
- Setting the gradient to zero will yield the optimal solution.
Finding the Closed-Form Solution
With the convexity established, we can proceed to find the weight vector that minimizes the MSE loss by setting the gradient to zero.
Setting the Gradient to Zero
Set the gradient equal to zero:
$$-\frac{2}{N}\,\mathbf{X}^\top\left( \mathbf{y} - \mathbf{X}\mathbf{w} \right) = \mathbf{0}$$
Multiply both sides by $-\frac{N}{2}$ (since $-\frac{2}{N} \neq 0$):
$$\mathbf{X}^\top\left( \mathbf{y} - \mathbf{X}\mathbf{w} \right) = \mathbf{0}$$
Solving for the Weight Vector
Rewriting the equation:
$$\mathbf{X}^\top\mathbf{y} - \mathbf{X}^\top\mathbf{X}\,\mathbf{w} = \mathbf{0}$$
Equation:
$$\mathbf{X}^\top\mathbf{X}\,\mathbf{w} = \mathbf{X}^\top\mathbf{y}$$
Solution:
$$\mathbf{w}^* = \left( \mathbf{X}^\top\mathbf{X} \right)^{-1}\mathbf{X}^\top\mathbf{y}$$
- $\left( \mathbf{X}^\top\mathbf{X} \right)^{-1}$: Inverse of the matrix $\mathbf{X}^\top\mathbf{X}$.
This formula provides the closed-form solution for the optimal weights in linear regression.
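Here is a minimal sketch of the closed-form solution with NumPy, under assumed synthetic data (`w_true` is a ground-truth vector invented for this toy problem); note that solving the linear system $\mathbf{X}^\top\mathbf{X}\,\mathbf{w} = \mathbf{X}^\top\mathbf{y}$ with `np.linalg.solve` is generally preferable to forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
w_true = np.array([0.5, 2.0, -1.0, 3.0])        # ground-truth weights for this toy problem
y = X @ w_true + 0.1 * rng.normal(size=N)       # noisy targets

# Closed-form solution: w* = (X^T X)^{-1} X^T y.
# Solving the normal equations is more stable than forming the inverse explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

print(w_star)                                   # close to w_true
```

On this toy problem the recovered weights land close to `w_true`, which is what the formula promises whenever $\mathbf{X}^\top\mathbf{X}$ is invertible.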
Conditions for Invertibility
For the inverse $\left( \mathbf{X}^\top\mathbf{X} \right)^{-1}$ to exist, the matrix $\mathbf{X}^\top\mathbf{X}$ must be invertible.
When Is $\mathbf{X}^\top\mathbf{X}$ Invertible?
- Full Rank: If $\mathbf{X}^\top\mathbf{X}$ is of full rank (i.e., rank equal to $d + 1$), then it is invertible.
- Linearly Independent Columns: This requires that the columns of $\mathbf{X}$ (the features) are linearly independent.
Issues with Invertibility
- Duplicate Features: If two or more features are linearly dependent (e.g., one feature is a multiple of another), $\mathbf{X}^\top\mathbf{X}$ is not invertible, as the sketch below illustrates.
- More Features Than Samples: If the number of features exceeds the number of samples ($d + 1 > N$), the matrix $\mathbf{X}^\top\mathbf{X}$ cannot be full rank and is thus not invertible.
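A quick numerical illustration of the duplicate-feature failure mode (a hypothetical example in which the third column is an exact multiple of the second):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100
x1 = rng.normal(size=(N, 1))
X = np.hstack([np.ones((N, 1)), x1, 2.0 * x1])  # third column is exactly twice the second

A = X.T @ X
print(np.linalg.matrix_rank(A))   # 2 rather than 3: A is rank-deficient
print(np.linalg.cond(A))          # enormous: A is (numerically) singular
```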
Using the Moore-Penrose Pseudo-Inverse
When $\mathbf{X}^\top\mathbf{X}$ is not invertible, we can use the Moore-Penrose pseudo-inverse to find a solution.
Definition
The Moore-Penrose pseudo-inverse of a matrix $\mathbf{A}$, denoted as $\mathbf{A}^{+}$, is a generalization of the inverse matrix.
Computing the Pseudo-Inverse Solution
The solution becomes:
$$\mathbf{w}^* = \mathbf{X}^{+}\,\mathbf{y}$$
- $\mathbf{X}^{+}$: Pseudo-inverse of $\mathbf{X}$.
Properties
- Provides the least-norm solution: Among all possible solutions, it finds the one with the smallest Euclidean norm.
- Useful when the system of equations is underdetermined (more unknowns than equations).
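A minimal sketch using NumPy's `np.linalg.pinv` on an assumed underdetermined problem (more features than samples); `np.linalg.lstsq` returns the same minimum-norm solution:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 5, 20                                    # more features than samples (underdetermined)
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)

# Pseudo-inverse solution: w* = X^+ y, the minimum-norm least-squares solution.
w_pinv = np.linalg.pinv(X) @ y

# np.linalg.lstsq also returns the minimum-norm solution for rank-deficient systems.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_pinv, w_lstsq))             # expected: True
print(np.linalg.norm(y - X @ w_pinv))           # ~0: the underdetermined system is fit exactly
```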
Limitations of Closed-Form Solutions
While the closed-form solution is elegant and provides an exact answer under certain conditions, it has limitations:
Computational Complexity
- Matrix Inversion Cost: Computing the inverse of the $(d+1) \times (d+1)$ matrix $\mathbf{X}^\top\mathbf{X}$ has a computational complexity of approximately $O(d^3)$.
- Large Feature Sets: For high-dimensional data (large $d$), matrix inversion becomes computationally expensive.
Numerical Stability
- Ill-Conditioned Matrices: If $\mathbf{X}^\top\mathbf{X}$ is close to singular or ill-conditioned, numerical errors can significantly affect the solution, as illustrated in the sketch after this list.
- Floating-Point Precision: In finite-precision arithmetic, small errors can be magnified during inversion.
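One practical diagnostic is the condition number; the sketch below (with a hypothetical nearly collinear feature) also shows that forming $\mathbf{X}^\top\mathbf{X}$ squares the condition number of $\mathbf{X}$, which is one reason solving the normal equations can be numerically delicate:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 100
x1 = rng.normal(size=(N, 1))
x2 = x1 + 1e-6 * rng.normal(size=(N, 1))   # nearly identical (highly collinear) feature
X = np.hstack([np.ones((N, 1)), x1, x2])

# The 2-norm condition number of X^T X is the square of that of X,
# so forming the normal equations amplifies ill-conditioning.
print(np.linalg.cond(X))         # large
print(np.linalg.cond(X.T @ X))   # roughly the square of the above: much larger
```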
Overfitting with High-Dimensional Data
- More Features Than Samples: When $d + 1 > N$, the model can perfectly fit the training data but generalize poorly to new data.
- Need for Regularization: Regularization techniques (e.g., Ridge Regression) add a penalty term to prevent overfitting but modify the closed-form solution, as sketched below.
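For reference, a minimal sketch of the ridge-modified closed-form solution $\mathbf{w}^* = \left( \mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I} \right)^{-1}\mathbf{X}^\top\mathbf{y}$, with assumed synthetic data and an arbitrary regularization strength `lam` (for simplicity the bias column is penalized here, although implementations often exclude it):

```python
import numpy as np

rng = np.random.default_rng(7)
N, d = 20, 50                                   # more features than samples
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)

lam = 1.0                                       # regularization strength (arbitrary choice)

# Ridge regression: X^T X + lam * I is positive definite (hence invertible) for lam > 0,
# even when d + 1 > N, so the modified closed-form solution always exists.
A = X.T @ X + lam * np.eye(d + 1)
w_ridge = np.linalg.solve(A, X.T @ y)

print(w_ridge.shape)                            # (51,): a well-defined solution
```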
Non-Invertibility Due to Collinearity
- Collinear Features: Exactly collinear features make $\mathbf{X}^\top\mathbf{X}$ non-invertible, and highly correlated features make it nearly singular.
- Solution: Feature selection or dimensionality reduction techniques may be necessary.
Advantages of Closed-Form Solutions
Despite the limitations, closed-form solutions have significant advantages:
Simplicity and Efficiency (for Small $d$)
- Direct Computation: Provides an explicit formula for the optimal weights.
- No Iterative Algorithms Needed: Eliminates the need for gradient descent or other optimization methods.
Exact Solution
- Precision: If computed accurately, yields the precise minimum of the loss function.
- Benchmarking: Useful as a baseline to compare with other algorithms.
Theoretical Insights
- Understanding Model Behavior: Helps in deriving theoretical properties of the estimator.
- Basis for Further Extensions: Forms the foundation for more advanced methods like Ridge Regression and Kernel Methods.
Practical Considerations
When deciding whether to use the closed-form solution:
- Dataset Size: Suitable for small to medium-sized datasets with a moderate number of features.
- Feature Independence: Works best when features are not highly correlated.
- Computational Resources: Ensure that computing resources can handle matrix inversion.
Conclusion
In this chapter, we derived the closed-form solution for linear regression by minimizing the mean squared error loss function. We established the convexity of the loss function, ensuring that setting the gradient to zero yields the global minimum.
While the closed-form solution is powerful and provides exact results under ideal conditions, it is essential to be aware of its limitations, especially concerning matrix invertibility and computational complexity. In practice, for large-scale problems or datasets with high-dimensional features, iterative optimization methods or regularization techniques may be more appropriate.
Understanding the closed-form solution not only equips us with a valuable tool for specific scenarios but also deepens our comprehension of linear models and lays the groundwork for exploring more advanced regression techniques.
Recap
In this chapter, we've covered:
- Mean Squared Error Loss Function: Reviewed the MSE loss in the context of linear regression.
- Convexity of the Loss Function: Proved that the MSE loss is convex with respect to the model parameters.
- Gradient Computation: Derived the gradient and Hessian of the loss function.
- Closed-Form Solution: Found the optimal weight vector by solving a linear system of equations.
- Invertibility and Pseudo-Inverse: Discussed conditions for matrix invertibility and introduced the Moore-Penrose pseudo-inverse.
- Limitations: Identified computational and practical limitations of using the closed-form solution.
- Advantages and Drawbacks: Evaluated when closed-form solutions are beneficial and when alternative methods may be preferable.