
Linear Regression

info

In this chapter you'll be introduced to:

  • Linear Regression: Understanding the fundamentals of linear regression in supervised learning.
  • Hypothesis Class for Linear Regression: Defining the set of functions used in linear regression models.
  • Linear Regression Model: Formulating the linear model with weights and bias.
  • Training Criteria: Introducing the mean squared error loss function used for training linear regression models.
  • Matrix-Vector Representation: Representing linear regression models and loss functions using matrices and vectors for computational efficiency.

Linear Regression

Linear Regression is one of the most basic algorithms in supervised learning, particularly in regression tasks where the target variable is continuous. It involves learning a linear relationship between the input features and the target output.

In a supervised learning setting, we have:

  • Training Data: A set of $M$ examples $\{ (\mathbf{x}^{(m)}, y^{(m)}) \}_{m=1}^M$, where:

    • $\mathbf{x}^{(m)} \in \mathbb{R}^d$ is the input vector for the $m$-th example.
    • $y^{(m)} \in \mathbb{R}$ is the target output for the $m$-th example.
  • Goal: Learn a function (hypothesis) $h$ that maps inputs to outputs:

    $h: \mathbb{R}^d \rightarrow \mathbb{R}$
  • Prediction: For a new input $\mathbf{x}^\prime$, predict the output $\hat{y}^\prime = h(\mathbf{x}^\prime)$.

Hypothesis Class for Linear Regression

The hypothesis class in linear regression consists of all functions that can be represented as a linear combination of the input features plus a bias term. Formally, the hypothesis class $\mathcal{H}$ is defined as:

$$\mathcal{H} = \{ h(\mathbf{x}) \mid h(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b, \; \mathbf{w} \in \mathbb{R}^d, \; b \in \mathbb{R} \}$$

  • $\mathbf{w}$: Weight vector (parameters) of the model.
  • $b$: Bias term (intercept).

By specifying this hypothesis class, we consider all possible affine functions parameterized by $\mathbf{w}$ and $b$.

The Linear Regression Model

The linear regression model predicts the output $\hat{y}$ as:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$

  • $\mathbf{x}$: Input feature vector.
  • $\mathbf{w}$: Weight vector.
  • $b$: Bias term.

Example: One-Dimensional Input ($d=1$)

  • Input: $x \in \mathbb{R}$
  • Model: $y = wx + b$
  • Hypothesis Class: All lines (except vertical lines) in 2D space.
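
As a quick illustration of the prediction rule $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$, here is a minimal NumPy sketch; the parameter values and the input vector are made up for this example.

```python
import numpy as np

# Hypothetical parameters for a model with d = 3 features.
w = np.array([0.5, -1.2, 2.0])   # weight vector
b = 0.7                          # bias term
x = np.array([1.0, 0.0, 3.0])    # one input feature vector

# Prediction: y_hat = w^T x + b
y_hat = w @ x + b
print(y_hat)  # 0.5*1.0 - 1.2*0.0 + 2.0*3.0 + 0.7 = 7.2
```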

Parameters

  • Weights ($\mathbf{w}$): Coefficients for each input feature.
  • Bias ($b$): Allows the regression line (or hyperplane) to shift upward or downward.
[Interactive plot]

Note: The weight and bias are constrained in the example for simplicity of the plot.

Linear vs. Affine Transformations

Linear Transformation

A linear transformation has no bias term:

$$\hat{y} = \mathbf{w}^\top \mathbf{x}$$
  • Properties:
    • Passes through the origin.
    • Does not allow shifting.

Affine Transformation

An affine transformation includes a bias term:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
  • Properties:
    • Allows shifting of the function.
    • More flexible in fitting data.

We can represent affine transformations as linear transformations in a higher-dimensional space by introducing an augmented feature vector.

Augmented Feature Vector

Define:

  • Extended Input Vector: $\tilde{\mathbf{x}} = [1, x_1, x_2, \dots, x_d]^\top \in \mathbb{R}^{d+1}$
  • Extended Weight Vector: $\tilde{\mathbf{w}} = [b, w_1, w_2, \dots, w_d]^\top \in \mathbb{R}^{d+1}$

Then, the model becomes:

$$\hat{y} = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}$$

This representation simplifies notation and allows us to treat the bias term as part of the weight vector.
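
To make this concrete, the sketch below (reusing the made-up numbers from the previous sketch) checks that the augmented form produces exactly the same prediction as $\mathbf{w}^\top \mathbf{x} + b$.

```python
import numpy as np

w = np.array([0.5, -1.2, 2.0])
b = 0.7
x = np.array([1.0, 0.0, 3.0])

# Augmented vectors: prepend 1 to the input and b to the weights.
x_tilde = np.concatenate(([1.0], x))   # [1, x_1, ..., x_d]
w_tilde = np.concatenate(([b], w))     # [b, w_1, ..., w_d]

# w_tilde^T x_tilde equals w^T x + b
assert np.isclose(w_tilde @ x_tilde, w @ x + b)
```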

Training Criteria for Linear Regression

The goal is to find the parameters $\mathbf{w}$ and $b$ (equivalently, the augmented vector $\tilde{\mathbf{w}}$) that minimize a function $J$ measuring the difference between the predicted outputs and the true targets.

$$\tilde{\mathbf{w}}^* = \arg\min_{\tilde{\mathbf{w}}} J(\tilde{\mathbf{w}})$$

The least-squares cost function is commonly used:

$$J(\tilde{\mathbf{w}}) = \frac{1}{2M} \sum_{m=1}^M \left( h(\mathbf{x}^{(m)}; \tilde{\mathbf{w}}) - y^{(m)} \right)^2$$

  • $y^{(m)}$: True outcome for the $m$-th example.
  • $h(\mathbf{x}^{(m)}; \tilde{\mathbf{w}})$: Predicted output for the $m$-th example, with parameters $\tilde{\mathbf{w}}$.
  • $M$: Number of training examples.

This loss function defines what is known as the Ordinary Least Squares (OLS) regression model. We will see in the next few chapters that this algorithm is a special case of a broader family of algorithms.
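
As a small illustration of this criterion, the sketch below evaluates $J$ on a made-up one-dimensional dataset for two choices of parameters; the data and the helper function name are assumptions for this example only.

```python
import numpy as np

# Toy 1-D dataset with M = 4 examples, generated by y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def ols_cost(w, b, x, y):
    """Least-squares cost with the 1/(2M) convention used above."""
    M = len(y)
    residuals = (w * x + b) - y
    return np.sum(residuals ** 2) / (2 * M)

print(ols_cost(2.0, 1.0, x, y))  # 0.0 at the true parameters
print(ols_cost(1.0, 0.0, x, y))  # 3.75 for a poor fit
```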

Matrix Representation

Representing the linear regression model and loss function using matrices and vectors simplifies computations, especially for large datasets.

Design Matrix

The design matrix $\mathbf{X}$ contains all input vectors stacked as rows. When the bias term is included, we prepend a column of ones to $\mathbf{X}$:

$$\begin{aligned} \mathbf{X} &= \begin{bmatrix} - & {\tilde{\mathbf{x}}^{(1)}}^{\top} & - \\ - & {\tilde{\mathbf{x}}^{(2)}}^{\top} & - \\ & \vdots & \\ - & {\tilde{\mathbf{x}}^{(M)}}^{\top} & - \end{bmatrix} \\ &= \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \dots & x_d^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \dots & x_d^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(M)} & x_2^{(M)} & \dots & x_d^{(M)} \end{bmatrix} \quad \in \mathbb{R}^{M \times (d+1)} \end{aligned}$$
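
In code, building the design matrix amounts to stacking the input vectors as rows and prepending a column of ones; here is a minimal sketch with made-up data.

```python
import numpy as np

# Hypothetical raw inputs: M = 3 examples with d = 2 features each.
X_raw = np.array([[0.5, 1.0],
                  [1.5, 2.0],
                  [2.5, 3.0]])
M = X_raw.shape[0]

# Prepend a column of ones to absorb the bias term.
X = np.hstack([np.ones((M, 1)), X_raw])
print(X.shape)  # (3, 3), i.e. M x (d+1)
```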

Weight Vector

$$\tilde{\mathbf{w}} = \begin{bmatrix} b \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix} \quad \in \mathbb{R}^{d+1}$$

Predictions

Compute predictions for all training examples simultaneously:

$$\hat{\mathbf{y}} = \mathbf{X} \tilde{\mathbf{w}}$$

  • $\hat{\mathbf{y}}$: Vector of predicted outputs, $\in \mathbb{R}^M$.

Cost Function

Express the OLS loss in matrix form:

$$J(\tilde{\mathbf{w}}) = \frac{1}{2M} \| \mathbf{X} \tilde{\mathbf{w}} - \mathbf{y} \|^2$$

  • $\mathbf{y}$: Vector of true targets, $\in \mathbb{R}^M$.
  • $\| \cdot \|$: Euclidean norm (L2 norm).
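
The two formulas above translate directly into a couple of vectorized operations. The design matrix, targets, and parameter values in this sketch are made up.

```python
import numpy as np

# Design matrix with a leading column of ones (M = 3, d = 2) and targets.
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 1.5, 2.0],
              [1.0, 2.5, 3.0]])
y = np.array([2.0, 4.0, 6.0])
w_tilde = np.array([0.1, 1.0, 0.5])   # [b, w_1, w_2], arbitrary values

# Predictions for all examples at once: y_hat = X w_tilde
y_hat = X @ w_tilde

# OLS cost in matrix form: J = 1/(2M) * ||X w_tilde - y||^2
M = X.shape[0]
J = np.sum((y_hat - y) ** 2) / (2 * M)
print(y_hat, J)
```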

Finding the Closed-Form Solution

To find the closed-form solution of the minimization problem, we explicitly take the derivative of $J$ with respect to $\tilde{\mathbf{w}}$ and set it to zero:

$$\begin{aligned} \nabla_{\tilde{\mathbf{w}}} J(\tilde{\mathbf{w}}) &= \nabla_{\tilde{\mathbf{w}}} \frac{1}{2M} \| \mathbf{X} \tilde{\mathbf{w}} - \mathbf{y} \|^2 \\ &= \frac{1}{2M} \nabla_{\tilde{\mathbf{w}}} (\mathbf{X} \tilde{\mathbf{w}} - \mathbf{y})^\top (\mathbf{X} \tilde{\mathbf{w}} - \mathbf{y}) \\ &= \frac{1}{2M} \nabla_{\tilde{\mathbf{w}}} \left( \tilde{\mathbf{w}}^\top \mathbf{X}^\top \mathbf{X} \tilde{\mathbf{w}} - 2 \mathbf{y}^\top \mathbf{X} \tilde{\mathbf{w}} + \mathbf{y}^\top \mathbf{y} \right) \\ &= \frac{1}{2M} \left( 2 \mathbf{X}^\top \mathbf{X} \tilde{\mathbf{w}} - 2 \mathbf{X}^\top \mathbf{y} \right) \\ &= \frac{1}{M} \mathbf{X}^\top (\mathbf{X} \tilde{\mathbf{w}} - \mathbf{y}) \end{aligned}$$

Set the gradient equal to zero:

$$\nabla J(\tilde{\mathbf{w}}) = \frac{1}{M} \mathbf{X}^\top (\mathbf{X} \tilde{\mathbf{w}} - \mathbf{y}) = \mathbf{0}$$

Multiply both sides by $M$:

$$\mathbf{X}^\top (\mathbf{X} \tilde{\mathbf{w}} - \mathbf{y}) = \mathbf{0}$$

Rewriting it, we obtain the normal equations:

$$\mathbf{X}^\top \mathbf{X} \tilde{\mathbf{w}} = \mathbf{X}^\top \mathbf{y}$$

With solution:

$$\tilde{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$

  • $(\mathbf{X}^\top \mathbf{X})^{-1}$: Inverse of the matrix $\mathbf{X}^\top \mathbf{X}$.

This formula provides the closed-form solution for the optimal weights in linear regression.
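
The closed-form solution is only a few lines of NumPy. This sketch fits made-up synthetic data; note that solving the normal equations with `np.linalg.solve` is generally preferred over forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3 + 2*x1 - 1*x2 plus a little noise.
M, d = 100, 2
X_raw = rng.normal(size=(M, d))
y = 3 + 2 * X_raw[:, 0] - 1 * X_raw[:, 1] + 0.01 * rng.normal(size=M)

# Design matrix with a leading column of ones.
X = np.hstack([np.ones((M, 1)), X_raw])

# Solve the normal equations X^T X w = X^T y.
w_tilde = np.linalg.solve(X.T @ X, X.T @ y)
print(w_tilde)  # approximately [3, 2, -1] = [b, w_1, w_2]
```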

Conditions for Invertibility

For the inverse $(\mathbf{X}^\top \mathbf{X})^{-1}$ to exist, the matrix $\mathbf{X}^\top \mathbf{X}$ must be invertible.

When Is $\mathbf{X}^\top \mathbf{X}$ Invertible?

  • Full Rank: If $\mathbf{X}^\top \mathbf{X}$ is of full rank (i.e., rank equal to $d+1$), then it is invertible.
  • Linearly Independent Columns: This requires that the columns of $\mathbf{X}$ (features) are linearly independent.

Issues with Invertibility

  • Duplicate Features: If two or more features are linearly dependent (e.g., one feature is a multiple of another), $\mathbf{X}^\top \mathbf{X}$ is not invertible.
  • More Features Than Samples: If the number of features exceeds the number of samples ($d+1 > M$), the matrix cannot be full rank and is thus not invertible.
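
One way to check these conditions numerically is to inspect the rank of the design matrix. The sketch below uses a small made-up matrix with a deliberately duplicated feature.

```python
import numpy as np

# Design matrix whose third column is twice the second: linearly dependent.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 4.0, 8.0]])

print(np.linalg.matrix_rank(X))        # 2 < d+1 = 3, so X^T X is singular
print(np.linalg.matrix_rank(X.T @ X))  # also 2
```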

Using the Moore-Penrose Pseudo-Inverse

When $\mathbf{X}^\top \mathbf{X}$ is not invertible, we can use the Moore-Penrose pseudo-inverse to find a solution.

The Moore-Penrose pseudo-inverse of a matrix $\mathbf{A}$, denoted $\mathbf{A}^+$, is a generalization of the matrix inverse.

The solution becomes:

$$\tilde{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^+ \mathbf{X}^\top \mathbf{y}$$

  • $(\mathbf{X}^\top \mathbf{X})^+$: Pseudo-inverse of $\mathbf{X}^\top \mathbf{X}$.

This method provides the least-norm solution: Among all possible solutions, it finds the one with the smallest Euclidean norm.
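
In NumPy this corresponds to `np.linalg.pinv`. The sketch below reuses the rank-deficient design matrix from the previous example together with a made-up target vector.

```python
import numpy as np

# Rank-deficient design matrix (third column = 2 * second column) and targets.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 4.0, 8.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Pseudo-inverse based solution: w = (X^T X)^+ X^T y
w_tilde = np.linalg.pinv(X.T @ X) @ X.T @ y

# It still satisfies the normal equations, and among all solutions it has
# the smallest Euclidean norm.
assert np.allclose(X.T @ X @ w_tilde, X.T @ y)
print(w_tilde)
```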

Limitations of Closed-Form Solutions

While the closed-form solution is elegant and provides an exact answer under certain conditions, it has limitations:

Computational Complexity

  • Computing the inverse of a matrix has a computational complexity of approximately $\mathcal{O}(d^3)$.
  • For high-dimensional data (large $d$), matrix inversion becomes computationally expensive.

Numerical Stability

  • If $\mathbf{X}^\top \mathbf{X}$ is close to singular or ill-conditioned, numerical errors can significantly affect the solution.
  • In finite-precision arithmetic, small errors can be magnified during inversion.
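
A common way to gauge this in practice is the condition number of $\mathbf{X}^\top \mathbf{X}$. The sketch below constructs two nearly collinear features purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear features make X^T X ill-conditioned.
x1 = np.linspace(0.0, 1.0, 50)
x2 = x1 + 1e-8 * rng.normal(size=50)
X = np.column_stack([np.ones(50), x1, x2])

# A very large condition number means solving the normal equations
# will amplify rounding errors.
print(np.linalg.cond(X.T @ X))
```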

Overfitting with High-Dimensional Data

  • When $d+1 > M$, the model can perfectly fit the training data but generalize poorly to new data.
  • Regularization techniques (e.g., Ridge Regression) add a penalty term to prevent overfitting but modify the closed-form solution.
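
For reference, here is a hedged sketch of the ridge-regularized closed form mentioned above; the penalty strength `lam` is an arbitrary choice, and for simplicity this version also penalizes the bias term, which is often excluded in practice.

```python
import numpy as np

def ridge_closed_form(X, y, lam=1.0):
    """Ridge solution: (X^T X + lam * I)^{-1} X^T y.

    The added penalty makes X^T X + lam * I invertible even when
    X^T X itself is singular or ill-conditioned.
    """
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_params), X.T @ y)
```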

Non-Invertibility Due to Collinearity

  • Exactly collinear features make $\mathbf{X}^\top \mathbf{X}$ singular, and highly correlated features make it ill-conditioned; feature selection or dimensionality reduction techniques may be necessary.

Advantages of Closed-Form Solutions

Despite the limitations, closed-form solutions have significant advantages:

Simplicity and Efficiency (for Small dd)

  • Provides an explicit formula for the optimal weights.
  • Eliminates the need for gradient descent or other optimization methods.

Exact Solution

  • If computed accurately, yields the precise minimum of the loss function.
  • Useful as a baseline to compare with other algorithms.

Recap

1. What is the primary objective of linear regression?

2. What does the hypothesis class for linear regression consist of?

3. How do we find the best parameters to model our data in linear regression?

4. Under which condition is $\mathbf{X}^\top \mathbf{X}$ guaranteed to be invertible?

5. What is a limitation of using the closed-form solution for linear regression?

What's Next