
Linear Regression

info

In this chapter you'll be introduced to:

  • Linear Regression: Understanding the fundamentals of linear regression in supervised learning.
  • Hypothesis Class for Linear Regression: Defining the set of functions used in linear regression models.
  • Linear Regression Model: Formulating the linear model with weights and bias.
  • Training Criteria: Introducing the mean squared error loss function used for training linear regression models.
  • Matrix-Vector Representation: Representing linear regression models and loss functions using matrices and vectors for computational efficiency.

Linear Regression

Linear Regression is one of the most basic algorithms in supervised learning, particularly in regression tasks where the target variable is continuous. It involves learning a linear relationship between the input features and the target output.

In a supervised learning setting, we have:

  • Training Data: A set of $M$ examples $\{ (\mathbf{x}^{(m)}, y^{(m)}) \}_{m=1}^M$, where:

    • $\mathbf{x}^{(m)} \in \mathbb{R}^d$ is the input vector for the $m$-th example.
    • $y^{(m)} \in \mathbb{R}$ is the target output for the $m$-th example.
  • Goal: Learn a function (hypothesis) $h$ that maps inputs to outputs:

    $h: \mathbb{R}^d \rightarrow \mathbb{R}$
  • Prediction: For a new input $\mathbf{x}^\prime$, predict the output $\hat{y}^\prime = h(\mathbf{x}^\prime)$.

Hypothesis Class for Linear Regression

The hypothesis class in linear regression consists of all functions that can be represented as a linear combination of the input features plus a bias term. Formally, the hypothesis class $\mathcal{H}$ is defined as:

$$\mathcal{H} = \{ h(\mathbf{x}) \mid h(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b, \; \mathbf{w} \in \mathbb{R}^d, \; b \in \mathbb{R} \}$$

  • $\mathbf{w}$: Weight vector (parameters) of the model.
  • $b$: Bias term (intercept).

By specifying this hypothesis class, we consider all possible affine functions parameterized by $\mathbf{w}$ and $b$.

The Linear Regression Model

The linear regression model predicts the output $\hat{y}$ as:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$

  • $\mathbf{x}$: Input feature vector.
  • $\mathbf{w}$: Weight vector.
  • $b$: Bias term.

Example: One-Dimensional Input ($d=1$)

  • Input: $x \in \mathbb{R}$
  • Model: $y = wx + b$
  • Hypothesis Class: All lines (except vertical lines) in 2D space.
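
As a quick illustration of the prediction rule $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$, here is a minimal NumPy sketch; the parameter values and the input vector are made up for this example.

```python
import numpy as np

# Hypothetical parameters for a model with d = 3 features.
w = np.array([0.5, -1.2, 2.0])   # weight vector
b = 0.7                          # bias term
x = np.array([1.0, 0.0, 3.0])    # one input feature vector

# Prediction: y_hat = w^T x + b
y_hat = w @ x + b
print(y_hat)  # 0.5*1.0 - 1.2*0.0 + 2.0*3.0 + 0.7 = 7.2
```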

Parameters

  • Weights ($\mathbf{w}$): Coefficients for each input feature.
  • Bias ($b$): Allows the regression line (or hyperplane) to shift upward or downward.
[Interactive plot]

Note: The weight and bias are constrained in the example for simplicity of the plot.

Linear vs. Affine Transformations

Linear Transformation

A linear transformation has no bias term:

$$\hat{y} = \mathbf{w}^\top \mathbf{x}$$
  • Properties:
    • Passes through the origin.
    • Does not allow shifting.

Affine Transformation

An affine transformation includes a bias term:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
  • Properties:
    • Allows shifting of the function.
    • More flexible in fitting data.

We can represent affine transformations as linear transformations in a higher-dimensional space by introducing an augmented feature vector.

Augmented Feature Vector

Define:

  • Extended Input Vector: $\tilde{\mathbf{x}} = [1, x_1, x_2, \dots, x_d]^\top \in \mathbb{R}^{d+1}$
  • Extended Weight Vector: $\tilde{\mathbf{w}} = [b, w_1, w_2, \dots, w_d]^\top \in \mathbb{R}^{d+1}$

Then, the model becomes:

$$\hat{y} = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}$$

This representation simplifies notation and allows us to treat the bias term as part of the weight vector.
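
To make this concrete, the sketch below (reusing the made-up numbers from the previous sketch) checks that the augmented form produces exactly the same prediction as $\mathbf{w}^\top \mathbf{x} + b$.

```python
import numpy as np

w = np.array([0.5, -1.2, 2.0])
b = 0.7
x = np.array([1.0, 0.0, 3.0])

# Augmented vectors: prepend 1 to the input and b to the weights.
x_tilde = np.concatenate(([1.0], x))   # [1, x_1, ..., x_d]
w_tilde = np.concatenate(([b], w))     # [b, w_1, ..., w_d]

# w_tilde^T x_tilde equals w^T x + b
assert np.isclose(w_tilde @ x_tilde, w @ x + b)
```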

Training Criteria for Linear Regression

The goal is to find the parameters $\mathbf{w}$ and $b$ (equivalently, the augmented vector $\tilde{\mathbf{w}}$) that minimize a function $J$ measuring the difference between the predicted outputs and the true targets.

$$\tilde{\mathbf{w}}^* = \arg\min_{\tilde{\mathbf{w}}} J(\tilde{\mathbf{w}})$$

The least-squares cost function is commonly used:

$$J(\tilde{\mathbf{w}}) = \frac{1}{2M} \sum_{m=1}^M \left( h(\mathbf{x}^{(m)}; \tilde{\mathbf{w}}) - y^{(m)} \right)^2$$

  • $y^{(m)}$: True outcome for the $m$-th example.
  • $h(\mathbf{x}^{(m)}; \tilde{\mathbf{w}})$: Predicted output for the $m$-th example, with parameters $\tilde{\mathbf{w}}$.
  • $M$: Number of training examples.

This loss function defines what is known as the Ordinary Least Squares (OLS) regression model. We will see in the next few chapters that this algorithm is a special case of a broader family of algorithms.
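
As a small illustration of this criterion, the sketch below evaluates $J$ on a made-up one-dimensional dataset for two choices of parameters; the data and the helper function name are assumptions for this example only.

```python
import numpy as np

# Toy 1-D dataset with M = 4 examples, generated by y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def ols_cost(w, b, x, y):
    """Least-squares cost with the 1/(2M) convention used above."""
    M = len(y)
    residuals = (w * x + b) - y
    return np.sum(residuals ** 2) / (2 * M)

print(ols_cost(2.0, 1.0, x, y))  # 0.0 at the true parameters
print(ols_cost(1.0, 0.0, x, y))  # 3.75 for a poor fit
```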

Matrix Representation

Representing the linear regression model and loss function using matrices and vectors simplifies computations, especially for large datasets.

Design Matrix

The design matrix $\mathbf{X}$ contains all input vectors stacked as rows. When the bias term is included, we prepend a column of ones to $\mathbf{X}$:

$$\begin{aligned} \mathbf{X} &= \begin{bmatrix} - & {\tilde{\mathbf{x}}^{(1)}}^{\top} & - \\ - & {\tilde{\mathbf{x}}^{(2)}}^{\top} & - \\ & \vdots & \\ - & {\tilde{\mathbf{x}}^{(M)}}^{\top} & - \end{bmatrix} \\ &= \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \dots & x_d^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \dots & x_d^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(M)} & x_2^{(M)} & \dots & x_d^{(M)} \end{bmatrix} \quad \in \mathbb{R}^{M \times (d+1)} \end{aligned}$$
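
In code, building the design matrix amounts to stacking the input vectors as rows and prepending a column of ones; here is a minimal sketch with made-up data.

```python
import numpy as np

# Hypothetical raw inputs: M = 3 examples with d = 2 features each.
X_raw = np.array([[0.5, 1.0],
                  [1.5, 2.0],
                  [2.5, 3.0]])
M = X_raw.shape[0]

# Prepend a column of ones to absorb the bias term.
X = np.hstack([np.ones((M, 1)), X_raw])
print(X.shape)  # (3, 3), i.e. M x (d+1)
```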

Weight Vector

$$\tilde{\mathbf{w}} = \begin{bmatrix} b \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix} \quad \in \mathbb{R}^{d+1}$$

Predictions

Compute predictions for all training examples simultaneously:

$$\hat{\mathbf{y}} = \mathbf{X} \tilde{\mathbf{w}}$$

  • $\hat{\mathbf{y}}$: Vector of predicted outputs, $\in \mathbb{R}^M$.

Cost Function

Express the OLS loss in matrix form:

$$J(\tilde{\mathbf{w}}) = \frac{1}{2M} \| \mathbf{X} \tilde{\mathbf{w}} - \mathbf{y} \|^2$$

  • $\mathbf{y}$: Vector of true targets, $\in \mathbb{R}^M$.
  • $\| \cdot \|$: Euclidean norm (L2 norm).
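
The two formulas above translate directly into a couple of vectorized operations. The design matrix, targets, and parameter values in this sketch are made up.

```python
import numpy as np

# Design matrix with a leading column of ones (M = 3, d = 2) and targets.
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 1.5, 2.0],
              [1.0, 2.5, 3.0]])
y = np.array([2.0, 4.0, 6.0])
w_tilde = np.array([0.1, 1.0, 0.5])   # [b, w_1, w_2], arbitrary values

# Predictions for all examples at once: y_hat = X w_tilde
y_hat = X @ w_tilde

# OLS cost in matrix form: J = 1/(2M) * ||X w_tilde - y||^2
M = X.shape[0]
J = np.sum((y_hat - y) ** 2) / (2 * M)
print(y_hat, J)
```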

Finding the Closed-Form Solution

To find the closed-form solution of the minimization problem, we explicitly take the derivative of $J$ with respect to $\tilde{\mathbf{w}}$ and set it to zero:

$$\begin{aligned} \nabla_{\tilde{\mathbf{w}}} J(\tilde{\mathbf{w}}) &= \nabla_{\tilde{\mathbf{w}}} \frac{1}{2M} \| \mathbf{X} \tilde{\mathbf{w}} - \mathbf{y} \|^2 \\ &= \frac{1}{2M} \nabla_{\tilde{\mathbf{w}}} (\mathbf{X} \tilde{\mathbf{w}} - \mathbf{y})^\top (\mathbf{X} \tilde{\mathbf{w}} - \mathbf{y}) \\ &= \frac{1}{2M} \nabla_{\tilde{\mathbf{w}}} \left( \tilde{\mathbf{w}}^\top \mathbf{X}^\top \mathbf{X} \tilde{\mathbf{w}} - 2 \mathbf{y}^\top \mathbf{X} \tilde{\mathbf{w}} + \mathbf{y}^\top \mathbf{y} \right) \\ &= \frac{1}{2M} \left( 2 \mathbf{X}^\top \mathbf{X} \tilde{\mathbf{w}} - 2 \mathbf{X}^\top \mathbf{y} \right) \\ &= \frac{1}{M} \mathbf{X}^\top (\mathbf{X} \tilde{\mathbf{w}} - \mathbf{y}) \end{aligned}$$

Set the gradient equal to zero:

$$\nabla J(\tilde{\mathbf{w}}) = \frac{1}{M} \mathbf{X}^\top (\mathbf{X} \tilde{\mathbf{w}} - \mathbf{y}) = \mathbf{0}$$

Multiply both sides by $M$:

$$\mathbf{X}^\top (\mathbf{X} \tilde{\mathbf{w}} - \mathbf{y}) = \mathbf{0}$$

Rewriting it, we obtain the normal equations:

$$\mathbf{X}^\top \mathbf{X} \tilde{\mathbf{w}} = \mathbf{X}^\top \mathbf{y}$$

With solution:

$$\tilde{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$

  • $(\mathbf{X}^\top \mathbf{X})^{-1}$: Inverse of the matrix $\mathbf{X}^\top \mathbf{X}$.

This formula provides the closed-form solution for the optimal weights in linear regression.
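
The closed-form solution is only a few lines of NumPy. This sketch fits made-up synthetic data; note that solving the normal equations with `np.linalg.solve` is generally preferred over forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3 + 2*x1 - 1*x2 plus a little noise.
M, d = 100, 2
X_raw = rng.normal(size=(M, d))
y = 3 + 2 * X_raw[:, 0] - 1 * X_raw[:, 1] + 0.01 * rng.normal(size=M)

# Design matrix with a leading column of ones.
X = np.hstack([np.ones((M, 1)), X_raw])

# Solve the normal equations X^T X w = X^T y.
w_tilde = np.linalg.solve(X.T @ X, X.T @ y)
print(w_tilde)  # approximately [3, 2, -1] = [b, w_1, w_2]
```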

Conditions for Invertibility

For the inverse $(\mathbf{X}^\top \mathbf{X})^{-1}$ to exist, the matrix $\mathbf{X}^\top \mathbf{X}$ must be invertible.

When Is $\mathbf{X}^\top \mathbf{X}$ Invertible?

  • Full Rank: If $\mathbf{X}^\top \mathbf{X}$ is of full rank (i.e., rank equal to $d+1$), then it is invertible.
  • Linearly Independent Columns: This requires that the columns of $\mathbf{X}$ (features) are linearly independent.

Issues with Invertibility

  • Duplicate Features: If two or more features are linearly dependent (e.g., one feature is a multiple of another), $\mathbf{X}^\top \mathbf{X}$ is not invertible.
  • More Features Than Samples: If the number of features exceeds the number of samples ($d+1 > M$), the matrix cannot be full rank and is thus not invertible.
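
One way to check these conditions numerically is to inspect the rank of the design matrix. The sketch below uses a small made-up matrix with a deliberately duplicated feature.

```python
import numpy as np

# Design matrix whose third column is twice the second: linearly dependent.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 4.0, 8.0]])

print(np.linalg.matrix_rank(X))        # 2 < d+1 = 3, so X^T X is singular
print(np.linalg.matrix_rank(X.T @ X))  # also 2
```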

Using the Moore-Penrose Pseudo-Inverse

When $\mathbf{X}^\top \mathbf{X}$ is not invertible, we can use the Moore-Penrose pseudo-inverse to find a solution.

The Moore-Penrose pseudo-inverse of a matrix $\mathbf{A}$, denoted $\mathbf{A}^+$, is a generalization of the matrix inverse.

The solution becomes:

$$\tilde{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^+ \mathbf{X}^\top \mathbf{y}$$

  • $(\mathbf{X}^\top \mathbf{X})^+$: Pseudo-inverse of $\mathbf{X}^\top \mathbf{X}$.

This method provides the least-norm solution: Among all possible solutions, it finds the one with the smallest Euclidean norm.
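
In NumPy this corresponds to `np.linalg.pinv`. The sketch below reuses the rank-deficient design matrix from the previous example together with a made-up target vector.

```python
import numpy as np

# Rank-deficient design matrix (third column = 2 * second column) and targets.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 4.0, 8.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Pseudo-inverse based solution: w = (X^T X)^+ X^T y
w_tilde = np.linalg.pinv(X.T @ X) @ X.T @ y

# It still satisfies the normal equations, and among all solutions it has
# the smallest Euclidean norm.
assert np.allclose(X.T @ X @ w_tilde, X.T @ y)
print(w_tilde)
```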

Limitations of Closed-Form Solutions

While the closed-form solution is elegant and provides an exact answer under certain conditions, it has limitations:

Computational Complexity

  • Computing the inverse of a matrix has a computational complexity of approximately $\mathcal{O}(d^3)$.
  • For high-dimensional data (large $d$), matrix inversion becomes computationally expensive.

Numerical Stability

  • If $\mathbf{X}^\top \mathbf{X}$ is close to singular or ill-conditioned, numerical errors can significantly affect the solution.
  • In finite-precision arithmetic, small errors can be magnified during inversion.
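
A common way to gauge this in practice is the condition number of $\mathbf{X}^\top \mathbf{X}$. The sketch below constructs two nearly collinear features purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear features make X^T X ill-conditioned.
x1 = np.linspace(0.0, 1.0, 50)
x2 = x1 + 1e-8 * rng.normal(size=50)
X = np.column_stack([np.ones(50), x1, x2])

# A very large condition number means solving the normal equations
# will amplify rounding errors.
print(np.linalg.cond(X.T @ X))
```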

Overfitting with High-Dimensional Data

  • When $d+1 > M$, the model can perfectly fit the training data but generalize poorly to new data.
  • Regularization techniques (e.g., Ridge Regression) add a penalty term to prevent overfitting but modify the closed-form solution.
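
For reference, here is a hedged sketch of the ridge-regularized closed form mentioned above; the penalty strength `lam` is an arbitrary choice, and for simplicity this version also penalizes the bias term, which is often excluded in practice.

```python
import numpy as np

def ridge_closed_form(X, y, lam=1.0):
    """Ridge solution: (X^T X + lam * I)^{-1} X^T y.

    The added penalty makes X^T X + lam * I invertible even when
    X^T X itself is singular or ill-conditioned.
    """
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_params), X.T @ y)
```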

Non-Invertibility Due to Collinearity

  • Exactly collinear features make $\mathbf{X}^\top \mathbf{X}$ singular, and highly correlated features make it ill-conditioned; feature selection or dimensionality reduction techniques may be necessary.

Advantages of Closed-Form Solutions

Despite the limitations, closed-form solutions have significant advantages:

Simplicity and Efficiency (for Small dd)

  • Provides an explicit formula for the optimal weights.
  • Eliminates the need for gradient descent or other optimization methods.

Exact Solution

  • If computed accurately, yields the precise minimum of the loss function.
  • Useful as a baseline to compare with other algorithms.

Recap

1. What is the primary objective of linear regression?

2. What does the hypothesis class for linear regression consist of?

3. How do we find the best parameters to model our data in linear regression?

4. Under which condition is $\mathbf{X}^\top \mathbf{X}$ guaranteed to be invertible?

5. What is a limitation of using the closed-form solution for linear regression?

What's Next