Linear Regression
In this chapter you'll be introduced to:
- Linear Regression: Understanding the fundamentals of linear regression in supervised learning.
- Hypothesis Class for Linear Regression: Defining the set of functions used in linear regression models.
- Linear Regression Model: Formulating the linear model with weights and bias.
- Training Criteria: Introducing the mean squared error loss function used for training linear regression models.
- Matrix-Vector Representation: Representing linear regression models and loss functions using matrices and vectors for computational efficiency.
Linear Regression
Linear Regression is one of the most basic algorithms in supervised learning, particularly for regression tasks where the target variable is continuous. It involves learning a linear relationship between the input features and the target output.
In a supervised learning setting, we have:
- Training Data: A set of examples $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, where:
  - $\mathbf{x}^{(i)} \in \mathbb{R}^d$ is the input vector for the $i$-th example.
  - $y^{(i)} \in \mathbb{R}$ is the target output for the $i$-th example.
- Goal: Learn a function (hypothesis) $h: \mathbb{R}^d \to \mathbb{R}$ that maps inputs to outputs: $\hat{y} = h(\mathbf{x})$.
- Prediction: For a new input $\mathbf{x}$, predict the output $\hat{y} = h(\mathbf{x})$.
Hypothesis Class for Linear Regression
The hypothesis class in linear regression consists of all functions that can be represented as a linear combination of the input features plus a bias term. Formally, the hypothesis class is defined as:
$$\mathcal{H} = \left\{ h(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b \;\middle|\; \mathbf{w} \in \mathbb{R}^d,\; b \in \mathbb{R} \right\}$$
- $\mathbf{w} \in \mathbb{R}^d$: Weight vector (parameters) of the model.
- $b \in \mathbb{R}$: Bias term (intercept).
By specifying this hypothesis class, we consider all possible affine functions parameterized by $\mathbf{w}$ and $b$.
The Linear Regression Model
The linear regression model predicts the output as:
$$\hat{y} = h(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$$
- $\mathbf{x} \in \mathbb{R}^d$: Input feature vector.
- $\mathbf{w} \in \mathbb{R}^d$: Weight vector.
- $b \in \mathbb{R}$: Bias term.
For example, with a one-dimensional input:
- Input: $x \in \mathbb{R}$
- Model: $\hat{y} = wx + b$
- Hypothesis Class: All lines (except vertical lines) in 2D space.
Parameters
- Weights ($\mathbf{w}$): Coefficients for each input feature.
- Bias ($b$): Allows the regression line (or hyperplane) to shift upward or downward.
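As a quick illustration of the one-dimensional model, the following sketch (NumPy, with made-up numbers) evaluates $\hat{y} = wx + b$ for two parameter choices; each $(w, b)$ pair corresponds to a different line in the hypothesis class.

```python
import numpy as np

# Illustrative 1D inputs (made up for this sketch)
x = np.array([0.0, 1.0, 2.0, 3.0])

# Each choice of (w, b) selects a different line from the hypothesis class
for w, b in [(1.5, 0.0), (0.5, 2.0)]:
    y_hat = w * x + b  # 1D linear regression model
    print(f"w={w}, b={b} -> predictions: {y_hat}")
```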
Linear Transformation
A linear transformation has no bias term:
$$\hat{y} = \mathbf{w}^\top \mathbf{x}$$
- Properties:
- Passes through the origin.
- Does not allow shifting.
Affine Transformation
An affine transformation includes a bias term:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
- Properties:
- Allows shifting of the function.
- More flexible in fitting data.
We can represent affine transformations as linear transformations in a higher-dimensional space by introducing an augmented feature vector.
Augmented Feature Vector
Define:
- Extended Input Vector: $\tilde{\mathbf{x}} = [1, x_1, x_2, \dots, x_d]^\top \in \mathbb{R}^{d+1}$
- Extended Weight Vector: $\tilde{\mathbf{w}} = [b, w_1, w_2, \dots, w_d]^\top \in \mathbb{R}^{d+1}$
Then, the model becomes:
$$\hat{y} = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}$$
This representation simplifies notation and allows us to treat the bias term as part of the weight vector.
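A minimal sketch of this augmentation in NumPy (the numbers are made up) shows that a single dot product with the extended vectors reproduces $\mathbf{w}^\top \mathbf{x} + b$:

```python
import numpy as np

x = np.array([0.5, -1.2, 2.0])   # original input, shape (d,)
w = np.array([0.3, 0.8, -0.5])   # weight vector, shape (d,)
b = 1.0                          # bias term

x_tilde = np.concatenate(([1.0], x))  # extended input   [1, x1, ..., xd]
w_tilde = np.concatenate(([b], w))    # extended weights [b, w1, ..., wd]

# Both expressions give the same prediction
assert np.isclose(w @ x + b, w_tilde @ x_tilde)
```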
Training Criteria for Linear Regression
The goal is to find the parameters $\mathbf{w}$ and $b$ that minimize a loss function measuring the difference between the predicted outputs and the true targets.
The least-squares cost function is commonly used:
$$L(\mathbf{w}, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$
- $y^{(i)}$: True outcome for the $i$-th example.
- $\hat{y}^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b$: Predicted output for the $i$-th example, with parameters $\mathbf{w}$ and $b$.
- $N$: Number of training examples.
This loss function defines what is known as the Ordinary Least Squares (OLS) regression model. In the next few chapters, we will see that this algorithm is a special case of a broader family of algorithms.
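As a sketch (NumPy, with illustrative names), this loss can be computed directly, one example at a time:

```python
import numpy as np

def mse_loss(w, b, X, y):
    """Mean squared error over N examples, computed example by example."""
    N = len(y)
    total = 0.0
    for i in range(N):
        y_hat_i = w @ X[i] + b           # prediction for the i-th example
        total += (y[i] - y_hat_i) ** 2   # squared residual
    return total / N
```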
Matrix Representation
Representing the linear regression model and loss function using matrices and vectors simplifies computations, especially for large datasets.
Design Matrix
The design matrix $X$ contains all input vectors stacked as rows. If including the bias term, we prepend a column of ones, so that $X \in \mathbb{R}^{N \times (d+1)}$:
$$X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$$
Weight Vector
The extended weight vector includes the bias term:
$$\mathbf{w} = \begin{bmatrix} b \\ w_1 \\ \vdots \\ w_d \end{bmatrix} \in \mathbb{R}^{d+1}$$
Predictions
Compute predictions for all training examples simultaneously:
$$\hat{\mathbf{y}} = X\mathbf{w}$$
- $\hat{\mathbf{y}} \in \mathbb{R}^N$: Vector of predicted outputs $\hat{y}^{(1)}, \dots, \hat{y}^{(N)}$.
Cost Function
Express the OLS loss in matrix form:
$$L(\mathbf{w}) = \frac{1}{N} \left\| \mathbf{y} - X\mathbf{w} \right\|^2$$
- $\mathbf{y} \in \mathbb{R}^N$: Vector of true targets $y^{(1)}, \dots, y^{(N)}$.
- $\|\cdot\|$: Euclidean norm (L2 norm).
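In matrix form the same loss becomes a one-liner. The sketch below (NumPy) assumes $X$ already contains the leading column of ones and $\mathbf{w}$ includes the bias; it is equivalent to the per-example loop shown earlier.

```python
import numpy as np

def mse_loss_matrix(w, X, y):
    """OLS loss L(w) = (1/N) * ||y - X w||^2, fully vectorized."""
    residual = y - X @ w
    return residual @ residual / len(y)
```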
Finding the Closed-Form Solution
To find the closed-form solution of the minimization problem, we explicitly take the derivative of $L(\mathbf{w})$ with respect to $\mathbf{w}$ and set it to zero:
$$\nabla_{\mathbf{w}} L(\mathbf{w}) = -\frac{2}{N} X^\top \left( \mathbf{y} - X\mathbf{w} \right)$$
Set the gradient equal to zero:
$$-\frac{2}{N} X^\top \left( \mathbf{y} - X\mathbf{w} \right) = 0$$
Multiply both sides by $-\frac{N}{2}$:
$$X^\top \left( \mathbf{y} - X\mathbf{w} \right) = 0$$
Rewriting it, we obtain the normal equations:
$$X^\top X \mathbf{w} = X^\top \mathbf{y}$$
With solution:
$$\mathbf{w}^* = \left( X^\top X \right)^{-1} X^\top \mathbf{y}$$
- $\left( X^\top X \right)^{-1}$: Inverse of the matrix $X^\top X$.
This formula provides the closed-form solution for the optimal weights in linear regression.
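A sketch of the closed-form fit in NumPy is shown below. Rather than forming the inverse explicitly, it solves the normal equations with np.linalg.solve, which is equivalent when $X^\top X$ is invertible and numerically more stable; the function and variable names are illustrative.

```python
import numpy as np

def fit_ols(X_raw, y):
    """Closed-form OLS: solve (X^T X) w = X^T y for the extended weight vector."""
    N = X_raw.shape[0]
    X = np.hstack([np.ones((N, 1)), X_raw])  # prepend a column of ones for the bias
    w = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations
    return w  # w[0] is the bias b, w[1:] are the feature weights
```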
Conditions for Invertibility
For the inverse to exist, the matrix $X^\top X$ must be invertible.
When Is $X^\top X$ Invertible?
- Full Rank: If $X$ has full column rank (i.e., rank equal to $d + 1$), then $X^\top X$ is invertible.
- Linearly Independent Columns: This requires that the columns of $X$ (features) are linearly independent.
Issues with Invertibility
- Duplicate Features: If two or more features are linearly dependent (e.g., one feature is a multiple of another), $X^\top X$ is not invertible.
- More Features Than Samples: If the number of features exceeds the number of samples ($d + 1 > N$), the matrix $X$ cannot have full column rank, and $X^\top X$ is thus not invertible.
Using the Moore-Penrose Pseudo-Inverse
When $X^\top X$ is not invertible, we can use the Moore-Penrose pseudo-inverse to find a solution.
The Moore-Penrose pseudo-inverse of a matrix $A$, denoted $A^{+}$, is a generalization of the inverse matrix.
The solution becomes:
$$\mathbf{w}^* = X^{+} \mathbf{y}$$
- $X^{+}$: Pseudo-inverse of $X$.
This method provides the least-norm solution: Among all possible solutions, it finds the one with the smallest Euclidean norm.
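In NumPy, a sketch of this route uses np.linalg.pinv, which computes the Moore-Penrose pseudo-inverse; np.linalg.lstsq returns the same minimum-norm least-squares solution.

```python
import numpy as np

def fit_ols_pinv(X, y):
    """Least-norm OLS solution via the Moore-Penrose pseudo-inverse of X."""
    return np.linalg.pinv(X) @ y

# Equivalent minimum-norm solution via least squares:
# w, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
```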
While the closed-form solution is elegant and provides an exact answer under certain conditions, it has limitations:
Computational Complexity
- Computing the inverse of the $(d+1) \times (d+1)$ matrix $X^\top X$ has a computational complexity of approximately $O(d^3)$.
- For high-dimensional data (large $d$), matrix inversion becomes computationally expensive.
Numerical Stability
- If $X^\top X$ is close to singular or ill-conditioned, numerical errors can significantly affect the solution.
- In finite-precision arithmetic, small errors can be magnified during inversion.
Overfitting with High-Dimensional Data
- When $d > N$, the model can perfectly fit the training data but generalize poorly to new data.
- Regularization techniques (e.g., Ridge Regression) add a penalty term to prevent overfitting but modify the closed-form solution (see the sketch below).
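As a hedged sketch (assuming the standard ridge formulation, which is not derived in this chapter), the penalty adds $\lambda I$ to $X^\top X$, which also makes the matrix invertible:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Ridge regression sketch: w = (X^T X + lam * I)^(-1) X^T y."""
    D = X.shape[1]
    # lam > 0 makes X^T X + lam*I invertible even when X^T X is singular
    # (in practice the bias column is often excluded from the penalty)
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```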
Non-Invertibility Due to Collinearity
- Presence of highly correlated features leads to non-invertibility; feature selection or dimensionality reduction techniques may be necessary.
Despite the limitations, closed-form solutions have significant advantages:
Simplicity and Efficiency (for Small $d$)
- Provides an explicit formula for the optimal weights.
- Eliminates the need for gradient descent or other optimization methods.
Exact Solution
- If computed accurately, yields the precise minimum of the loss function.
- Useful as a baseline to compare with other algorithms.