
Linear Regression


In this chapter you'll be introduced to:

  • Linear Regression: Understanding the fundamentals of linear regression in supervised learning.
  • Hypothesis Class for Linear Regression: Defining the set of functions used in linear regression models.
  • Linear Regression Model: Formulating the linear model with weights and bias.
  • Affine Transformations and Extended Features: Exploring the relationship between linear and affine transformations.
  • Importance of Linear Models: Discussing why linear models are foundational and their relevance to more complex models.
  • Training Criteria: Introducing the mean squared error loss function used for training linear regression models.
  • Matrix-Vector Representation: Representing linear regression models and loss functions using matrices and vectors for computational efficiency.
  • Future Directions: Preparing for discussions on convexity, optimization, and theoretical properties of linear regression.

Linear Regression

Linear Regression is one of the most basic algorithms in supervised learning, particularly in regression tasks where the target variable is continuous. It involves learning a linear relationship between the input features and the target output.

In a supervised learning setting, we have:

  • Training Data: A set of $M$ examples $\{ (\mathbf{x}^{(m)}, t^{(m)}) \}_{m=1}^M$, where:

    • $\mathbf{x}^{(m)} \in \mathbb{R}^d$ is the input vector for the $m$-th example.
    • $t^{(m)} \in \mathbb{R}$ is the target output for the $m$-th example.
  • Goal: Learn a function (hypothesis) $h$ that maps inputs to outputs:

    $$h: \mathbb{R}^d \rightarrow \mathbb{R}$$
  • Prediction: For a new input $\mathbf{x}^*$, predict the output $t^* = h(\mathbf{x}^*)$.
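As a concrete illustration, here is a minimal sketch of data with this structure. The feature dimension, number of examples, "ground-truth" parameters, and noise level are arbitrary choices made only for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

M, d = 100, 3                         # number of examples and feature dimension (arbitrary)
X = rng.normal(size=(M, d))           # inputs: row m is x^(m) in R^d
true_w = np.array([2.0, -1.0, 0.5])   # assumed "ground-truth" weights for the synthetic data
true_b = 3.0                          # assumed "ground-truth" bias
t = X @ true_w + true_b + 0.1 * rng.normal(size=M)  # continuous targets t^(m) in R
```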

Hypothesis Class for Linear Regression

The hypothesis class in linear regression consists of all functions that can be represented as a linear combination of the input features plus a bias term. Formally, the hypothesis class H\mathcal{H} is defined as:

$$\mathcal{H} = \{ h(\mathbf{x}) \mid h(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b, \; \mathbf{w} \in \mathbb{R}^d, \; b \in \mathbb{R} \}$$

  • $\mathbf{w}$: Weight vector (parameters) of the model.
  • $b$: Bias term (intercept).

By specifying this hypothesis class, we're considering all possible linear functions parameterized by $\mathbf{w}$ and $b$.

The Linear Regression Model

The linear regression model predicts the output $y$ (also denoted as $\hat{t}$) as:

$$y = \hat{t} = \mathbf{w}^\top \mathbf{x} + b$$

  • $\mathbf{x}$: Input feature vector.
  • $\mathbf{w}$: Weight vector.
  • $b$: Bias term.

Example: One-Dimensional Input ($d = 1$)

  • Input: $x \in \mathbb{R}$
  • Model: $y = wx + b$
  • Hypothesis Class: All lines (except vertical lines) in 2D space.
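As a minimal sketch, the prediction is a single dot product plus a bias. The parameter values and input below are placeholders chosen only to show the shapes involved:

```python
import numpy as np

def predict(x, w, b):
    """Linear regression prediction: y = w^T x + b."""
    return w @ x + b

w = np.array([2.0, -1.0, 0.5])   # placeholder weights, one per feature (d = 3)
b = 3.0                          # placeholder bias
x = np.array([1.0, 0.0, 2.0])    # a single input vector in R^d
print(predict(x, w, b))          # scalar prediction: 2*1 - 1*0 + 0.5*2 + 3 = 6.0
```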

Parameters

  • Weights ($\mathbf{w}$): Coefficients for each input feature.
  • Bias ($b$): Allows the regression line (or hyperplane) to shift upward or downward.
[Interactive plot: a line $y = wx + b$ whose weight $w$ (range $-4$ to $4$) and bias $b$ (range $0$ to $10$) can be adjusted with sliders. The weight and bias are constrained in the example for simplicity of the plot.]

Linear vs. Affine Transformations

Linear Transformation

A linear transformation has no bias term:

$$y = \mathbf{w}^\top \mathbf{x}$$
  • Properties:
    • Passes through the origin.
    • Does not allow shifting.

Affine Transformation

An affine transformation includes a bias term:

$$y = \mathbf{w}^\top \mathbf{x} + b$$
  • Properties:
    • Allows shifting of the function.
    • More flexible in fitting data.

We can represent affine transformations as linear transformations in a higher-dimensional space by introducing an augmented feature vector.

Augmented Feature Vector

Define:

  • Extended Input Vector: $\tilde{\mathbf{x}} = [1, x_1, x_2, \dots, x_d]^\top \in \mathbb{R}^{d+1}$
  • Extended Weight Vector: $\tilde{\mathbf{w}} = [b, w_1, w_2, \dots, w_d]^\top \in \mathbb{R}^{d+1}$

Then, the model becomes:

$$y = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}$$

This representation simplifies notation and allows us to treat the bias term as part of the weight vector.
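A minimal sketch of this trick in code, reusing the placeholder `w`, `b`, and `x` from the earlier example:

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])   # placeholder weights
b = 3.0                          # placeholder bias
x = np.array([1.0, 0.0, 2.0])    # placeholder input

# Augment: prepend a 1 to the input and the bias to the weights.
x_tilde = np.concatenate(([1.0], x))   # [1, x_1, ..., x_d]
w_tilde = np.concatenate(([b], w))     # [b, w_1, ..., w_d]

# The affine prediction becomes a single dot product.
assert np.isclose(w_tilde @ x_tilde, w @ x + b)
```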

Why Linear Models?

Linear models are straightforward and easy to interpret. Additionally, they provide a foundation for understanding more complex models.

Nonlinear Relationships

While many real-world relationships are nonlinear, linear models can approximate nonlinear functions through:

Nonlinear Feature Transformations

  • Polynomial Features: Include $x^2$, $x^3$, etc., as additional features (see the sketch after this list).
  • Feature Engineering: Manually create features that capture nonlinearities.
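A quick sketch of polynomial feature expansion for a one-dimensional input; ordinary linear regression can then be fit on the expanded features. The degree is an arbitrary choice for illustration:

```python
import numpy as np

def polynomial_features(x, degree=3):
    """Map a 1-D array of inputs to the columns [x, x^2, ..., x^degree]."""
    return np.column_stack([x ** k for k in range(1, degree + 1)])

x = np.linspace(-1.0, 1.0, 5)    # five sample inputs
Phi = polynomial_features(x)     # shape (5, 3): columns are x, x^2, x^3
print(Phi.shape)
```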

Kernel Methods

  • Nonlinear Kernels: Use kernel functions to map inputs into a higher-dimensional space where linear regression can be applied.

Neural Networks

  • Nonlinear Activation Functions: Introduce nonlinearity through activation functions in neural networks.
  • Deep Learning: Stack multiple layers to capture complex patterns.

Understanding linear regression is essential before moving on to more advanced models like neural networks and support vector machines.

Training Criteria for Linear Regression

The goal is to find the parameters w\mathbf{w} and bb that minimize the difference between the predicted outputs and the true targets.

Mean Squared Error (MSE) Loss

The MSE loss function is commonly used:

$$J(\mathbf{w}, b) = \frac{1}{2M} \sum_{m=1}^M \left( y^{(m)} - t^{(m)} \right)^2$$

  • $y^{(m)}$: Predicted output for the $m$-th example.
  • $t^{(m)}$: True target for the $m$-th example.
  • $M$: Number of training examples.
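A minimal sketch of this loss in code, keeping the $\frac{1}{2}$ factor so it matches the formula above:

```python
import numpy as np

def mse_loss(y, t):
    """Mean squared error with the 1/(2M) scaling used above."""
    return np.mean((y - t) ** 2) / 2.0

# Example usage with placeholder predictions and targets.
y = np.array([1.0, 2.0, 3.0])
t = np.array([1.5, 2.0, 2.0])
print(mse_loss(y, t))
```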

Loss for a Single Example

For a single training example:

$$J^{(m)}(\mathbf{w}, b) = \frac{1}{2} \left( y^{(m)} - t^{(m)} \right)^2$$

Why Use the Mean Squared Error?

  • Penalizes Larger Errors More Severely: Squaring the errors emphasizes larger discrepancies.
  • Mathematical Convenience: Differentiable and leads to closed-form solutions.
  • Theoretical Justification: Relates to the assumption of normally distributed errors.

Visualizing the Loss Function

Consider a one-dimensional input:

  • Data Points: Plotted on a scatter plot with input xx on the horizontal axis and target tt on the vertical axis.
  • Regression Line: Represents the model's predictions.
  • Residuals: Vertical distances between data points and the regression line.

The MSE loss calculates the average of the squared residuals.
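A rough sketch of such a plot using matplotlib; the data and the "fitted" parameters below are placeholders chosen only for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
t = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.shape)  # noisy targets
w, b = 2.0, 1.0                  # assumed fitted parameters (placeholders)
y = w * x + b                    # predictions along the regression line

plt.scatter(x, t, label="data points")
plt.plot(x, y, color="red", label="regression line")
plt.vlines(x, y, t, linestyles="dotted", label="residuals")  # vertical residuals
plt.xlabel("x")
plt.ylabel("t")
plt.legend()
plt.show()
```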

Matrix-Vector Representation

Representing the linear regression model and loss function using matrices and vectors simplifies computations, especially for large datasets.

Design Matrix

  • Definition: The design matrix $\mathbf{X}$ contains all input vectors stacked as rows.
  • Augmented with Bias Term: If including the bias term, prepend a column of ones to $\mathbf{X}$.

Formally

$$\mathbf{X} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \dots & x_d^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \dots & x_d^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(M)} & x_2^{(M)} & \dots & x_d^{(M)} \end{bmatrix} \in \mathbb{R}^{M \times (d+1)}$$

Weight Vector

$$\tilde{\mathbf{w}} = \begin{bmatrix} b \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix} \in \mathbb{R}^{d+1}$$

Predictions

Compute predictions for all training examples simultaneously:

$$\mathbf{y} = \mathbf{X} \tilde{\mathbf{w}}$$

  • $\mathbf{y}$: Vector of predicted outputs, $\mathbf{y} \in \mathbb{R}^M$.

Loss Function

Express the MSE loss in matrix form:

$$J(\tilde{\mathbf{w}}) = \frac{1}{2M} \| \mathbf{X} \tilde{\mathbf{w}} - \mathbf{t} \|^2$$

  • $\mathbf{t}$: Vector of true targets, $\mathbf{t} \in \mathbb{R}^M$.
  • $\| \cdot \|$: Euclidean norm (L2 norm).
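Putting the pieces together, a minimal vectorized sketch with placeholder data (the raw inputs and targets here are random and stand in for a real training set):

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 100, 3
X_raw = rng.normal(size=(M, d))          # raw inputs, one example per row
t = rng.normal(size=M)                   # placeholder targets

X = np.hstack([np.ones((M, 1)), X_raw])  # design matrix with a leading column of ones
w_tilde = np.zeros(d + 1)                # extended weights [b, w_1, ..., w_d]

y = X @ w_tilde                          # predictions for all M examples at once
loss = np.sum((y - t) ** 2) / (2 * M)    # MSE loss in matrix-vector form
print(loss)
```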

Advantages

  • Computational Efficiency: Enables the use of vectorized operations.
  • Simplifies Derivatives: Easier to compute gradients for optimization.

Recap

In this chapter, we've covered:

  • Linear Regression Fundamentals: Understanding the linear regression model and its components.
  • Hypothesis Class Definition: Specifying the set of linear functions used in linear regression.
  • Affine Transformations: Incorporating the bias term and representing it using extended feature vectors.
  • Importance of Linear Models: Recognizing the foundational role of linear models in machine learning.
  • Training Criteria: Introducing the mean squared error loss function and its significance.
  • Matrix-Vector Representation: Leveraging matrices and vectors for efficient computation and notation.