Probabilistic Interpretation
In this chapter, you'll be introduced to:
- Probability in Machine Learning: Understanding the role of probability in modeling and interpreting machine learning algorithms.
- Recap of Linear Regression: Revisiting the linear regression model and the mean squared error (MSE) loss function.
- Basics of Probability Theory: Reviewing fundamental concepts, including probability axioms, joint, marginal, and conditional probabilities, and Bayes' theorem.
- Frequentist vs. Bayesian Interpretations: Exploring different philosophical approaches to probability.
- Gaussian Distribution: Understanding the properties of the normal distribution in both univariate and multivariate cases.
- Maximum Likelihood Estimation (MLE): Learning how to estimate model parameters by maximizing the likelihood of observed data.
- Probabilistic Derivation of MSE Loss: Showing how the MSE loss function arises from the MLE under Gaussian noise assumptions.
In previous chapters, we explored the linear regression model and methods for minimizing the mean squared error (MSE) loss function, including closed-form solutions and gradient-based optimization techniques. While we have used the MSE loss due to its intuitive appeal and mathematical convenience, we have yet to delve into the probabilistic foundations that justify its use.
In this chapter, we will provide a probabilistic interpretation of the linear regression model and the MSE loss function. By framing linear regression within a probabilistic context, we gain deeper insights into the assumptions underlying the model and establish a theoretical foundation for the MSE loss. This approach also introduces essential concepts in probability theory that are widely used in machine learning.
The Role of Probability in Machine Learning
Probability theory provides a formal framework for modeling uncertainty and making inferences based on data. In machine learning, probabilistic models allow us to:
- Quantify Uncertainty: Assess the confidence of predictions and model parameters.
- Incorporate Prior Knowledge: Use prior distributions to incorporate existing beliefs or information.
- Make Probabilistic Predictions: Output probabilities rather than deterministic predictions.
- Facilitate Learning: Derive training objectives based on probabilistic principles, such as maximizing likelihood.
While probability is a powerful tool in machine learning, it's important to note that not all machine learning methods require probabilistic interpretations. However, understanding probability enhances our ability to design, analyze, and justify models and algorithms.
Recap of Linear Regression and MSE Loss
Linear Regression Model
In linear regression, we model the relationship between input features and a continuous target variable using a linear function:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
- $\mathbf{w}$: Weight vector (parameters of the model).
- $b$: Bias term (intercept).
- $\hat{y}$: Predicted output.
For convenience, we often include the bias term within the weight vector by augmenting the input vector with a 1:
$$\hat{y} = \mathbf{w}^\top \mathbf{x}, \quad \text{where } \mathbf{x} = [1, x_1, \dots, x_d]^\top \text{ and } \mathbf{w} = [b, w_1, \dots, w_d]^\top$$
Mean Squared Error Loss Function
The MSE loss measures the average squared difference between the predicted outputs and the true target values:
$$L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2$$
- $N$: Number of training samples.
- $\hat{y}_i = \mathbf{w}^\top \mathbf{x}_i$: Model prediction for the $i$-th sample.
- $y_i$: True target value for the $i$-th sample.
We aim to find the weight vector $\mathbf{w}$ that minimizes $L(\mathbf{w})$.
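As a quick concrete check, here is a minimal NumPy sketch (the toy data and variable names are our own, purely illustrative) that augments the inputs with a bias column, computes predictions, and evaluates the MSE loss:

```python
import numpy as np

# Toy data: N = 4 samples with 2 features each (illustrative values only).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([3.1, 2.6, 4.4, 7.2])

# Augment each input with a leading 1 so the bias is absorbed into w.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

w = np.array([0.5, 1.0, 0.2])       # [bias, w1, w2], an arbitrary guess

y_hat = X_aug @ w                   # predictions for all samples
mse = np.mean((y_hat - y) ** 2)     # average squared error
print(mse)
```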
Basics of Probability Theory
To build a probabilistic interpretation of linear regression, we first revisit some fundamental concepts in probability theory.
Probability Axioms
Probability is a function $P$ that assigns a real number between 0 and 1 to events in a sample space $\Omega$:
- Non-negativity: $P(A) \geq 0$ for every event $A$.
- Normalization: $P(\Omega) = 1$.
- Additivity (for mutually exclusive events): $P(A \cup B) = P(A) + P(B)$ whenever $A \cap B = \emptyset$.
Joint, Marginal, and Conditional Probabilities
- Joint Probability: $P(A, B)$ represents the probability of both events $A$ and $B$ occurring.
- Marginal Probability: $P(A) = \sum_{B} P(A, B)$ is the probability of event $A$ occurring, irrespective of $B$.
- Conditional Probability: $P(A \mid B) = \dfrac{P(A, B)}{P(B)}$ is the probability of $A$ occurring given that $B$ has occurred.
Bayes' Theorem
Bayes' theorem relates conditional probabilities in reverse order:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
This theorem allows us to update our beliefs about $A$ after observing $B$.
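As a worked (hypothetical) example, suppose a condition has a 1% base rate, a test detects it 95% of the time, and the test produces false positives 10% of the time; the following sketch applies Bayes' theorem to find the probability of the condition given a positive test:

```python
# Hypothetical numbers chosen only to illustrate Bayes' theorem.
p_A = 0.01             # prior P(A): base rate of the condition
p_B_given_A = 0.95     # likelihood P(B | A): positive test given the condition
p_B_given_notA = 0.10  # P(B | not A): false-positive rate

# Marginal probability of a positive test, P(B), via the law of total probability.
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Posterior P(A | B) from Bayes' theorem.
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))  # roughly 0.088
```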
Independence
Two events $A$ and $B$ are independent if:
$$P(A, B) = P(A)\, P(B)$$
This implies that the occurrence of $B$ provides no information about $A$, and vice versa.
Expectation
The expected value (or mean) of a random variable $X$ with probability distribution $p(x)$ is:
$$\mathbb{E}[X] = \sum_{x} x\, p(x) \quad \text{(discrete)} \qquad \mathbb{E}[X] = \int x\, p(x)\, dx \quad \text{(continuous)}$$
The expectation operator computes the average value of $X$ weighted by its probability distribution.
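For instance, the expectation of a fair six-sided die follows directly from the definition; a minimal sketch:

```python
import numpy as np

values = np.arange(1, 7)     # outcomes of a fair die
probs = np.full(6, 1 / 6)    # uniform probabilities

expectation = np.sum(values * probs)  # E[X] = sum_x x * p(x)
print(expectation)                    # 3.5
```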
Frequentist vs. Bayesian Interpretations
Probability can be interpreted in different ways, leading to two primary schools of thought: frequentist and Bayesian.
Frequentist Interpretation
- Definition: Probability represents the long-run frequency of an event occurring in repeated, identical trials.
- Key Concept: Probabilities are objective properties of the physical world.
- Limitations:
- Not applicable to non-repeatable events (e.g., probability of rain tomorrow).
- Cannot assign probabilities to hypotheses or parameters.
Bayesian Interpretation
- Definition: Probability quantifies a degree of belief or uncertainty about an event or proposition.
- Key Concept: Probabilities are subjective and can be updated with new evidence using Bayes' theorem.
- Advantages:
- Applicable to unique events and hypotheses.
- Allows incorporating prior knowledge and updating beliefs.
Both interpretations follow the same mathematical rules but differ philosophically in the meaning assigned to probability.
Gaussian Distribution
The Gaussian (normal) distribution is a continuous probability distribution characterized by its mean and variance.
Univariate Gaussian Distribution
A random variable $X$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$, written $X \sim \mathcal{N}(\mu, \sigma^2)$.
The probability density function (PDF) is:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
- Properties:
- Symmetric about the mean $\mu$.
- The spread is determined by the variance $\sigma^2$.
- The total area under the curve is 1.
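The univariate density is straightforward to implement; the following sketch (our own helper function, not a library routine) evaluates the formula and numerically checks that the total area is close to 1:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x; mu, sigma2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Sanity check: the density should integrate to (approximately) 1.
x = np.linspace(-10, 10, 10_001)
area = np.trapz(gaussian_pdf(x, mu=0.0, sigma2=2.0), x)
print(area)  # close to 1.0
```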
Multivariate Gaussian Distribution
A random vector $\mathbf{x} \in \mathbb{R}^d$ follows a multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, written $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
The PDF is:
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$
- $\boldsymbol{\mu}$: Mean vector ($d$-dimensional).
- $\boldsymbol{\Sigma}$: Covariance matrix ($d \times d$), must be positive definite.
- Properties:
- Generalizes the univariate normal distribution to higher dimensions.
- The shape of the density is determined by $\boldsymbol{\Sigma}$.
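Similarly, the multivariate density can be coded directly from the formula; the sketch below (an illustrative helper with arbitrary example values) evaluates it for a 2-dimensional case:

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x; mu, Sigma) for a single point x."""
    d = mu.shape[0]
    diff = x - mu
    inv = np.linalg.inv(Sigma)
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ inv @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])  # must be positive definite
print(multivariate_gaussian_pdf(np.array([0.5, -1.0]), mu, Sigma))
```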
Probabilistic Interpretation of Linear Regression
We now incorporate probability into the linear regression model by treating the target variable as a random variable influenced by Gaussian noise.
Assumptions
- Data Generation Process: $y_i = \mathbf{w}^\top \mathbf{x}_i + \epsilon_i$
- $y_i$: Target variable for the $i$-th sample.
- $\mathbf{w}$: Weight vector (including bias).
- $\mathbf{x}_i$: Augmented input vector.
- $\epsilon_i$: Noise term.
- Noise Term: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
- Zero-mean Gaussian noise with variance $\sigma^2$.
- Assumed to be independent and identically distributed (i.i.d.) across samples.
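The sketch below makes this data-generating process concrete by simulating targets from an arbitrary "true" weight vector plus i.i.d. Gaussian noise (all parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 200, 2
w_true = np.array([1.5, -2.0, 0.7])  # [bias, w1, w2], arbitrary "true" parameters
sigma = 0.5                          # noise standard deviation

X = rng.normal(size=(N, d))
X_aug = np.hstack([np.ones((N, 1)), X])   # prepend 1 for the bias
noise = rng.normal(0.0, sigma, size=N)    # eps_i ~ N(0, sigma^2), i.i.d.
y = X_aug @ w_true + noise                # y_i = w^T x_i + eps_i
```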
Likelihood Function
The likelihood function quantifies the probability of observing the data given the model parameters:
$$L(\mathbf{w}, \sigma^2) = p(y_1, \dots, y_N \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i, \mathbf{w}, \sigma^2),$$
since each $y_i$ is conditionally independent of the others given $\mathbf{x}_i$, $\mathbf{w}$, and $\sigma^2$.
Using the Gaussian assumption:
$$L(\mathbf{w}, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2} \right)$$
Log-Likelihood Function
To simplify computations, we take the logarithm of the likelihood function:
$$\ell(\mathbf{w}, \sigma^2) = \log L(\mathbf{w}, \sigma^2) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - \mathbf{w}^\top \mathbf{x}_i \right)^2$$
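A minimal sketch of this log-likelihood as a function of the parameters, evaluated on a small synthetic dataset (names and values are our own), is shown below:

```python
import numpy as np

def gaussian_log_likelihood(w, sigma2, X_aug, y):
    """Log-likelihood of y under y_i ~ N(w^T x_i, sigma2)."""
    residuals = y - X_aug @ w
    N = len(y)
    return -0.5 * N * np.log(2 * np.pi * sigma2) - np.sum(residuals ** 2) / (2 * sigma2)

# Tiny synthetic dataset, just to exercise the function.
rng = np.random.default_rng(0)
X_aug = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
w_true = np.array([1.0, 2.0, -0.5])
y = X_aug @ w_true + rng.normal(0.0, 0.3, size=50)

print(gaussian_log_likelihood(w_true, 0.3 ** 2, X_aug, y))
```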
Maximum Likelihood Estimation (MLE)
We aim to find the parameters $\mathbf{w}$ and $\sigma^2$ that maximize the log-likelihood function:
$$\hat{\mathbf{w}}, \hat{\sigma}^2 = \arg\max_{\mathbf{w},\, \sigma^2} \ell(\mathbf{w}, \sigma^2)$$
Estimating $\mathbf{w}$
Since $\sigma^2$ appears in both terms of the log-likelihood while $\mathbf{w}$ appears only in the second, we can simplify the problem by focusing on $\mathbf{w}$ first.
- Objective Function (Negative Log-Likelihood): $J(\mathbf{w}) = \sum_{i=1}^{N} \left( y_i - \mathbf{w}^\top \mathbf{x}_i \right)^2$
- Minimizing $J(\mathbf{w})$ is equivalent to maximizing the log-likelihood (ignoring constants and $\sigma^2$).
Connection to MSE Loss
The objective function derived from the negative log-likelihood is proportional to the MSE loss function used in linear regression:
$$J(\mathbf{w}) = N \cdot L(\mathbf{w})$$
- Conclusion: Maximizing the likelihood under the Gaussian noise assumption leads to minimizing the MSE loss.
Estimating $\sigma^2$
After finding $\hat{\mathbf{w}}$, we can estimate $\sigma^2$ by setting the derivative of the log-likelihood with respect to $\sigma^2$ to zero:
$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{\mathbf{w}}^\top \mathbf{x}_i \right)^2$$
This is the sample variance of the residuals.
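Putting the pieces together, under these assumptions maximum likelihood estimation reduces to an ordinary least-squares fit followed by the residual-variance formula above; a minimal sketch with synthetic data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data generated under the assumed model.
N = 500
X_aug = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
w_true, sigma_true = np.array([1.5, -2.0, 0.7]), 0.5
y = X_aug @ w_true + rng.normal(0.0, sigma_true, size=N)

# MLE of w under Gaussian noise = least-squares solution (minimizes the MSE).
w_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

# MLE of sigma^2 = mean squared residual.
sigma2_hat = np.mean((y - X_aug @ w_hat) ** 2)

print(w_hat)       # should be close to w_true
print(sigma2_hat)  # should be close to sigma_true ** 2 = 0.25
```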
Implications and Insights
- Justification of MSE Loss: The MSE loss function is not just a convenient choice; it arises naturally from probabilistic assumptions about the data generation process.
- Assumption of Gaussian Noise: The derivation relies on the assumption that the noise in the target variable is Gaussian and homoscedastic (constant variance).
- Interpretation of Linear Regression: Linear regression can be viewed as finding the maximum likelihood estimate of the parameters under the Gaussian noise model.
- Extension to Other Distributions: If the noise follows a different distribution, the corresponding loss function would change (e.g., Laplace noise leads to the mean absolute error loss).
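To make the last point concrete: if the noise is instead Laplace-distributed, $\epsilon_i \sim \mathrm{Laplace}(0, b)$ with density $p(\epsilon) = \frac{1}{2b} \exp\left( -\frac{|\epsilon|}{b} \right)$, the negative log-likelihood becomes
$$-\log L(\mathbf{w}, b) = N \log(2b) + \frac{1}{b} \sum_{i=1}^{N} \left| y_i - \mathbf{w}^\top \mathbf{x}_i \right|,$$
so maximizing the likelihood is equivalent to minimizing the sum (or mean) of absolute errors.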