Probabilistic Interpretation
In this chapter, you'll be introduced to:
- Probability in Machine Learning: Understanding the role of probability in modeling and interpreting machine learning algorithms.
- Recap of Linear Regression: Revisiting the linear regression model and the mean squared error (MSE) loss function.
- Basics of Probability Theory: Reviewing fundamental concepts, including probability axioms, joint, marginal, and conditional probabilities, and Bayes' theorem.
- Frequentist vs. Bayesian Interpretations: Exploring different philosophical approaches to probability.
- Gaussian Distribution: Understanding the properties of the normal distribution in both univariate and multivariate cases.
- Maximum Likelihood Estimation (MLE): Learning how to estimate model parameters by maximizing the likelihood of observed data.
- Probabilistic Derivation of MSE Loss: Showing how the MSE loss function arises from the MLE under Gaussian noise assumptions.
In previous chapters, we explored the linear regression model and methods for minimizing the mean squared error (MSE) loss function, including closed-form solutions and gradient-based optimization techniques. While we have used the MSE loss due to its intuitive appeal and mathematical convenience, we have yet to delve into the probabilistic foundations that justify its use.
In this chapter, we will provide a probabilistic interpretation of the linear regression model and the MSE loss function. By framing linear regression within a probabilistic context, we gain deeper insights into the assumptions underlying the model and establish a theoretical foundation for the MSE loss. This approach also introduces essential concepts in probability theory that are widely used in machine learning.
The Role of Probability in Machine Learning
Probability theory provides a formal framework for modeling uncertainty and making inferences based on data. In machine learning, probabilistic models allow us to:
- Quantify Uncertainty: Assess the confidence of predictions and model parameters.
- Incorporate Prior Knowledge: Use prior distributions to incorporate existing beliefs or information.
- Make Probabilistic Predictions: Output probabilities rather than deterministic predictions.
- Facilitate Learning: Derive training objectives based on probabilistic principles, such as maximizing likelihood.
While probability is a powerful tool in machine learning, it's important to note that not all machine learning methods require probabilistic interpretations. However, understanding probability enhances our ability to design, analyze, and justify models and algorithms.
Recap of Linear Regression and MSE Loss
Linear Regression Model
In linear regression, we model the relationship between input features and a continuous target variable using a linear function:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$

where $\mathbf{x}$ is the input feature vector and:
- $\mathbf{w}$: Weight vector (parameters of the model).
- $b$: Bias term (intercept).
- $\hat{y}$: Predicted output.

For convenience, we often include the bias term within the weight vector by augmenting the input vector with a 1:

$$\mathbf{x} \leftarrow [1, x_1, \ldots, x_d]^\top, \qquad \mathbf{w} \leftarrow [b, w_1, \ldots, w_d]^\top, \qquad \hat{y} = \mathbf{w}^\top \mathbf{x}$$
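As a quick sanity check, here is a minimal NumPy sketch (all feature values and weights are made up for illustration) showing that the explicit-bias form and the augmented form give the same prediction:

```python
import numpy as np

# Illustrative input features, weights, and bias (not from the text).
x = np.array([2.0, -1.0, 0.5])   # input features
w = np.array([0.3, -0.7, 1.2])   # weight vector
b = 0.5                          # bias term

y_hat = w @ x + b                # explicit-bias form

# Augmented form: prepend 1 to x and b to w, then take one dot product.
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([b], w))
y_hat_aug = w_aug @ x_aug

assert np.isclose(y_hat, y_hat_aug)
print(y_hat)  # 2.4
```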
Mean Squared Error Loss Function
The MSE loss measures the average squared difference between the predicted outputs and the true target values:

$$L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2$$

- $N$: Number of training samples.
- $\hat{y}_i$: Model prediction for the $i$-th sample.
- $y_i$: True target value for the $i$-th sample.

We aim to find the weight vector $\mathbf{w}$ that minimizes $L(\mathbf{w})$.
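The loss is straightforward to compute directly from this definition. Below is a minimal sketch using made-up predictions and targets:

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error: average squared difference between
    predictions and true targets."""
    return np.mean((y_pred - y_true) ** 2)

# Illustrative values.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(mse(y_pred, y_true))  # ~0.02
```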
Basics of Probability Theory
To build a probabilistic interpretation of linear regression, we first revisit some fundamental concepts in probability theory.
Probability Axioms
Probability is a function $P$ that assigns a real number between 0 and 1 to events in a sample space $\Omega$:

- Non-negativity: $P(A) \geq 0$ for every event $A$.
- Normalization: $P(\Omega) = 1$.
- Additivity (for mutually exclusive events): $P(A \cup B) = P(A) + P(B)$ whenever $A \cap B = \emptyset$.
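These axioms are easy to verify numerically for a simple discrete distribution; the sketch below uses a fair six-sided die as an illustration:

```python
import numpy as np

# Probabilities for a fair six-sided die.
P = np.full(6, 1 / 6)

assert np.all(P >= 0)            # non-negativity
assert np.isclose(P.sum(), 1.0)  # normalization: P(Omega) = 1

# Additivity for mutually exclusive events: P({1} or {2}) = P(1) + P(2).
P_1_or_2 = P[0] + P[1]
print(P_1_or_2)  # ~0.333
```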
Joint, Marginal, and Conditional Probabilities
- Joint Probability: $P(A, B)$ represents the probability of both events $A$ and $B$ occurring.
- Marginal Probability: $P(A) = \sum_{B} P(A, B)$ is the probability of event $A$ occurring, irrespective of $B$.
- Conditional Probability: $P(A \mid B) = \dfrac{P(A, B)}{P(B)}$ is the probability of $A$ occurring given that $B$ has occurred.
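To make these definitions concrete, the following sketch builds a small, made-up joint probability table for two binary events and derives the marginals and a conditional from it:

```python
import numpy as np

# A toy joint distribution P(A, B) over two binary events,
# stored as a 2x2 table (rows: A=0/1, columns: B=0/1).
# The numbers are illustrative and sum to 1.
P_AB = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

P_A = P_AB.sum(axis=1)              # marginal P(A): sum over B
P_B = P_AB.sum(axis=0)              # marginal P(B): sum over A
P_A_given_B1 = P_AB[:, 1] / P_B[1]  # conditional P(A | B=1)

print(P_A)           # [0.4 0.6]
print(P_B)           # [0.5 0.5]
print(P_A_given_B1)  # [0.2 0.8]
```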
Bayes' Theorem
Bayes' theorem relates conditional probabilities in reverse order:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

This theorem allows us to update our beliefs about $A$ after observing $B$.
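A classic illustration is a diagnostic test. The sketch below uses entirely hypothetical numbers for prevalence and test accuracy to compute the posterior probability of disease given a positive result:

```python
# Bayes' theorem on a hypothetical diagnostic test
# (all probabilities below are made-up numbers).
P_disease = 0.01              # prior P(A)
P_pos_given_disease = 0.95    # likelihood P(B | A)
P_pos_given_healthy = 0.05    # false positive rate P(B | not A)

# Marginal P(B) via the law of total probability.
P_pos = (P_pos_given_disease * P_disease
         + P_pos_given_healthy * (1 - P_disease))

# Posterior P(A | B) = P(B | A) P(A) / P(B).
P_disease_given_pos = P_pos_given_disease * P_disease / P_pos
print(P_disease_given_pos)  # ~0.161
```

Note how the posterior (about 16%) is far below the test's 95% accuracy: the low prior dominates, which is exactly the kind of belief update Bayes' theorem formalizes.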
Independence
Two events $A$ and $B$ are independent if:

$$P(A, B) = P(A)\, P(B)$$

This implies that the occurrence of $B$ provides no information about $A$, and vice versa.
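The sketch below constructs a toy joint table whose entries factor into the product of the marginals, confirming independence numerically (the numbers are illustrative):

```python
import numpy as np

# A joint table where A and B are independent:
# each entry equals the product of its marginals.
P_AB = np.array([[0.12, 0.28],
                 [0.18, 0.42]])
P_A = P_AB.sum(axis=1)   # [0.4, 0.6]
P_B = P_AB.sum(axis=0)   # [0.3, 0.7]
assert np.allclose(P_AB, np.outer(P_A, P_B))
```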
Expectation
The expected value (or mean) of a random variable $X$ with probability distribution $P(X)$ is:

$$\mathbb{E}[X] = \sum_{x} x\, P(X = x)$$

with the sum replaced by an integral, $\mathbb{E}[X] = \int x\, p(x)\, dx$, for continuous random variables. The expectation operator computes the average value of $X$ weighted by its probability distribution.
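For example, the expected value of a fair six-sided die follows directly from the definition:

```python
import numpy as np

# Expected value of a fair six-sided die: sum of x * P(X = x).
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
expectation = np.sum(values * probs)
print(expectation)  # 3.5
```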
Frequentist vs. Bayesian Interpretations
Probability can be interpreted in different ways, leading to two primary schools of thought: frequentist and Bayesian.
Frequentist Interpretation
- Definition: Probability represents the long-run frequency of an event occurring in repeated, identical trials.
- Key Concept: Probabilities are objective properties of the physical world.
- Limitations:
- Not applicable to non-repeatable events (e.g., probability of rain tomorrow).
- Cannot assign probabilities to hypotheses or parameters.
Bayesian Interpretation
- Definition: Probability quantifies a degree of belief or uncertainty about an event or proposition.
- Key Concept: Probabilities are subjective and can be updated with new evidence using Bayes' theorem.
- Advantages:
- Applicable to unique events and hypotheses.
- Allows incorporating prior knowledge and updating beliefs.
Both interpretations follow the same mathematical rules but differ philosophically in the meaning assigned to probability.
Gaussian Distribution
The Gaussian (normal) distribution is a continuous probability distribution characterized by its mean and variance.
Univariate Gaussian Distribution
A random variable $X$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$, written $X \sim \mathcal{N}(\mu, \sigma^2)$.

The probability density function (PDF) is:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Properties:
- Symmetric about the mean $\mu$.
- The spread is determined by the variance $\sigma^2$.
- The total area under the curve is 1.
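The PDF takes only a few lines to implement; the sketch below also checks numerically that the density integrates to (approximately) 1:

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Univariate Gaussian density N(mu, sigma2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Numerical check that the total area under the curve is ~1,
# using a simple Riemann sum over a wide grid.
grid = np.linspace(-10.0, 10.0, 20001)
area = np.sum(gaussian_pdf(grid)) * (grid[1] - grid[0])
print(area)  # ~1.0
```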
Multivariate Gaussian Distribution
A random vector $\mathbf{x} \in \mathbb{R}^d$ follows a multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, written $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

The PDF is:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
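A direct implementation of this formula is sketched below; the mean, covariance, and evaluation point are illustrative, and in practice one would typically call a library routine such as scipy.stats.multivariate_normal instead:

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(mu, Sigma) evaluated at x."""
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve
    # rather than an explicit matrix inverse.
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / norm_const

# Illustrative 2-D example with a correlated covariance matrix.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
x = np.array([1.0, -1.0])
print(multivariate_gaussian_pdf(x, mu, Sigma))
# Cross-check against scipy, if available:
# from scipy.stats import multivariate_normal
# print(multivariate_normal(mu, Sigma).pdf(x))
```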