
Probabilistic Interpretation

info

In this chapter, you'll be introduced to:

  • Probability in Machine Learning: Understanding the role of probability in modeling and interpreting machine learning algorithms.
  • Recap of Linear Regression: Revisiting the linear regression model and the mean squared error (MSE) loss function.
  • Basics of Probability Theory: Reviewing fundamental concepts, including probability axioms, joint, marginal, and conditional probabilities, and Bayes' theorem.
  • Frequentist vs. Bayesian Interpretations: Exploring different philosophical approaches to probability.
  • Gaussian Distribution: Understanding the properties of the normal distribution in both univariate and multivariate cases.
  • Maximum Likelihood Estimation (MLE): Learning how to estimate model parameters by maximizing the likelihood of observed data.
  • Probabilistic Derivation of MSE Loss: Showing how the MSE loss function arises from the MLE under Gaussian noise assumptions.

In previous chapters, we explored the linear regression model and methods for minimizing the mean squared error (MSE) loss function, including closed-form solutions and gradient-based optimization techniques. While we have used the MSE loss due to its intuitive appeal and mathematical convenience, we have yet to delve into the probabilistic foundations that justify its use.

In this chapter, we will provide a probabilistic interpretation of the linear regression model and the MSE loss function. By framing linear regression within a probabilistic context, we gain deeper insights into the assumptions underlying the model and establish a theoretical foundation for the MSE loss. This approach also introduces essential concepts in probability theory that are widely used in machine learning.

The Role of Probability in Machine Learning

Probability theory provides a formal framework for modeling uncertainty and making inferences based on data. In machine learning, probabilistic models allow us to:

  • Quantify Uncertainty: Assess the confidence of predictions and model parameters.
  • Incorporate Prior Knowledge: Use prior distributions to incorporate existing beliefs or information.
  • Make Probabilistic Predictions: Output probabilities rather than deterministic predictions.
  • Facilitate Learning: Derive training objectives based on probabilistic principles, such as maximizing likelihood.

While probability is a powerful tool in machine learning, it's important to note that not all machine learning methods require probabilistic interpretations. However, understanding probability enhances our ability to design, analyze, and justify models and algorithms.

Recap of Linear Regression and MSE Loss

Linear Regression Model

In linear regression, we model the relationship between input features $\mathbf{x} \in \mathbb{R}^d$ and a continuous target variable $t \in \mathbb{R}$ using a linear function:

$$y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$$

  • $\mathbf{w}$: Weight vector (parameters of the model).
  • $b$: Bias term (intercept).
  • $y(\mathbf{x})$: Predicted output.

For convenience, we often include the bias term within the weight vector by augmenting the input vector with a 1:

$$\tilde{\mathbf{x}} = [1, x_1, x_2, \dots, x_d]^\top, \qquad \tilde{\mathbf{w}} = [b, w_1, w_2, \dots, w_d]^\top$$
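
To make the augmentation concrete, here is a minimal NumPy sketch (the array values and names such as `X_aug` are purely illustrative):

```python
import numpy as np

# Toy design matrix: M = 3 samples, d = 2 features (illustrative values).
X = np.array([[1.2, 0.7],
              [0.3, 2.1],
              [1.8, 0.5]])

# Prepend a column of ones so the bias b becomes the first entry of the weight vector.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (M, d + 1)

# A weight vector w_tilde = [b, w1, w2] now yields all predictions in one product.
w_tilde = np.array([0.5, 1.0, -2.0])
y = X_aug @ w_tilde                                # shape (M,)
```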

Mean Squared Error Loss Function

The MSE loss measures the average squared difference between the predicted outputs and the true target values:

$$J(\tilde{\mathbf{w}}) = \frac{1}{2M} \sum_{m=1}^{M} \left( y^{(m)} - t^{(m)} \right)^2$$

  • $M$: Number of training samples.
  • $y^{(m)}$: Model prediction for the $m$-th sample.
  • $t^{(m)}$: True target value for the $m$-th sample.

We aim to find the weight vector $\tilde{\mathbf{w}}$ that minimizes $J(\tilde{\mathbf{w}})$.
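
As a reference point for the probabilistic derivation later in the chapter, here is a small sketch of how this loss could be computed (the function name `mse_loss` is our own choice):

```python
import numpy as np

def mse_loss(w_tilde, X_aug, t):
    """MSE loss J(w) = 1/(2M) * sum_m (y^(m) - t^(m))^2, as defined above."""
    y = X_aug @ w_tilde                # predictions for all M samples
    return np.mean((y - t) ** 2) / 2   # np.mean supplies 1/M, the extra /2 gives 1/(2M)
```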

Basics of Probability Theory

To build a probabilistic interpretation of linear regression, we first revisit some fundamental concepts in probability theory.

Probability Axioms

Probability is a function that assigns a real number between 0 and 1 to events in a sample space $\Omega$:

  1. Non-negativity:

    $$P(A) \geq 0 \quad \text{for any event } A \subseteq \Omega$$
  2. Normalization:

    $$P(\Omega) = 1$$
  3. Additivity (for mutually exclusive events):

    $$P\left( \bigcup_{i} A_i \right) = \sum_{i} P(A_i) \quad \text{if } A_i \cap A_j = \emptyset \text{ for } i \neq j$$

Joint, Marginal, and Conditional Probabilities

  • Joint Probability:

    $$P(A, B) = P(A \cap B)$$

    Represents the probability of both events $A$ and $B$ occurring.

  • Marginal Probability:

    $$P(A) = \sum_{B} P(A, B)$$

    The probability of event $A$ occurring, irrespective of $B$.

  • Conditional Probability:

    $$P(A \mid B) = \frac{P(A, B)}{P(B)} \quad \text{if } P(B) > 0$$

    The probability of $A$ occurring given that $B$ has occurred.

Bayes' Theorem

Bayes' theorem relates conditional probabilities in reverse order:

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$

This theorem allows us to update our beliefs about $A$ after observing $B$.
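
A quick numerical illustration with made-up probabilities, showing how observing $B$ updates the belief about $A$:

```python
# Hypothetical numbers: P(A) = prior, P(B|A) and P(B|not A) = likelihoods.
p_A = 0.01            # prior probability of A
p_B_given_A = 0.95    # probability of observing B when A holds
p_B_given_notA = 0.05 # probability of observing B when A does not hold

# Marginal P(B) via the law of total probability.
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem: posterior P(A | B).
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)    # ~0.161: observing B raises the belief in A from 1% to about 16%
```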

Independence

Two events $A$ and $B$ are independent if:

$$P(A, B) = P(A) P(B)$$

This implies that the occurrence of $A$ provides no information about $B$, and vice versa.

Expectation

The expected value (or mean) of a random variable $X$ with probability distribution $P(X)$ is:

$$\mathbb{E}[X] = \sum_{x} x P(x) \quad \text{(discrete case)}$$

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x P(x) \, dx \quad \text{(continuous case)}$$

The expectation operator $\mathbb{E}$ computes the average value of $X$ weighted by its probability distribution.
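
For the discrete case, the expectation is simply a probability-weighted sum; a minimal sketch with an arbitrary four-valued distribution:

```python
import numpy as np

# A discrete random variable X with made-up values and probabilities.
values = np.array([0, 1, 2, 3])
probs  = np.array([0.1, 0.4, 0.3, 0.2])   # probabilities sum to 1

expectation = np.sum(values * probs)      # E[X] = sum_x x P(x)
print(expectation)                        # 1.6
```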

Frequentist vs. Bayesian Interpretations

Probability can be interpreted in different ways, leading to two primary schools of thought: frequentist and Bayesian.

Frequentist Interpretation

  • Definition: Probability represents the long-run frequency of an event occurring in repeated, identical trials.
  • Key Concept: Probabilities are objective properties of the physical world.
  • Limitations:
    • Not applicable to non-repeatable events (e.g., probability of rain tomorrow).
    • Cannot assign probabilities to hypotheses or parameters.

Bayesian Interpretation

  • Definition: Probability quantifies a degree of belief or uncertainty about an event or proposition.
  • Key Concept: Probabilities are subjective and can be updated with new evidence using Bayes' theorem.
  • Advantages:
    • Applicable to unique events and hypotheses.
    • Allows incorporating prior knowledge and updating beliefs.

Both interpretations follow the same mathematical rules but differ philosophically in the meaning assigned to probability.

Gaussian Distribution

The Gaussian (normal) distribution is a continuous probability distribution characterized by its mean and variance.

Univariate Gaussian Distribution

A random variable $X$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$:

$$X \sim \mathcal{N}(\mu, \sigma^2)$$

The probability density function (PDF) is:

$$P(X = x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

  • Properties:
    • Symmetric about the mean $\mu$.
    • The spread is determined by the variance $\sigma^2$.
    • The total area under the curve is 1.
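
The density can be coded directly from the formula above; a short sketch (in practice one might prefer `scipy.stats.norm.pdf`, which computes the same quantity):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# The density peaks at x = mu; for the standard normal this is 1/sqrt(2*pi) ~ 0.3989.
print(gaussian_pdf(0.0, mu=0.0, sigma2=1.0))
```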

Multivariate Gaussian Distribution

A random vector $\mathbf{X} \in \mathbb{R}^d$ follows a multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$:

$$\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

The PDF is:

$$P(\mathbf{X} = \mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$

  • $\boldsymbol{\mu}$: Mean vector ($d \times 1$).
  • $\boldsymbol{\Sigma}$: Covariance matrix ($d \times d$), must be positive definite.
  • Properties:
    • Generalizes the univariate normal distribution to higher dimensions.
    • The shape is determined by $\boldsymbol{\Sigma}$.
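
A direct translation of this density into NumPy, as a sketch (for real use, `scipy.stats.multivariate_normal` offers a numerically robust implementation):

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma) for a d-dimensional x."""
    d = mu.shape[0]
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad_form = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad_form) / norm_const

# Illustrative 2-D example with a positive definite covariance matrix.
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(multivariate_gaussian_pdf(np.array([0.5, -0.5]), mu, Sigma))
```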

Probabilistic Interpretation of Linear Regression

We now incorporate probability into the linear regression model by treating the target variable as a random variable influenced by Gaussian noise.

Assumptions

  • Data Generation Process:

    $$t^{(m)} = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}^{(m)} + \varepsilon^{(m)}$$
    • $t^{(m)}$: Target variable for the $m$-th sample.
    • $\tilde{\mathbf{w}}$: Weight vector (including bias).
    • $\tilde{\mathbf{x}}^{(m)}$: Augmented input vector.
    • $\varepsilon^{(m)}$: Noise term.
  • Noise Term:

    $$\varepsilon^{(m)} \sim \mathcal{N}(0, \sigma^2)$$
    • Zero-mean Gaussian noise with variance $\sigma^2$.
    • Assumed to be independent and identically distributed (i.i.d.) across samples.
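
To make the assumed data generation process concrete, the sketch below simulates targets from a known weight vector plus i.i.d. Gaussian noise; all numbers (`w_true`, `sigma`, the sample size) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

M, d = 100, 2
w_true = np.array([0.5, 2.0, -1.0])             # [b, w1, w2], the "true" parameters
sigma = 0.3                                     # noise standard deviation

X = rng.normal(size=(M, d))                     # raw inputs
X_aug = np.hstack([np.ones((M, 1)), X])         # augment with a column of ones

eps = rng.normal(loc=0.0, scale=sigma, size=M)  # epsilon^(m) ~ N(0, sigma^2), i.i.d.
t = X_aug @ w_true + eps                        # t^(m) = w^T x^(m) + epsilon^(m)
```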

Likelihood Function

The likelihood function quantifies the probability of observing the data given the model parameters:

$$P(\mathbf{t} \mid \mathbf{X}, \tilde{\mathbf{w}}, \sigma^2) = \prod_{m=1}^{M} P(t^{(m)} \mid \tilde{\mathbf{x}}^{(m)}, \tilde{\mathbf{w}}, \sigma^2)$$

The likelihood factorizes into a product because, under the i.i.d. noise assumption, each $t^{(m)}$ is conditionally independent of the other targets given $\tilde{\mathbf{x}}^{(m)}$ and $\tilde{\mathbf{w}}$.

Using the Gaussian assumption:

$$P(t^{(m)} \mid \tilde{\mathbf{x}}^{(m)}, \tilde{\mathbf{w}}, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{\left( t^{(m)} - \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}^{(m)} \right)^2}{2\sigma^2} \right)$$

Log-Likelihood Function

To simplify computations, we take the logarithm of the likelihood function:

$$
\begin{align*}
\mathcal{L}(\tilde{\mathbf{w}}, \sigma^2) &= \ln P(\mathbf{t} \mid \mathbf{X}, \tilde{\mathbf{w}}, \sigma^2) \\
&= \sum_{m=1}^{M} \ln P(t^{(m)} \mid \tilde{\mathbf{x}}^{(m)}, \tilde{\mathbf{w}}, \sigma^2) \\
&= -\frac{M}{2} \ln (2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{m=1}^{M} \left( t^{(m)} - \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}^{(m)} \right)^2
\end{align*}
$$
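
The final closed-form line can be evaluated directly; a small sketch of the log-likelihood as a function (the name `log_likelihood` is our own):

```python
import numpy as np

def log_likelihood(w_tilde, sigma2, X_aug, t):
    """Gaussian log-likelihood of targets t given inputs X_aug and parameters."""
    M = t.shape[0]
    residuals = t - X_aug @ w_tilde
    return -0.5 * M * np.log(2 * np.pi * sigma2) - np.sum(residuals ** 2) / (2 * sigma2)
```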

Maximum Likelihood Estimation (MLE)

We aim to find the parameters $\tilde{\mathbf{w}}$ and $\sigma^2$ that maximize the log-likelihood function:

$$\tilde{\mathbf{w}}_{\text{MLE}},\ \sigma^2_{\text{MLE}} = \arg \max_{\tilde{\mathbf{w}}, \sigma^2} \mathcal{L}(\tilde{\mathbf{w}}, \sigma^2)$$

Estimating $\tilde{\mathbf{w}}$

For any fixed $\sigma^2$, the value of $\tilde{\mathbf{w}}$ that maximizes $\mathcal{L}$ does not depend on $\sigma^2$ (which only scales the data-dependent term), so we can estimate $\tilde{\mathbf{w}}$ first and treat $\sigma^2$ afterward.

  • Objective Function (Negative Log-Likelihood):

    $$
    \begin{align*}
    J(\tilde{\mathbf{w}}) &= \frac{1}{2} \sum_{m=1}^{M} \left( t^{(m)} - \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}^{(m)} \right)^2 \\
    &= \frac{1}{2} \| \mathbf{t} - \mathbf{X} \tilde{\mathbf{w}} \|^2
    \end{align*}
    $$
  • Minimizing $J(\tilde{\mathbf{w}})$ is equivalent to maximizing the log-likelihood $\mathcal{L}(\tilde{\mathbf{w}}, \sigma^2)$ (ignoring constants and $\sigma^2$), as the sketch below illustrates.
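
In code, maximizing the log-likelihood over $\tilde{\mathbf{w}}$ therefore reduces to an ordinary least-squares fit; a minimal sketch, reusing the hypothetical `X_aug` and `t` from the simulation sketch earlier in the chapter:

```python
import numpy as np

# Least squares minimizes ||t - X_aug @ w||^2, i.e. the negative log-likelihood
# up to constants and the 1/sigma^2 scale, so its solution is the MLE of w.
w_mle, *_ = np.linalg.lstsq(X_aug, t, rcond=None)
```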

Connection to MSE Loss

The objective function $J(\tilde{\mathbf{w}})$ derived from the negative log-likelihood is proportional to the MSE loss function used in linear regression.

  • Conclusion: Maximizing the likelihood under the Gaussian noise assumption leads to minimizing the MSE loss.

Estimating $\sigma^2$

After finding $\tilde{\mathbf{w}}_{\text{MLE}}$, we can estimate $\sigma^2$ by setting the derivative of $\mathcal{L}$ with respect to $\sigma^2$ to zero, which gives:

$$\sigma^2_{\text{MLE}} = \frac{1}{M} \sum_{m=1}^{M} \left( t^{(m)} - \tilde{\mathbf{w}}_{\text{MLE}}^\top \tilde{\mathbf{x}}^{(m)} \right)^2$$

This is the mean of the squared residuals, which serves as the maximum likelihood estimate of the noise variance.
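
Continuing the same hypothetical sketch (with `w_mle`, `X_aug`, and `t` from the earlier snippets), the estimate is just the mean of the squared residuals:

```python
import numpy as np

residuals = t - X_aug @ w_mle          # t^(m) - w_MLE^T x^(m) for every sample
sigma2_mle = np.mean(residuals ** 2)   # (1/M) * sum of squared residuals
```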

Implications and Insights

  • Justification of MSE Loss: The MSE loss function is not just a convenient choice; it arises naturally from probabilistic assumptions about the data generation process.
  • Assumption of Gaussian Noise: The derivation relies on the assumption that the noise in the target variable is Gaussian and homoscedastic (constant variance).
  • Interpretation of Linear Regression: Linear regression can be viewed as finding the maximum likelihood estimate of the parameters under the Gaussian noise model.
  • Extension to Other Distributions: If the noise follows a different distribution, the corresponding loss function would change (e.g., Laplace noise leads to the mean absolute error loss).

Conclusion

By incorporating probabilistic reasoning into linear regression, we've established a theoretical foundation for the use of the MSE loss function. The assumption of Gaussian noise in the data generation process leads directly to the MSE loss through maximum likelihood estimation.

This probabilistic interpretation enhances our understanding of linear regression and provides a framework for extending the model. For example, we can now consider different noise models, incorporate prior beliefs (leading to Bayesian linear regression), or handle heteroscedasticity (variable noise levels).

In the next chapter, we will delve deeper into probabilistic models and explore how they can be used to make predictions and infer parameters in more complex settings.

Recap

In this chapter, we've covered:

  • Probability in Machine Learning: Recognized the importance of probability in modeling uncertainty and informing model design.
  • Basics of Probability Theory: Reviewed key concepts such as probability axioms, joint and conditional probabilities, independence, and expectation.
  • Frequentist vs. Bayesian Interpretations: Explored different philosophical views on probability and their implications.
  • Gaussian Distribution: Understood the properties and importance of the normal distribution in modeling continuous variables.
  • Probabilistic Interpretation of Linear Regression:
    • Assumed a data generation process involving Gaussian noise.
    • Derived the likelihood function for the data given the model parameters.
    • Used maximum likelihood estimation to connect the probabilistic model with the MSE loss function.
  • Implications: Realized that the MSE loss function arises naturally from probabilistic assumptions, providing a solid theoretical justification for its use.