
Probabilistic Interpretation

info

In this chapter, you'll be introduced to:

  • Probability in Machine Learning: Understanding the role of probability in modeling and interpreting machine learning algorithms.
  • Recap of Linear Regression: Revisiting the linear regression model and the mean squared error (MSE) loss function.
  • Basics of Probability Theory: Reviewing fundamental concepts, including probability axioms, joint, marginal, and conditional probabilities, and Bayes' theorem.
  • Frequentist vs. Bayesian Interpretations: Exploring different philosophical approaches to probability.
  • Gaussian Distribution: Understanding the properties of the normal distribution in both univariate and multivariate cases.
  • Maximum Likelihood Estimation (MLE): Learning how to estimate model parameters by maximizing the likelihood of observed data.
  • Probabilistic Derivation of MSE Loss: Showing how the MSE loss function arises from the MLE under Gaussian noise assumptions.

In previous chapters, we explored the linear regression model and methods for minimizing the mean squared error (MSE) loss function, including closed-form solutions and gradient-based optimization techniques. While we have used the MSE loss due to its intuitive appeal and mathematical convenience, we have yet to delve into the probabilistic foundations that justify its use.

In this chapter, we will provide a probabilistic interpretation of the linear regression model and the MSE loss function. By framing linear regression within a probabilistic context, we gain deeper insights into the assumptions underlying the model and establish a theoretical foundation for the MSE loss. This approach also introduces essential concepts in probability theory that are widely used in machine learning.

The Role of Probability in Machine Learning

Probability theory provides a formal framework for modeling uncertainty and making inferences based on data. In machine learning, probabilistic models allow us to:

  • Quantify Uncertainty: Assess the confidence of predictions and model parameters.
  • Incorporate Prior Knowledge: Use prior distributions to incorporate existing beliefs or information.
  • Make Probabilistic Predictions: Output probabilities rather than deterministic predictions.
  • Facilitate Learning: Derive training objectives based on probabilistic principles, such as maximizing likelihood.

While probability is a powerful tool in machine learning, it's important to note that not all machine learning methods require probabilistic interpretations. However, understanding probability enhances our ability to design, analyze, and justify models and algorithms.

Recap of Linear Regression and MSE Loss

Linear Regression Model

In linear regression, we model the relationship between input features $\mathbf{x} \in \mathbb{R}^d$ and a continuous target variable $t \in \mathbb{R}$ using a linear function:

$$y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$$

  • $\mathbf{w}$: Weight vector (parameters of the model).
  • $b$: Bias term (intercept).
  • $y(\mathbf{x})$: Predicted output.

For convenience, we often include the bias term within the weight vector by augmenting the input vector with a 1:

$$\tilde{\mathbf{x}} = [1, x_1, x_2, \dots, x_d]^\top, \qquad \tilde{\mathbf{w}} = [b, w_1, w_2, \dots, w_d]^\top$$
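
To make the augmentation concrete, here is a minimal NumPy sketch (the array values and names such as `X_aug` are purely illustrative):

```python
import numpy as np

# Toy design matrix: M = 3 samples, d = 2 features (illustrative values).
X = np.array([[1.2, 0.7],
              [0.3, 2.1],
              [1.8, 0.5]])

# Prepend a column of ones so the bias b becomes the first entry of the weight vector.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (M, d + 1)

# A weight vector w_tilde = [b, w1, w2] now yields all predictions in one product.
w_tilde = np.array([0.5, 1.0, -2.0])
y = X_aug @ w_tilde                                # shape (M,)
```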

Mean Squared Error Loss Function

The MSE loss measures the average squared difference between the predicted outputs and the true target values:

$$J(\tilde{\mathbf{w}}) = \frac{1}{2M} \sum_{m=1}^{M} \left( y^{(m)} - t^{(m)} \right)^2$$

  • $M$: Number of training samples.
  • $y^{(m)}$: Model prediction for the $m$-th sample.
  • $t^{(m)}$: True target value for the $m$-th sample.

We aim to find the weight vector $\tilde{\mathbf{w}}$ that minimizes $J(\tilde{\mathbf{w}})$.
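
As a reference point for the probabilistic derivation later in the chapter, here is a small sketch of how this loss could be computed (the function name `mse_loss` is our own choice):

```python
import numpy as np

def mse_loss(w_tilde, X_aug, t):
    """MSE loss J(w) = 1/(2M) * sum_m (y^(m) - t^(m))^2, as defined above."""
    y = X_aug @ w_tilde                # predictions for all M samples
    return np.mean((y - t) ** 2) / 2   # np.mean supplies 1/M, the extra /2 gives 1/(2M)
```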

Basics of Probability Theory

To build a probabilistic interpretation of linear regression, we first revisit some fundamental concepts in probability theory.

Probability Axioms

Probability is a function that assigns a real number between 0 and 1 to events in a sample space $\Omega$:

  1. Non-negativity:

    $$P(A) \geq 0 \quad \text{for any event } A \subseteq \Omega$$
  2. Normalization:

    $$P(\Omega) = 1$$
  3. Additivity (for mutually exclusive events):

    $$P\left( \bigcup_{i} A_i \right) = \sum_{i} P(A_i) \quad \text{if } A_i \cap A_j = \emptyset \text{ for } i \neq j$$

Joint, Marginal, and Conditional Probabilities

  • Joint Probability:

    $$P(A, B) = P(A \cap B)$$

    Represents the probability of both events $A$ and $B$ occurring.

  • Marginal Probability:

    $$P(A) = \sum_{B} P(A, B)$$

    The probability of event $A$ occurring, irrespective of $B$.

  • Conditional Probability:

    $$P(A \mid B) = \frac{P(A, B)}{P(B)} \quad \text{if } P(B) > 0$$

    The probability of $A$ occurring given that $B$ has occurred.

Bayes' Theorem

Bayes' theorem relates conditional probabilities in reverse order:

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$

This theorem allows us to update our beliefs about $A$ after observing $B$.
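
A quick numerical illustration with made-up probabilities, showing how observing $B$ updates the belief about $A$:

```python
# Hypothetical numbers: P(A) = prior, P(B|A) and P(B|not A) = likelihoods.
p_A = 0.01            # prior probability of A
p_B_given_A = 0.95    # probability of observing B when A holds
p_B_given_notA = 0.05 # probability of observing B when A does not hold

# Marginal P(B) via the law of total probability.
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem: posterior P(A | B).
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)    # ~0.161: observing B raises the belief in A from 1% to about 16%
```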

Independence

Two events $A$ and $B$ are independent if:

$$P(A, B) = P(A) P(B)$$

This implies that the occurrence of $A$ provides no information about $B$, and vice versa.

Expectation

The expected value (or mean) of a random variable $X$ with probability distribution $P(X)$ is:

$$\mathbb{E}[X] = \sum_{x} x P(x) \quad \text{(discrete case)}$$

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x P(x) \, dx \quad \text{(continuous case)}$$

The expectation operator $\mathbb{E}$ computes the average value of $X$ weighted by its probability distribution.
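
For the discrete case, the expectation is simply a probability-weighted sum; a minimal sketch with an arbitrary four-valued distribution:

```python
import numpy as np

# A discrete random variable X with made-up values and probabilities.
values = np.array([0, 1, 2, 3])
probs  = np.array([0.1, 0.4, 0.3, 0.2])   # probabilities sum to 1

expectation = np.sum(values * probs)      # E[X] = sum_x x P(x)
print(expectation)                        # 1.6
```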

Frequentist vs. Bayesian Interpretations

Probability can be interpreted in different ways, leading to two primary schools of thought: frequentist and Bayesian.

Frequentist Interpretation

  • Definition: Probability represents the long-run frequency of an event occurring in repeated, identical trials.
  • Key Concept: Probabilities are objective properties of the physical world.
  • Limitations:
    • Not applicable to non-repeatable events (e.g., probability of rain tomorrow).
    • Cannot assign probabilities to hypotheses or parameters.

Bayesian Interpretation

  • Definition: Probability quantifies a degree of belief or uncertainty about an event or proposition.
  • Key Concept: Probabilities are subjective and can be updated with new evidence using Bayes' theorem.
  • Advantages:
    • Applicable to unique events and hypotheses.
    • Allows incorporating prior knowledge and updating beliefs.

Both interpretations follow the same mathematical rules but differ philosophically in the meaning assigned to probability.

Gaussian Distribution

The Gaussian (normal) distribution is a continuous probability distribution characterized by its mean and variance.

Univariate Gaussian Distribution

A random variable $X$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$:

$$X \sim \mathcal{N}(\mu, \sigma^2)$$

The probability density function (PDF) is:

$$P(X = x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

  • Properties:
    • Symmetric about the mean $\mu$.
    • The spread is determined by the variance $\sigma^2$.
    • The total area under the curve is 1.
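
The density can be coded directly from the formula above; a short sketch (in practice one might prefer `scipy.stats.norm.pdf`, which computes the same quantity):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# The density peaks at x = mu; for the standard normal this is 1/sqrt(2*pi) ~ 0.3989.
print(gaussian_pdf(0.0, mu=0.0, sigma2=1.0))
```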

Multivariate Gaussian Distribution

A random vector $\mathbf{X} \in \mathbb{R}^d$ follows a multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$:

$$\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

The PDF is:

$$P(\mathbf{X} = \mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$

  • $\boldsymbol{\mu}$: Mean vector ($d \times 1$).
  • $\boldsymbol{\Sigma}$: Covariance matrix ($d \times d$), must be positive definite.
  • Properties:
    • Generalizes the univariate normal distribution to higher dimensions.
    • The shape is determined by $\boldsymbol{\Sigma}$.
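
A direct translation of this density into NumPy, as a sketch (for real use, `scipy.stats.multivariate_normal` offers a numerically robust implementation):

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma) for a d-dimensional x."""
    d = mu.shape[0]
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad_form = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad_form) / norm_const

# Illustrative 2-D example with a positive definite covariance matrix.
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(multivariate_gaussian_pdf(np.array([0.5, -0.5]), mu, Sigma))
```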

Probabilistic Interpretation of Linear Regression

We now incorporate probability into the linear regression model by treating the target variable as a random variable influenced by Gaussian noise.

Assumptions

  • Data Generation Process:

    $$t^{(m)} = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}^{(m)} + \varepsilon^{(m)}$$
    • $t^{(m)}$: Target variable for the $m$-th sample.
    • $\tilde{\mathbf{w}}$: Weight vector (including bias).
    • $\tilde{\mathbf{x}}^{(m)}$: Augmented input vector.
    • $\varepsilon^{(m)}$: Noise term.
  • Noise Term:

    $$\varepsilon^{(m)} \sim \mathcal{N}(0, \sigma^2)$$
    • Zero-mean Gaussian noise with variance $\sigma^2$.
    • Assumed to be independent and identically distributed (i.i.d.) across samples.
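
To make the assumed data generation process concrete, the sketch below simulates targets from a known weight vector plus i.i.d. Gaussian noise; all numbers (`w_true`, `sigma`, the sample size) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

M, d = 100, 2
w_true = np.array([0.5, 2.0, -1.0])             # [b, w1, w2], the "true" parameters
sigma = 0.3                                     # noise standard deviation

X = rng.normal(size=(M, d))                     # raw inputs
X_aug = np.hstack([np.ones((M, 1)), X])         # augment with a column of ones

eps = rng.normal(loc=0.0, scale=sigma, size=M)  # epsilon^(m) ~ N(0, sigma^2), i.i.d.
t = X_aug @ w_true + eps                        # t^(m) = w^T x^(m) + epsilon^(m)
```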

Likelihood Function

The likelihood function quantifies the probability of observing the data given the model parameters:

$$P(\mathbf{t} \mid \mathbf{X}, \tilde{\mathbf{w}}, \sigma^2) = \prod_{m=1}^{M} P(t^{(m)} \mid \tilde{\mathbf{x}}^{(m)}, \tilde{\mathbf{w}}, \sigma^2)$$

The likelihood factorizes into a product because, under the i.i.d. noise assumption, each $t^{(m)}$ is conditionally independent of the other targets given $\tilde{\mathbf{x}}^{(m)}$ and $\tilde{\mathbf{w}}$.

Using the Gaussian assumption:

$$P(t^{(m)} \mid \tilde{\mathbf{x}}^{(m)}, \tilde{\mathbf{w}}, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{\left( t^{(m)} - \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}^{(m)} \right)^2}{2\sigma^2} \right)$$

Log-Likelihood Function

To simplify computations, we take the logarithm of the likelihood function:

$$
\begin{align*}
\mathcal{L}(\tilde{\mathbf{w}}, \sigma^2) &= \ln P(\mathbf{t} \mid \mathbf{X}, \tilde{\mathbf{w}}, \sigma^2) \\
&= \sum_{m=1}^{M} \ln P(t^{(m)} \mid \tilde{\mathbf{x}}^{(m)}, \tilde{\mathbf{w}}, \sigma^2) \\
&= -\frac{M}{2} \ln (2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{m=1}^{M} \left( t^{(m)} - \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}^{(m)} \right)^2
\end{align*}
$$
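
The final closed-form line can be evaluated directly; a small sketch of the log-likelihood as a function (the name `log_likelihood` is our own):

```python
import numpy as np

def log_likelihood(w_tilde, sigma2, X_aug, t):
    """Gaussian log-likelihood of targets t given inputs X_aug and parameters."""
    M = t.shape[0]
    residuals = t - X_aug @ w_tilde
    return -0.5 * M * np.log(2 * np.pi * sigma2) - np.sum(residuals ** 2) / (2 * sigma2)
```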

Maximum Likelihood Estimation (MLE)

We aim to find the parameters $\tilde{\mathbf{w}}$ and $\sigma^2$ that maximize the log-likelihood function:

$$\tilde{\mathbf{w}}_{\text{MLE}},\ \sigma^2_{\text{MLE}} = \arg \max_{\tilde{\mathbf{w}}, \sigma^2} \mathcal{L}(\tilde{\mathbf{w}}, \sigma^2)$$

Estimating $\tilde{\mathbf{w}}$

For any fixed $\sigma^2$, the value of $\tilde{\mathbf{w}}$ that maximizes $\mathcal{L}$ does not depend on $\sigma^2$ (which only scales the data-dependent term), so we can estimate $\tilde{\mathbf{w}}$ first and treat $\sigma^2$ afterward.

  • Objective Function (Negative Log-Likelihood):

    $$
    \begin{align*}
    J(\tilde{\mathbf{w}}) &= \frac{1}{2} \sum_{m=1}^{M} \left( t^{(m)} - \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}^{(m)} \right)^2 \\
    &= \frac{1}{2} \| \mathbf{t} - \mathbf{X} \tilde{\mathbf{w}} \|^2
    \end{align*}
    $$
  • Minimizing $J(\tilde{\mathbf{w}})$ is equivalent to maximizing the log-likelihood $\mathcal{L}(\tilde{\mathbf{w}}, \sigma^2)$ (ignoring constants and $\sigma^2$), as the sketch below illustrates.
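
In code, maximizing the log-likelihood over $\tilde{\mathbf{w}}$ therefore reduces to an ordinary least-squares fit; a minimal sketch, reusing the hypothetical `X_aug` and `t` from the simulation sketch earlier in the chapter:

```python
import numpy as np

# Least squares minimizes ||t - X_aug @ w||^2, i.e. the negative log-likelihood
# up to constants and the 1/sigma^2 scale, so its solution is the MLE of w.
w_mle, *_ = np.linalg.lstsq(X_aug, t, rcond=None)
```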

Connection to MSE Loss

The objective function $J(\tilde{\mathbf{w}})$ derived from the negative log-likelihood is proportional to the MSE loss function used in linear regression.

  • Conclusion: Maximizing the likelihood under the Gaussian noise assumption leads to minimizing the MSE loss.

Estimating $\sigma^2$

After finding $\tilde{\mathbf{w}}_{\text{MLE}}$, we can estimate $\sigma^2$ by setting the derivative of $\mathcal{L}$ with respect to $\sigma^2$ to zero, which gives:

$$\sigma^2_{\text{MLE}} = \frac{1}{M} \sum_{m=1}^{M} \left( t^{(m)} - \tilde{\mathbf{w}}_{\text{MLE}}^\top \tilde{\mathbf{x}}^{(m)} \right)^2$$

This is the mean of the squared residuals, which serves as the maximum likelihood estimate of the noise variance.
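
Continuing the same hypothetical sketch (with `w_mle`, `X_aug`, and `t` from the earlier snippets), the estimate is just the mean of the squared residuals:

```python
import numpy as np

residuals = t - X_aug @ w_mle          # t^(m) - w_MLE^T x^(m) for every sample
sigma2_mle = np.mean(residuals ** 2)   # (1/M) * sum of squared residuals
```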

Implications and Insights

  • Justification of MSE Loss: The MSE loss function is not just a convenient choice; it arises naturally from probabilistic assumptions about the data generation process.
  • Assumption of Gaussian Noise: The derivation relies on the assumption that the noise in the target variable is Gaussian and homoscedastic (constant variance).
  • Interpretation of Linear Regression: Linear regression can be viewed as finding the maximum likelihood estimate of the parameters under the Gaussian noise model.
  • Extension to Other Distributions: If the noise follows a different distribution, the corresponding loss function would change (e.g., Laplace noise leads to the mean absolute error loss).

Conclusion

By incorporating probabilistic reasoning into linear regression, we've established a theoretical foundation for the use of the MSE loss function. The assumption of Gaussian noise in the data generation process leads directly to the MSE loss through maximum likelihood estimation.

This probabilistic interpretation enhances our understanding of linear regression and provides a framework for extending the model. For example, we can now consider different noise models, incorporate prior beliefs (leading to Bayesian linear regression), or handle heteroscedasticity (variable noise levels).

In the next chapter, we will delve deeper into probabilistic models and explore how they can be used to make predictions and infer parameters in more complex settings.

Recap

In this chapter, we've covered:

  • Probability in Machine Learning: Recognized the importance of probability in modeling uncertainty and informing model design.
  • Basics of Probability Theory: Reviewed key concepts such as probability axioms, joint and conditional probabilities, independence, and expectation.
  • Frequentist vs. Bayesian Interpretations: Explored different philosophical views on probability and their implications.
  • Gaussian Distribution: Understood the properties and importance of the normal distribution in modeling continuous variables.
  • Probabilistic Interpretation of Linear Regression:
    • Assumed a data generation process involving Gaussian noise.
    • Derived the likelihood function for the data given the model parameters.
    • Used maximum likelihood estimation to connect the probabilistic model with the MSE loss function.
  • Implications: Realized that the MSE loss function arises naturally from probabilistic assumptions, providing a solid theoretical justification for its use.