
Bayesian Learning


In this chapter, you'll learn about:

  • Bayesian Learning Principles: Understanding the Bayesian framework and how it differs from frequentist approaches.
  • Maximum Likelihood and MAP Estimation: Reviewing MLE and MAP in the context of Bayesian inference.
  • Bayesian Linear Regression: Applying Bayesian methods to linear regression models.
  • Predictive Distributions: Deriving the predictive distribution for unseen data.
  • Advantages of Bayesian Learning: Exploring the benefits, such as uncertainty quantification and online learning.
  • Hierarchical Bayesian Models: Introducing hyperpriors and empirical Bayes methods.

In previous chapters, we explored parameter estimation techniques like Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) Estimation. These methods provide point estimates of the parameters. However, they do not capture the uncertainty associated with these estimates.

In this chapter, we delve into Bayesian Learning, a probabilistic framework that models uncertainty by treating parameters as random variables with prior distributions. We will apply Bayesian principles to linear regression, leading to Bayesian Linear Regression, and discuss the computation of predictive distributions for new data points.

Review of MLE and MAP Estimation

Maximum Likelihood Estimation (MLE)

  • Framework: Frequentist perspective.
  • Assumption: Parameters $\theta$ are unknown but fixed constants.
  • Objective: $\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \; p(D \mid \theta)$
  • Interpretation: Find the parameter value that makes the observed data most probable.

Maximum A Posteriori (MAP) Estimation

  • Framework: Bayesian perspective.
  • Assumption: Parameters $\theta$ are random variables with a prior distribution $p(\theta)$.
  • Objective: $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \; p(\theta \mid D)$
  • Using Bayes' Theorem: $p(\theta \mid D) \propto p(D \mid \theta) \, p(\theta)$
  • Interpretation: Find the parameter value that is most probable given the data and prior belief.
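To make the contrast concrete, the sketch below fits a toy linear-regression problem two ways: MLE (ordinary least squares) and MAP with a zero-mean isotropic Gaussian prior, which reduces to ridge regression. The data and the values of the prior precision `alpha` and noise precision `beta` are illustrative assumptions, not quantities taken from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data: t = 2*x + noise (illustrative values only)
N = 20
x = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), x])          # design matrix with a bias column
t = 2.0 * x + rng.normal(scale=0.3, size=N)   # targets

beta = 1.0 / 0.3**2   # assumed noise precision (1 / sigma^2)
alpha = 2.0           # assumed prior precision for the zero-mean Gaussian prior

# MLE: ordinary least squares, argmax_w p(t | X, w, beta)
w_mle = np.linalg.solve(X.T @ X, X.T @ t)

# MAP with prior p(w) = N(0, alpha^{-1} I): ridge regression with lambda = alpha / beta
w_map = np.linalg.solve(X.T @ X + (alpha / beta) * np.eye(2), X.T @ t)

print("MLE weights:", w_mle)
print("MAP weights:", w_map)   # shrunk toward zero relative to the MLE
```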

Bayesian Learning Principles

Bayesian Framework

  • Parameters as Random Variables: All unknown quantities are treated as random variables.
  • Prior Distribution: Represents our belief about the parameters before observing data.
  • Posterior Distribution: Updated belief after observing data, computed using Bayes' theorem.

Bayesian Decision Theory

  • Goal: Make predictions or decisions that minimize expected loss.
  • Predictive Distribution: Instead of a point estimate, we compute the distribution over possible outcomes by integrating over all parameter values.

Predictive Distribution

The predictive distribution for a new data point $\mathbf{x}^*$ is given by:

$$p(t^* \mid \mathbf{x}^*, D) = \int p(t^* \mid \mathbf{x}^*, \theta) \, p(\theta \mid D) \, d\theta$$

  • $p(t^* \mid \mathbf{x}^*, \theta)$: Likelihood of the target given the parameters.
  • $p(\theta \mid D)$: Posterior distribution over the parameters.
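In general this integral has no closed form, but it can be approximated by averaging the likelihood over samples drawn from the posterior. Below is a minimal Monte Carlo sketch for a toy model in which $\theta$ is the unknown mean of a Gaussian with known noise level; the posterior parameters are assumed values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy posterior over a scalar parameter theta (the mean of a Gaussian with
# known noise std 1.0), assumed here to be N(0.8, 0.2^2) for illustration.
post_mean, post_std, noise_std = 0.8, 0.2, 1.0

# Monte Carlo approximation of p(t* | D) = ∫ p(t* | theta) p(theta | D) dtheta
theta_samples = rng.normal(post_mean, post_std, size=10_000)

def predictive_density(t_star):
    # Average the likelihood p(t* | theta) over the posterior samples of theta.
    lik = np.exp(-0.5 * ((t_star - theta_samples) / noise_std) ** 2) \
          / (noise_std * np.sqrt(2 * np.pi))
    return lik.mean()

print(predictive_density(1.0))   # approximate predictive density at t* = 1.0
```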

Bayesian Linear Regression

Model Specification

  • Likelihood:

    $$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{X} \mathbf{w}, \beta^{-1} \mathbf{I})$$

    • $\mathbf{t}$: Target vector.
    • $\mathbf{X}$: Design matrix.
    • $\mathbf{w}$: Weight vector (parameters).
    • $\beta$: Precision (inverse variance) of the observation noise.
  • Prior over Weights:

    $$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$$

    • $\mathbf{m}_0$: Prior mean.
    • $\mathbf{S}_0$: Prior covariance matrix.

Posterior Distribution

Using Bayes' theorem, the posterior over $\mathbf{w}$ is:

$$p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$

  • Posterior Mean: $\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \mathbf{X}^\top \mathbf{t} \right)$
  • Posterior Covariance: $\mathbf{S}_N = \left( \mathbf{S}_0^{-1} + \beta \mathbf{X}^\top \mathbf{X} \right)^{-1}$
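These two formulas translate directly into a few lines of linear algebra. A minimal sketch, assuming a zero-mean isotropic prior ($\mathbf{m}_0 = \mathbf{0}$, $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$) and synthetic data with illustrative values of $\alpha$ and $\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: t = 0.5 + 2*x + Gaussian noise (illustrative values only)
N = 30
x = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), x])                  # design matrix (bias + x)
t = 0.5 + 2.0 * x + rng.normal(scale=0.25, size=N)

alpha = 2.0            # assumed prior precision: S0 = (1/alpha) * I, m0 = 0
beta = 1.0 / 0.25**2   # assumed noise precision

m0 = np.zeros(2)
S0_inv = alpha * np.eye(2)

# Posterior covariance: S_N = (S0^{-1} + beta * X^T X)^{-1}
S_N = np.linalg.inv(S0_inv + beta * X.T @ X)

# Posterior mean: m_N = S_N (S0^{-1} m0 + beta * X^T t)
m_N = S_N @ (S0_inv @ m0 + beta * X.T @ t)

print("posterior mean:", m_N)
print("posterior covariance:\n", S_N)
```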

Derivation Highlights

  • Combining Gaussians: The prior and likelihood are Gaussian, leading to a Gaussian posterior (conjugate prior).
  • Completing the Square: Used to derive the expressions for $\mathbf{m}_N$ and $\mathbf{S}_N$.

Predictive Distribution

The predictive distribution for a new input $\mathbf{x}^*$ is obtained by integrating over the posterior distribution of $\mathbf{w}$:

$$p(t^* \mid \mathbf{x}^*, D) = \int p(t^* \mid \mathbf{x}^*, \mathbf{w}) \, p(\mathbf{w} \mid D) \, d\mathbf{w}$$

Since both $p(t^* \mid \mathbf{x}^*, \mathbf{w})$ and $p(\mathbf{w} \mid D)$ are Gaussian, the predictive distribution is also Gaussian:

$$p(t^* \mid \mathbf{x}^*, D) = \mathcal{N}\left( t^* \mid \mu(\mathbf{x}^*), \sigma^2(\mathbf{x}^*) \right)$$

  • Predictive Mean: $\mu(\mathbf{x}^*) = \mathbf{x}^{*\top} \mathbf{m}_N$
  • Predictive Variance: $\sigma^2(\mathbf{x}^*) = \frac{1}{\beta} + \mathbf{x}^{*\top} \mathbf{S}_N \mathbf{x}^*$
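A minimal sketch of these predictive formulas, reusing the same assumed prior and synthetic data as the posterior example above:

```python
import numpy as np

def posterior(X, t, alpha, beta):
    """Posterior N(m_N, S_N) for a zero-mean isotropic prior S0 = (1/alpha) I."""
    S_N = np.linalg.inv(alpha * np.eye(X.shape[1]) + beta * X.T @ X)
    m_N = beta * S_N @ X.T @ t
    return m_N, S_N

def predict(x_star, m_N, S_N, beta):
    """Predictive mean and variance at a single input row x_star."""
    mean = x_star @ m_N
    var = 1.0 / beta + x_star @ S_N @ x_star   # noise term + model-uncertainty term
    return mean, var

# Illustrative data, mirroring the toy setup from the previous sketch.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
X = np.column_stack([np.ones_like(x), x])
t = 0.5 + 2.0 * x + rng.normal(scale=0.25, size=30)

m_N, S_N = posterior(X, t, alpha=2.0, beta=1.0 / 0.25**2)
x_star = np.array([1.0, 0.7])                 # new input, including the bias term
print(predict(x_star, m_N, S_N, beta=1.0 / 0.25**2))
```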

Interpretation

  • Mean Prediction: Centered at the prediction made with the posterior mean $\mathbf{m}_N$, which coincides with the MAP estimate for this Gaussian posterior.
  • Uncertainty Quantification:
    • The variance consists of two parts:
      • Data Noise: $\frac{1}{\beta}$
      • Model Uncertainty: $\mathbf{x}^{*\top} \mathbf{S}_N \mathbf{x}^*$
    • As more data is observed, $\mathbf{S}_N$ shrinks, reducing the model-uncertainty term.

Advantages of Bayesian Learning

Uncertainty Quantification

  • Provides a measure of confidence in predictions.
  • Helps in decision-making under uncertainty.

Online Learning

  • Posterior distribution can be updated incrementally as new data arrives.
  • Prior for new data becomes the posterior from previous data.
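For the Gaussian model this recursion is mechanical: the posterior $(\mathbf{m}_N, \mathbf{S}_N)$ computed from one batch is used as the prior $(\mathbf{m}_0, \mathbf{S}_0)$ for the next. A minimal sketch on streaming synthetic data (illustrative values throughout):

```python
import numpy as np

def bayes_update(m0, S0, X, t, beta):
    """One conjugate update: prior N(m0, S0) plus Gaussian likelihood gives posterior N(m_N, S_N)."""
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + beta * X.T @ X)
    m_N = S_N @ (S0_inv @ m0 + beta * X.T @ t)
    return m_N, S_N

rng = np.random.default_rng(0)
beta = 1.0 / 0.25**2                      # assumed noise precision
m, S = np.zeros(2), np.eye(2) / 2.0       # initial prior: mean 0, covariance (1/alpha) I with alpha = 2

# Stream the data in small batches; the posterior after each batch becomes the next prior.
for _ in range(5):
    x = rng.uniform(-1, 1, size=10)
    X = np.column_stack([np.ones_like(x), x])
    t = 0.5 + 2.0 * x + rng.normal(scale=0.25, size=10)
    m, S = bayes_update(m, S, X, t, beta)

print("posterior mean after all batches:", m)
```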

Avoiding Overfitting

  • Integrating over parameters prevents over-reliance on any single parameter estimate.
  • Regularization is naturally incorporated through the prior.

Flexibility in Modeling

  • Can incorporate prior knowledge through the choice of prior distributions.
  • Hierarchical models allow for modeling parameters of the priors themselves.

Hierarchical Bayesian Models

Introducing Hyperparameters

  • Hyperparameters: Parameters that govern the prior distributions (e.g., the prior precision $\alpha$ and the noise precision $\beta$).
  • Instead of fixing hyperparameters, we can treat them as random variables with their own priors (hyperpriors).

Empirical Bayes (Type II MLE)

  • Estimate hyperparameters by maximizing the marginal likelihood of the data.
  • Objective: $\hat{\alpha}, \hat{\beta} = \arg\max_{\alpha, \beta} \; p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta)$
  • Involves integrating out $\mathbf{w}$: $p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) \, p(\mathbf{w} \mid \alpha) \, d\mathbf{w}$
  • This approach balances model complexity and data fit.
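For the linear-Gaussian model the marginal likelihood is available in closed form: with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$, the targets are marginally distributed as $\mathbf{t} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{X}\mathbf{X}^\top + \beta^{-1}\mathbf{I})$. The sketch below maximizes this log-density over $\alpha$ and $\beta$ with a generic optimizer on synthetic data; it is an illustration of the idea, not a tuned implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=40)
X = np.column_stack([np.ones_like(x), x])
t = 0.5 + 2.0 * x + rng.normal(scale=0.25, size=40)

def neg_log_evidence(log_params):
    # Optimize over log(alpha), log(beta) so both precisions stay positive.
    alpha, beta = np.exp(log_params)
    cov = X @ X.T / alpha + np.eye(len(t)) / beta   # marginal covariance of t
    return -multivariate_normal.logpdf(t, mean=np.zeros(len(t)), cov=cov)

res = minimize(neg_log_evidence, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
alpha_hat, beta_hat = np.exp(res.x)
print("estimated alpha:", alpha_hat, "estimated beta:", beta_hat)
```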

Full Bayesian Treatment

  • Integrate over both parameters and hyperparameters: $p(t^* \mid \mathbf{x}^*, D) = \iint p(t^* \mid \mathbf{x}^*, \mathbf{w}, \beta) \, p(\mathbf{w}, \beta \mid D) \, d\mathbf{w} \, d\beta$
  • Computationally intensive and often requires approximation methods like Markov Chain Monte Carlo (MCMC).
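As a rough illustration of what such an approximation looks like, the sketch below runs a bare-bones random-walk Metropolis sampler over $(\mathbf{w}, \log\beta)$ for a synthetic regression problem, with an assumed Gaussian prior on the weights and an assumed broad Gaussian prior on $\log\beta$. In practice one would use a mature library such as PyMC or Stan rather than hand-rolling this loop.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=40)
X = np.column_stack([np.ones_like(x), x])
t = 0.5 + 2.0 * x + rng.normal(scale=0.25, size=40)
alpha = 2.0   # assumed prior precision on the weights

def log_posterior(z):
    # z = [w0, w1, log_beta]; unnormalized log p(w, beta | D)
    w, log_beta = z[:2], z[2]
    beta = np.exp(log_beta)
    resid = t - X @ w
    log_lik = 0.5 * len(t) * log_beta - 0.5 * beta * resid @ resid
    log_prior_w = -0.5 * alpha * w @ w
    log_prior_b = -0.5 * (log_beta / 3.0) ** 2       # broad prior on log beta (assumption)
    return log_lik + log_prior_w + log_prior_b

z = np.array([0.0, 0.0, 0.0])
samples, logp = [], log_posterior(z)
for _ in range(20_000):
    z_prop = z + rng.normal(scale=0.05, size=3)      # symmetric random-walk proposal
    logp_prop = log_posterior(z_prop)
    if np.log(rng.uniform()) < logp_prop - logp:     # Metropolis accept/reject step
        z, logp = z_prop, logp_prop
    samples.append(z.copy())

samples = np.array(samples[5_000:])                  # discard burn-in
print("posterior mean of w:", samples[:, :2].mean(axis=0))
print("posterior mean of beta:", np.exp(samples[:, 2]).mean())
```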

Practical Considerations

Choosing Priors

  • Conjugate Priors: Simplify computations (e.g., Gaussian priors for Gaussian likelihoods).
  • Non-informative Priors: Used when little prior knowledge is available (e.g., uniform or broad Gaussians).
  • Subjective Priors: Incorporate expert knowledge into the model.

Computational Challenges

  • Closed-form Solutions: Available for simple models like Bayesian linear regression.
  • Approximation Methods:
    • Variational Inference: Approximate the posterior with a simpler distribution.
    • Sampling Methods: Use MCMC to draw samples from the posterior.

When to Use Bayesian Methods

  • Small Datasets: When data is scarce, prior knowledge can significantly improve performance.
  • Uncertainty Matters: In critical applications where understanding uncertainty is important (e.g., medical diagnosis).
  • Online Learning: When data arrives sequentially, and the model needs continuous updating.

Comparison with MLE and MAP

| Aspect | MLE | MAP | Bayesian Learning |
| --- | --- | --- | --- |
| Parameters | Fixed unknown constants | Random variables with priors | Random variables integrated out |
| Estimation | Point estimate | Point estimate | Posterior distribution |
| Prior Knowledge | Not incorporated | Incorporated via priors | Fully utilized and updated |
| Predictive Distribution | Deterministic (given $\hat{\theta}$) | Deterministic (given $\hat{\theta}_{\text{MAP}}$) | Probabilistic, accounts for uncertainty |
| Overfitting Control | Relies on data size | Controlled via priors (regularization) | Naturally mitigated through integration |

Conclusion

Bayesian learning offers a comprehensive framework for statistical inference by treating all unknown quantities as random variables with probability distributions. In the context of linear regression, Bayesian methods provide not only point estimates but also quantify the uncertainty associated with predictions.

While Bayesian methods can be computationally demanding, they are valuable in scenarios where uncertainty quantification is essential, data is limited, or models need to adapt online. Understanding the principles of Bayesian learning enhances our ability to build robust models that make well-informed predictions.

Recap

In this chapter, we've covered:

  • Bayesian Learning Principles: Emphasized treating parameters as random variables and integrating over uncertainties.
  • Review of MLE and MAP: Revisited frequentist and Bayesian point estimation methods.
  • Bayesian Linear Regression: Applied Bayesian methods to linear regression, deriving the posterior over weights.
  • Predictive Distributions: Computed the predictive distribution for new data points, highlighting uncertainty quantification.
  • Advantages of Bayesian Learning: Discussed benefits like uncertainty modeling and online learning capabilities.
  • Hierarchical Bayesian Models: Introduced methods for handling hyperparameters within the Bayesian framework.