
Bayesian Learning


In this chapter, you'll learn about:

  • Bayesian Learning Principles: Understanding the Bayesian framework and how it differs from frequentist approaches.
  • Maximum Likelihood and MAP Estimation: Reviewing MLE and MAP in the context of Bayesian inference.
  • Bayesian Linear Regression: Applying Bayesian methods to linear regression models.
  • Predictive Distributions: Deriving the predictive distribution for unseen data.
  • Advantages of Bayesian Learning: Exploring the benefits, such as uncertainty quantification and online learning.
  • Hierarchical Bayesian Models: Introducing hyperpriors and empirical Bayes methods.

In previous chapters, we explored parameter estimation techniques like Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) Estimation. These methods provide point estimates of the parameters. However, they do not capture the uncertainty associated with these estimates.

In this chapter, we delve into Bayesian Learning, a probabilistic framework that models uncertainty by treating parameters as random variables with prior distributions. We will apply Bayesian principles to linear regression, leading to Bayesian Linear Regression, and discuss the computation of predictive distributions for new data points.

Review of MLE and MAP Estimation

Maximum Likelihood Estimation (MLE)

  • Framework: Frequentist perspective.
  • Assumption: Parameters $\theta$ are unknown but fixed constants.
  • Objective: $\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \; p(D \mid \theta)$
  • Interpretation: Find the parameter value that makes the observed data most probable.

Maximum A Posteriori (MAP) Estimation

  • Framework: Bayesian perspective.
  • Assumption: Parameters $\theta$ are random variables with a prior distribution $p(\theta)$.
  • Objective: $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \; p(\theta \mid D)$
  • Using Bayes' Theorem: $p(\theta \mid D) \propto p(D \mid \theta) \, p(\theta)$
  • Interpretation: Find the parameter value that is most probable given the data and prior belief.
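To make the contrast concrete, the sketch below fits a toy linear-regression problem two ways: MLE (ordinary least squares) and MAP with a zero-mean isotropic Gaussian prior, which reduces to ridge regression. The data and the values of the prior precision `alpha` and noise precision `beta` are illustrative assumptions, not quantities taken from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data: t = 2*x + noise (illustrative values only)
N = 20
x = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), x])          # design matrix with a bias column
t = 2.0 * x + rng.normal(scale=0.3, size=N)   # targets

beta = 1.0 / 0.3**2   # assumed noise precision (1 / sigma^2)
alpha = 2.0           # assumed prior precision for the zero-mean Gaussian prior

# MLE: ordinary least squares, argmax_w p(t | X, w, beta)
w_mle = np.linalg.solve(X.T @ X, X.T @ t)

# MAP with prior p(w) = N(0, alpha^{-1} I): ridge regression with lambda = alpha / beta
w_map = np.linalg.solve(X.T @ X + (alpha / beta) * np.eye(2), X.T @ t)

print("MLE weights:", w_mle)
print("MAP weights:", w_map)   # shrunk toward zero relative to the MLE
```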

Bayesian Learning Principles

Bayesian Framework

  • Parameters as Random Variables: All unknown quantities are treated as random variables.
  • Prior Distribution: Represents our belief about the parameters before observing data.
  • Posterior Distribution: Updated belief after observing data, computed using Bayes' theorem.

Bayesian Decision Theory

  • Goal: Make predictions or decisions that minimize expected loss.
  • Predictive Distribution: Instead of a point estimate, we compute the distribution over possible outcomes by integrating over all parameter values.

Predictive Distribution

The predictive distribution for a new data point $\mathbf{x}^*$ is given by:

$$p(t^* \mid \mathbf{x}^*, D) = \int p(t^* \mid \mathbf{x}^*, \theta) \, p(\theta \mid D) \, d\theta$$

  • $p(t^* \mid \mathbf{x}^*, \theta)$: Likelihood of the target given the parameters.
  • $p(\theta \mid D)$: Posterior distribution over the parameters.
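In general this integral has no closed form, but it can be approximated by averaging the likelihood over samples drawn from the posterior. Below is a minimal Monte Carlo sketch for a toy model in which $\theta$ is the unknown mean of a Gaussian with known noise level; the posterior parameters are assumed values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy posterior over a scalar parameter theta (the mean of a Gaussian with
# known noise std 1.0), assumed here to be N(0.8, 0.2^2) for illustration.
post_mean, post_std, noise_std = 0.8, 0.2, 1.0

# Monte Carlo approximation of p(t* | D) = ∫ p(t* | theta) p(theta | D) dtheta
theta_samples = rng.normal(post_mean, post_std, size=10_000)

def predictive_density(t_star):
    # Average the likelihood p(t* | theta) over the posterior samples of theta.
    lik = np.exp(-0.5 * ((t_star - theta_samples) / noise_std) ** 2) \
          / (noise_std * np.sqrt(2 * np.pi))
    return lik.mean()

print(predictive_density(1.0))   # approximate predictive density at t* = 1.0
```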

Bayesian Linear Regression

Model Specification

  • Likelihood:

    $$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{X} \mathbf{w}, \beta^{-1} \mathbf{I})$$

    • $\mathbf{t}$: Target vector.
    • $\mathbf{X}$: Design matrix.
    • $\mathbf{w}$: Weight vector (parameters).
    • $\beta$: Precision (inverse variance) of the observation noise.
  • Prior over Weights:

    $$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$$

    • $\mathbf{m}_0$: Prior mean.
    • $\mathbf{S}_0$: Prior covariance matrix.

Posterior Distribution

Using Bayes' theorem, the posterior over $\mathbf{w}$ is:

$$p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$

  • Posterior Mean: $\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \mathbf{X}^\top \mathbf{t} \right)$
  • Posterior Covariance: $\mathbf{S}_N = \left( \mathbf{S}_0^{-1} + \beta \mathbf{X}^\top \mathbf{X} \right)^{-1}$
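These two formulas translate directly into a few lines of linear algebra. A minimal sketch, assuming a zero-mean isotropic prior ($\mathbf{m}_0 = \mathbf{0}$, $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$) and synthetic data with illustrative values of $\alpha$ and $\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: t = 0.5 + 2*x + Gaussian noise (illustrative values only)
N = 30
x = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), x])                  # design matrix (bias + x)
t = 0.5 + 2.0 * x + rng.normal(scale=0.25, size=N)

alpha = 2.0            # assumed prior precision: S0 = (1/alpha) * I, m0 = 0
beta = 1.0 / 0.25**2   # assumed noise precision

m0 = np.zeros(2)
S0_inv = alpha * np.eye(2)

# Posterior covariance: S_N = (S0^{-1} + beta * X^T X)^{-1}
S_N = np.linalg.inv(S0_inv + beta * X.T @ X)

# Posterior mean: m_N = S_N (S0^{-1} m0 + beta * X^T t)
m_N = S_N @ (S0_inv @ m0 + beta * X.T @ t)

print("posterior mean:", m_N)
print("posterior covariance:\n", S_N)
```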

Derivation Highlights

  • Combining Gaussians: The prior and likelihood are Gaussian, leading to a Gaussian posterior (conjugate prior).
  • Completing the Square: Used to derive the expressions for $\mathbf{m}_N$ and $\mathbf{S}_N$.

Predictive Distribution

The predictive distribution for a new input $\mathbf{x}^*$ is obtained by integrating over the posterior distribution of $\mathbf{w}$:

$$p(t^* \mid \mathbf{x}^*, D) = \int p(t^* \mid \mathbf{x}^*, \mathbf{w}) \, p(\mathbf{w} \mid D) \, d\mathbf{w}$$

Since both $p(t^* \mid \mathbf{x}^*, \mathbf{w})$ and $p(\mathbf{w} \mid D)$ are Gaussian, the predictive distribution is also Gaussian:

$$p(t^* \mid \mathbf{x}^*, D) = \mathcal{N}\left( t^* \mid \mu(\mathbf{x}^*), \sigma^2(\mathbf{x}^*) \right)$$

  • Predictive Mean: $\mu(\mathbf{x}^*) = \mathbf{x}^{*\top} \mathbf{m}_N$
  • Predictive Variance: $\sigma^2(\mathbf{x}^*) = \frac{1}{\beta} + \mathbf{x}^{*\top} \mathbf{S}_N \mathbf{x}^*$
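A minimal sketch of these predictive formulas, reusing the same assumed prior and synthetic data as the posterior example above:

```python
import numpy as np

def posterior(X, t, alpha, beta):
    """Posterior N(m_N, S_N) for a zero-mean isotropic prior S0 = (1/alpha) I."""
    S_N = np.linalg.inv(alpha * np.eye(X.shape[1]) + beta * X.T @ X)
    m_N = beta * S_N @ X.T @ t
    return m_N, S_N

def predict(x_star, m_N, S_N, beta):
    """Predictive mean and variance at a single input row x_star."""
    mean = x_star @ m_N
    var = 1.0 / beta + x_star @ S_N @ x_star   # noise term + model-uncertainty term
    return mean, var

# Illustrative data, mirroring the toy setup from the previous sketch.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
X = np.column_stack([np.ones_like(x), x])
t = 0.5 + 2.0 * x + rng.normal(scale=0.25, size=30)

m_N, S_N = posterior(X, t, alpha=2.0, beta=1.0 / 0.25**2)
x_star = np.array([1.0, 0.7])                 # new input, including the bias term
print(predict(x_star, m_N, S_N, beta=1.0 / 0.25**2))
```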

Interpretation

  • Mean Prediction: Centered at the prediction made with the posterior mean $\mathbf{m}_N$, which coincides with the MAP estimate for this Gaussian posterior.
  • Uncertainty Quantification:
    • The variance consists of two parts:
      • Data Noise: $\frac{1}{\beta}$
      • Model Uncertainty: $\mathbf{x}^{*\top} \mathbf{S}_N \mathbf{x}^*$
    • As more data is observed, $\mathbf{S}_N$ shrinks, reducing the model-uncertainty term.

Advantages of Bayesian Learning

Uncertainty Quantification

  • Provides a measure of confidence in predictions.
  • Helps in decision-making under uncertainty.

Online Learning

  • Posterior distribution can be updated incrementally as new data arrives.
  • Prior for new data becomes the posterior from previous data.
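For the Gaussian model this recursion is mechanical: the posterior $(\mathbf{m}_N, \mathbf{S}_N)$ computed from one batch is used as the prior $(\mathbf{m}_0, \mathbf{S}_0)$ for the next. A minimal sketch on streaming synthetic data (illustrative values throughout):

```python
import numpy as np

def bayes_update(m0, S0, X, t, beta):
    """One conjugate update: prior N(m0, S0) plus Gaussian likelihood gives posterior N(m_N, S_N)."""
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + beta * X.T @ X)
    m_N = S_N @ (S0_inv @ m0 + beta * X.T @ t)
    return m_N, S_N

rng = np.random.default_rng(0)
beta = 1.0 / 0.25**2                      # assumed noise precision
m, S = np.zeros(2), np.eye(2) / 2.0       # initial prior: mean 0, covariance (1/alpha) I with alpha = 2

# Stream the data in small batches; the posterior after each batch becomes the next prior.
for _ in range(5):
    x = rng.uniform(-1, 1, size=10)
    X = np.column_stack([np.ones_like(x), x])
    t = 0.5 + 2.0 * x + rng.normal(scale=0.25, size=10)
    m, S = bayes_update(m, S, X, t, beta)

print("posterior mean after all batches:", m)
```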

Avoiding Overfitting

  • Integrating over parameters prevents over-reliance on any single parameter estimate.
  • Regularization is naturally incorporated through the prior.

Flexibility in Modeling

  • Can incorporate prior knowledge through the choice of prior distributions.
  • Hierarchical models allow for modeling parameters of the priors themselves.

Hierarchical Bayesian Models

Introducing Hyperparameters

  • Hyperparameters: Parameters that govern the prior distributions (e.g., the prior precision $\alpha$ and the noise precision $\beta$).
  • Instead of fixing hyperparameters, we can treat them as random variables with their own priors (hyperpriors).

Empirical Bayes (Type II MLE)

  • Estimate hyperparameters by maximizing the marginal likelihood of the data.
  • Objective: $\hat{\alpha}, \hat{\beta} = \arg\max_{\alpha, \beta} \; p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta)$
  • Involves integrating out $\mathbf{w}$: $p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) \, p(\mathbf{w} \mid \alpha) \, d\mathbf{w}$
  • This approach balances model complexity and data fit.
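For the linear-Gaussian model the marginal likelihood is available in closed form: with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$, the targets are marginally distributed as $\mathbf{t} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{X}\mathbf{X}^\top + \beta^{-1}\mathbf{I})$. The sketch below maximizes this log-density over $\alpha$ and $\beta$ with a generic optimizer on synthetic data; it is an illustration of the idea, not a tuned implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=40)
X = np.column_stack([np.ones_like(x), x])
t = 0.5 + 2.0 * x + rng.normal(scale=0.25, size=40)

def neg_log_evidence(log_params):
    # Optimize over log(alpha), log(beta) so both precisions stay positive.
    alpha, beta = np.exp(log_params)
    cov = X @ X.T / alpha + np.eye(len(t)) / beta   # marginal covariance of t
    return -multivariate_normal.logpdf(t, mean=np.zeros(len(t)), cov=cov)

res = minimize(neg_log_evidence, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
alpha_hat, beta_hat = np.exp(res.x)
print("estimated alpha:", alpha_hat, "estimated beta:", beta_hat)
```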

Full Bayesian Treatment

  • Integrate over both parameters and hyperparameters: $p(t^* \mid \mathbf{x}^*, D) = \iint p(t^* \mid \mathbf{x}^*, \mathbf{w}, \beta) \, p(\mathbf{w}, \beta \mid D) \, d\mathbf{w} \, d\beta$
  • Computationally intensive and often requires approximation methods like Markov Chain Monte Carlo (MCMC).
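As a rough illustration of what such an approximation looks like, the sketch below runs a bare-bones random-walk Metropolis sampler over $(\mathbf{w}, \log\beta)$ for a synthetic regression problem, with an assumed Gaussian prior on the weights and an assumed broad Gaussian prior on $\log\beta$. In practice one would use a mature library such as PyMC or Stan rather than hand-rolling this loop.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=40)
X = np.column_stack([np.ones_like(x), x])
t = 0.5 + 2.0 * x + rng.normal(scale=0.25, size=40)
alpha = 2.0   # assumed prior precision on the weights

def log_posterior(z):
    # z = [w0, w1, log_beta]; unnormalized log p(w, beta | D)
    w, log_beta = z[:2], z[2]
    beta = np.exp(log_beta)
    resid = t - X @ w
    log_lik = 0.5 * len(t) * log_beta - 0.5 * beta * resid @ resid
    log_prior_w = -0.5 * alpha * w @ w
    log_prior_b = -0.5 * (log_beta / 3.0) ** 2       # broad prior on log beta (assumption)
    return log_lik + log_prior_w + log_prior_b

z = np.array([0.0, 0.0, 0.0])
samples, logp = [], log_posterior(z)
for _ in range(20_000):
    z_prop = z + rng.normal(scale=0.05, size=3)      # symmetric random-walk proposal
    logp_prop = log_posterior(z_prop)
    if np.log(rng.uniform()) < logp_prop - logp:     # Metropolis accept/reject step
        z, logp = z_prop, logp_prop
    samples.append(z.copy())

samples = np.array(samples[5_000:])                  # discard burn-in
print("posterior mean of w:", samples[:, :2].mean(axis=0))
print("posterior mean of beta:", np.exp(samples[:, 2]).mean())
```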

Practical Considerations

Choosing Priors

  • Conjugate Priors: Simplify computations (e.g., Gaussian priors for Gaussian likelihoods).
  • Non-informative Priors: Used when little prior knowledge is available (e.g., uniform or broad Gaussians).
  • Subjective Priors: Incorporate expert knowledge into the model.

Computational Challenges

  • Closed-form Solutions: Available for simple models like Bayesian linear regression.
  • Approximation Methods:
    • Variational Inference: Approximate the posterior with a simpler distribution.
    • Sampling Methods: Use MCMC to draw samples from the posterior.

When to Use Bayesian Methods

  • Small Datasets: When data is scarce, prior knowledge can significantly improve performance.
  • Uncertainty Matters: In critical applications where understanding uncertainty is important (e.g., medical diagnosis).
  • Online Learning: When data arrives sequentially, and the model needs continuous updating.

Comparison with MLE and MAP

| Aspect | MLE | MAP | Bayesian Learning |
| --- | --- | --- | --- |
| Parameters | Fixed unknown constants | Random variables with priors | Random variables integrated out |
| Estimation | Point estimate | Point estimate | Posterior distribution |
| Prior Knowledge | Not incorporated | Incorporated via priors | Fully utilized and updated |
| Predictive Distribution | Deterministic (given $\hat{\theta}$) | Deterministic (given $\hat{\theta}_{\text{MAP}}$) | Probabilistic, accounts for uncertainty |
| Overfitting Control | Relies on data size | Controlled via priors (regularization) | Naturally mitigated through integration |

Conclusion

Bayesian learning offers a comprehensive framework for statistical inference by treating all unknown quantities as random variables with probability distributions. In the context of linear regression, Bayesian methods provide not only point estimates but also quantify the uncertainty associated with predictions.

While Bayesian methods can be computationally demanding, they are valuable in scenarios where uncertainty quantification is essential, data is limited, or models need to adapt online. Understanding the principles of Bayesian learning enhances our ability to build robust models that make well-informed predictions.

Recap

In this chapter, we've covered:

  • Bayesian Learning Principles: Emphasized treating parameters as random variables and integrating over uncertainties.
  • Review of MLE and MAP: Revisited frequentist and Bayesian point estimation methods.
  • Bayesian Linear Regression: Applied Bayesian methods to linear regression, deriving the posterior over weights.
  • Predictive Distributions: Computed the predictive distribution for new data points, highlighting uncertainty quantification.
  • Advantages of Bayesian Learning: Discussed benefits like uncertainty modeling and online learning capabilities.
  • Hierarchical Bayesian Models: Introduced methods for handling hyperparameters within the Bayesian framework.