
Generalized Linear Models


In this chapter, you'll learn about:

  • Generalized Linear Models (GLMs): A unified view of linear regression, logistic regression, and other probabilistic linear models.
  • The Three Parts of a GLM: The random component, linear predictor, and link function.
  • Exponential-Family Distributions: Why Gaussian, Bernoulli, and Poisson models fit naturally into the same framework.
  • Canonical Links and Likelihoods: How the choice of link determines the loss and optimization problem.
  • Practical Modeling Choices: When GLMs are appropriate and when you need richer nonlinear models.

In the previous chapters, we studied linear regression, logistic regression, and multinomial regression as if they were separate algorithms. In fact, they are closely related: a generalized linear model (GLM) is a common template that expresses all of them in one language.

The key idea is simple: we keep a linear predictor in the features, but we allow the target variable to follow a distribution that matches the task. Continuous real-valued targets suggest Gaussian models, binary targets suggest Bernoulli models, and count data often suggests Poisson models.

Why We Need GLMs

Ordinary linear regression assumes:

\mathbb{E}[Y \mid \mathbf{x}] = \mathbf{w}^\top \mathbf{x} + b

This works well when:

  • The target is real-valued.
  • The conditional noise is roughly symmetric.
  • Predicting any real number is acceptable.

It breaks down when the output has structural constraints:

  • Binary labels must stay in \{0,1\} or be interpreted as probabilities in [0,1].
  • Counts must be nonnegative integers.
  • Multiclass outputs must form a valid probability distribution whose entries sum to one.

GLMs solve this by keeping the linear part, but changing how the mean of the target distribution is connected to that linear predictor.

The Three Parts of a GLM

A GLM has three ingredients.

1. Random Component

We choose a conditional distribution for the target:

Y \mid \mathbf{x} \sim p(y \mid \mu(\mathbf{x}))

where \mu(\mathbf{x}) = \mathbb{E}[Y \mid \mathbf{x}] is the conditional mean.

Typical choices are:

  • Gaussian for real-valued regression targets.
  • Bernoulli for binary labels.
  • Poisson for counts.

2. Systematic Component

We define a linear predictor:

\eta(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b

This is the same basic form we used in linear and logistic regression.

3. Link Function

The link function connects the conditional mean to the linear predictor:

g(\mu(\mathbf{x})) = \eta(\mathbf{x})

Equivalently,

\mu(\mathbf{x}) = g^{-1}(\eta(\mathbf{x}))

The inverse link is what turns the raw score \eta into a valid mean parameter for the chosen distribution.
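The prediction pipeline above can be sketched in a few lines. This is a minimal illustration, not a library API; the names `predict_mean` and `inverse_link` are made up for this example.

```python
import numpy as np

# A GLM prediction is a linear score followed by an inverse link that
# maps the score into the valid range for the distribution's mean.
def predict_mean(w, b, x, inverse_link):
    eta = w @ x + b           # linear predictor: any real number
    return inverse_link(eta)  # mean parameter in the valid range

w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])

identity = lambda eta: eta                     # Gaussian: mean is any real
sigmoid = lambda eta: 1 / (1 + np.exp(-eta))   # Bernoulli: mean in (0, 1)

print(predict_mean(w, b=0.0, x=x, inverse_link=identity))  # 0.0
print(predict_mean(w, b=0.0, x=x, inverse_link=sigmoid))   # 0.5
```

Swapping the inverse link is the only change needed to move between model families; the linear score is computed identically.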

Exponential-Family View

Many GLMs use target distributions from the exponential family, whose densities or mass functions can be written in the form

p(y \mid \theta, \phi) = h(y, \phi)\exp\left(\frac{y\theta - A(\theta)}{\phi}\right)

You do not need this formula to use GLMs day to day, but it explains why so many models share the same optimization structure:

  • The negative log-likelihood is often convex.
  • The gradient takes a clean residual form.
  • The mean is tightly connected to the natural parameter \theta.
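As a quick sanity check on the exponential-family form, one can verify numerically that the Bernoulli pmf matches it when the natural parameter is the logit of the mean and the log-partition is A(\theta) = \log(1 + e^\theta). This is a sketch with hand-picked numbers, not part of any library.

```python
import numpy as np

# Bernoulli in exponential-family form: theta = logit(mu),
# A(theta) = log(1 + exp(theta)), h = 1, phi = 1.
mu = 0.3
theta = np.log(mu / (1 - mu))
A = np.log(1 + np.exp(theta))

for y in (0, 1):
    pmf = mu**y * (1 - mu)**(1 - y)    # standard Bernoulli pmf
    expfam = np.exp(y * theta - A)     # exponential-family form
    print(y, np.isclose(pmf, expfam))  # True for both outcomes
```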

Common GLM Examples

The table below summarizes the most important cases in this section.

| Task | Target distribution | Mean | Link function |
| --- | --- | --- | --- |
| Linear regression | Gaussian | \mu \in \mathbb{R} | Identity: g(\mu)=\mu |
| Logistic regression | Bernoulli | \mu \in (0,1) | Logit: g(\mu)=\log\frac{\mu}{1-\mu} |
| Multinomial regression | Categorical / multinomial | class probabilities | Softmax / generalized logit |
| Poisson regression | Poisson | \mu > 0 | Log: g(\mu)=\log \mu |

Gaussian GLM: Linear Regression

If

Y \mid \mathbf{x} \sim \mathcal{N}(\mu(\mathbf{x}), \sigma^2)

and we use the identity link, then

\mu(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b

Maximizing the likelihood leads to the squared-error objective. This is exactly ordinary least squares under Gaussian noise assumptions.
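The Gaussian case can be checked end to end on synthetic data: maximizing the likelihood is the same as solving a least-squares problem, so an ordinary least-squares solver recovers the true parameters. The data-generating values below are invented for the example.

```python
import numpy as np

# Under Gaussian noise with the identity link, maximum likelihood
# reduces to ordinary least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true, b_true = np.array([1.5, -2.0]), 0.5
y = X @ w_true + b_true + 0.1 * rng.normal(size=200)

# Append a column of ones so the intercept is estimated jointly.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [1.5, -2.0, 0.5]
```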

Bernoulli GLM: Logistic Regression

For binary classification,

Y \mid \mathbf{x} \sim \text{Bernoulli}(\mu(\mathbf{x}))

and the canonical link is the logit:

\log \frac{\mu(\mathbf{x})}{1-\mu(\mathbf{x})} = \mathbf{w}^\top \mathbf{x} + b

Solving for \mu(\mathbf{x}) gives the sigmoid:

\mu(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b)

This is why logistic regression still uses a linear score internally, but outputs a valid probability.
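The logit/sigmoid pair can be verified numerically: applying the sigmoid to any linear score gives a probability in (0, 1), and the logit maps it back exactly. A small sketch with arbitrary scores:

```python
import numpy as np

# The logit link and the sigmoid are inverses: a linear score in
# log-odds space becomes a valid probability, and vice versa.
def sigmoid(eta):
    return 1 / (1 + np.exp(-eta))

def logit(mu):
    return np.log(mu / (1 - mu))

eta = np.array([-2.0, 0.0, 3.0])    # arbitrary linear scores
mu = sigmoid(eta)                   # probabilities in (0, 1)
print(np.allclose(logit(mu), eta))  # True: the round trip recovers eta
```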

Poisson GLM: Count Modeling

When the target is a count, a common model is

Y \mid \mathbf{x} \sim \text{Poisson}(\mu(\mathbf{x}))

with log link

\log \mu(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b

so that

\mu(\mathbf{x}) = \exp(\mathbf{w}^\top \mathbf{x} + b)

This guarantees the predicted mean stays positive. Poisson GLMs are useful for quantities like event counts, arrivals, or number of clicks.
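A Poisson GLM is simple enough to fit by hand with gradient descent on the negative log-likelihood. The sketch below uses synthetic data and a hand-tuned step size; it is an illustration of the mechanics, not a production fitting routine.

```python
import numpy as np

# Fit a Poisson GLM with the log link by gradient descent on the NLL.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
w_true, b_true = np.array([0.4, -0.3]), 1.0
y = rng.poisson(np.exp(X @ w_true + b_true))

w, b = np.zeros(2), 0.0
for _ in range(2000):
    mu = np.exp(X @ w + b)            # inverse log link keeps mu > 0
    grad_w = X.T @ (mu - y) / len(y)  # canonical link: residual-form gradient
    grad_b = np.mean(mu - y)
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

print(w, b)  # close to [0.4, -0.3] and 1.0
```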

Likelihood and Loss

Once the distribution is chosen, training a GLM usually means maximizing the conditional likelihood:

\hat{\mathbf{w}}, \hat{b} = \arg\max_{\mathbf{w}, b} \prod_{i=1}^{N} p(y^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w}, b)

or, equivalently, minimizing the negative log-likelihood:

J(\mathbf{w}, b) = - \sum_{i=1}^{N} \log p(y^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w}, b)

This recovers familiar losses:

  • Gaussian + identity link gives squared error.
  • Bernoulli + logit link gives binary cross-entropy.
  • Multiclass softmax gives multiclass cross-entropy.
  • Poisson + log link gives Poisson negative log-likelihood.
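The per-example negative log-likelihoods behind these losses can be written out directly. A sketch (constants that do not depend on the prediction, like \log y!, are dropped where noted):

```python
import numpy as np

# Per-example negative log-likelihoods; each takes the predicted mean
# mu and the observed target y.
def gaussian_nll(mu, y, sigma=1.0):
    # Squared error plus a constant in mu.
    return 0.5 * ((y - mu) / sigma) ** 2 + 0.5 * np.log(2 * np.pi * sigma**2)

def bernoulli_nll(mu, y):
    # Binary cross-entropy.
    return -(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def poisson_nll(mu, y):
    # Poisson NLL up to the log(y!) constant.
    return mu - y * np.log(mu)

print(bernoulli_nll(0.9, 1))  # ≈ 0.105: a confident, correct prediction
```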

This is one of the main reasons GLMs are so useful: the modeling choice and the training objective stay aligned.

Canonical Links

For exponential-family models there is often a preferred link called the canonical link. It maps the mean parameter to the natural parameter of the distribution.

Examples:

  • Gaussian: canonical link is identity.
  • Bernoulli: canonical link is logit.
  • Poisson: canonical link is log.

Canonical links are attractive because they usually lead to cleaner gradients and convex objectives. They are not the only valid choice, but they are often the default starting point.
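One concrete payoff of the canonical link is that the per-example NLL gradient with respect to the weights takes the same residual form, (mu - y) x, in all three cases above. A sketch with hand-picked numbers:

```python
import numpy as np

# With the canonical link, the per-example gradient of the NLL with
# respect to w is (mu - y) * x for Gaussian, Bernoulli, and Poisson.
x = np.array([1.0, 2.0])
w = np.array([0.3, -0.1])
eta = w @ x  # linear score: 0.1

inverse_links = {
    "gaussian": lambda e: e,
    "bernoulli": lambda e: 1 / (1 + np.exp(-e)),
    "poisson": np.exp,
}
targets = {"gaussian": 0.7, "bernoulli": 1, "poisson": 2}

for name, inv in inverse_links.items():
    mu = inv(eta)
    grad = (mu - targets[name]) * x  # identical residual form in each case
    print(name, grad)
```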

Interpreting Coefficients

GLMs remain attractive because the parameters still have interpretable effects.

  • In linear regression, a unit increase in feature x_j changes the predicted mean additively by w_j.
  • In logistic regression, a unit increase in x_j changes the log-odds linearly by w_j.
  • In Poisson regression, a unit increase in x_j changes the log expected count linearly by w_j.

The parameter effect is linear in the transformed space, not necessarily in the original output space.

When GLMs Work Well

GLMs are strong baselines when:

  • The features already capture the key structure of the problem.
  • You want interpretable coefficients.
  • The target distribution has obvious constraints.
  • You need a model that trains quickly and behaves predictably.

Limitations of GLMs

A GLM can still fail if the linear predictor is too restrictive.

  • Missing nonlinear structure: The true decision boundary or regression surface may not be close to linear in the chosen features.
  • Missing interactions: Important feature combinations may be absent.
  • Distribution mismatch: The chosen likelihood may poorly reflect the data.

Common fixes include:

  • Adding basis features or interactions.
  • Using kernels.
  • Moving to tree-based models or neural networks.
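The first fix, basis expansion, keeps all of the GLM machinery intact: the model is still linear in the (expanded) features. A sketch on synthetic data with a quadratic term:

```python
import numpy as np

# A quadratic basis expansion lets the same least-squares machinery
# fit a curved relationship; the model stays linear in the new features.
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x + 2.0 * x**2 + 0.1 * rng.normal(size=200)

A = np.column_stack([np.ones_like(x), x, x**2])  # basis expansion
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [1.0, 0.5, 2.0]
```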

Recap

1. What remains linear inside a generalized linear model?

2. Which link function is canonical for Bernoulli targets?

3. Why is the log link natural for Poisson regression?

What's Next

In the next chapter, we move from model specification to model selection: how to estimate generalization error, choose hyperparameters, and avoid fooling ourselves with validation leakage.