Kernel Methods and Prototype Representations

In this chapter, you'll learn about:

  • Prototype Representations: Describing a prediction problem using representative examples or class centers.
  • Nearest-Neighbor Style Methods: Making predictions from local similarity in the input space.
  • Feature Maps and Kernels: Extending linear models by measuring similarity in richer spaces.
  • The Kernel Trick: Computing inner products in high-dimensional feature spaces without constructing those features explicitly.
  • Tradeoffs: When prototype and kernel methods are powerful, and when they become expensive or fragile.

So far, most models in this section have the form

f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b

which means their decision boundaries are linear in the chosen features. That is often a strong baseline, but sometimes the geometry of the problem is more naturally expressed in terms of similarity:

  • Which training examples look most like the new input?
  • Which class prototype is closest?
  • How similar are two inputs after a nonlinear feature transformation?

Prototype methods and kernel methods answer those questions directly.

Prototype-Based Thinking

A prototype is a representative point or reference example used to summarize a region of the input space.

Examples:

  • a class mean in nearest-centroid classification
  • a stored training example in nearest-neighbor methods
  • a learned reference point in radial basis function models

The central idea is that prediction can be based on closeness to prototypes rather than only on a global linear rule.

Nearest Neighbor

The simplest prototype method is nearest neighbor.

Given a training set \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N} and a distance function d, the 1-nearest-neighbor prediction is:

\hat{y}(\mathbf{x}) = y^{(i^*)} \quad \text{where} \quad i^* = \arg\min_i d(\mathbf{x}, \mathbf{x}^{(i)})

This method:

  • has essentially no explicit training phase
  • can represent highly nonlinear decision boundaries
  • depends strongly on the metric and feature scaling
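As a concrete sketch, the 1-nearest-neighbor rule above fits in a few lines of NumPy. Euclidean distance and the toy dataset are illustrative choices:

```python
import numpy as np

def nn_predict(X_train, y_train, x):
    """1-nearest-neighbor prediction under Euclidean distance."""
    # Distance from x to every stored training example.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Return the label of the single closest example.
    return y_train[np.argmin(dists)]

# Toy 2-D dataset: two well-separated classes.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])

print(nn_predict(X_train, y_train, np.array([0.1, 0.2])))  # closest examples are class 0
```

Note that all the work happens at prediction time; "training" is just storing the data.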

k-Nearest Neighbors

Instead of using only one neighbor, k-NN uses the k closest examples.

  • In classification, it usually predicts by majority vote.
  • In regression, it usually predicts by averaging the neighbors' targets.

Increasing k usually:

  • reduces variance
  • increases bias

This makes k itself a hyperparameter that should be selected using validation data.
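Both prediction modes can be sketched in one small function; the dataset, the `mode` flag, and the tie-breaking behavior of the sort are all illustrative choices:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, mode="classify"):
    """k-NN: majority vote for classification, mean of targets for regression."""
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k]          # indices of the k closest examples
    if mode == "classify":
        return Counter(y_train[idx]).most_common(1)[0][0]
    return float(np.mean(y_train[idx]))  # regression: average the targets

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0], [3.1, 3.0]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.2]), k=3))  # 3 nearest are class 0
```

Wrapping this in a validation loop over several values of k is the usual way to pick the hyperparameter.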

Class Prototypes

Sometimes we do not want to store every training example. A cheaper approximation is to summarize each class with a small number of prototypes.

Nearest Centroid

For a class c, define its centroid:

\boldsymbol{\mu}_c = \frac{1}{N_c}\sum_{i : y^{(i)}=c}\mathbf{x}^{(i)}

Then predict the class whose centroid is closest:

\hat{y}(\mathbf{x}) = \arg\min_c d(\mathbf{x}, \boldsymbol{\mu}_c)

This is fast and interpretable, but it only works well when each class is reasonably compact around a center.
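A minimal sketch of both steps, computing one mean per class and then predicting by closest centroid (toy data and names are illustrative):

```python
import numpy as np

def fit_centroids(X, y):
    """One centroid per class: the mean of that class's examples."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(centroids, x):
    """Predict the class whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
centroids = fit_centroids(X, y)          # {0: [0.5, 0.0], 1: [5.5, 5.0]}
print(nearest_centroid_predict(centroids, np.array([1.0, 1.0])))
```

Storage is one vector per class regardless of training-set size, which is exactly the compression the text describes.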

From Prototypes to Feature Maps

Another way to use prototypes is to convert distances to features.

Suppose we choose prototypes \mathbf{p}_1, \dots, \mathbf{p}_K. We can define new features such as:

\phi_j(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{p}_j\|^2}{2\sigma^2}\right)

These are radial basis features. A linear model in the transformed feature space,

f(\mathbf{x}) = \sum_{j=1}^{K} w_j \phi_j(\mathbf{x}) + b

can represent nonlinear behavior in the original input space.

This is an important bridge:

  • prototypes define local regions of influence
  • a linear model combines those local responses globally
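This bridge can be sketched end to end: compute radial basis features against a fixed set of prototypes, then fit an ordinary least-squares linear model on those features. The prototype locations, bandwidth, and toy regression target below are all illustrative choices:

```python
import numpy as np

def rbf_features(X, prototypes, sigma):
    """phi_j(x) = exp(-||x - p_j||^2 / (2 sigma^2)) for each prototype p_j."""
    # Squared distances between every input and every prototype (broadcasting).
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0])                            # a nonlinear target
P = np.linspace(-3, 3, 8)[:, None]             # 8 evenly spaced prototypes
Phi = rbf_features(X, P, sigma=1.0)
Phi1 = np.hstack([Phi, np.ones((len(X), 1))])  # append a bias column
w, *_ = np.linalg.lstsq(Phi1, y, rcond=None)   # linear fit in feature space
pred = Phi1 @ w
print(np.max(np.abs(pred - y)))                # small residual on the training set
```

The model is still linear in w; all the nonlinearity lives in the prototype features.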

Kernel Methods

Kernel methods start from a related observation: sometimes we want a linear model in a richer feature space ϕ(x), but we do not want to form ϕ(x) explicitly.

Instead, we work with a kernel function

K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}')

which computes the inner product in that feature space directly.

Why Kernels Help

Suppose a classifier is linear in feature space:

f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) + b

If the learned weight vector can be written as a combination of training features,

\mathbf{w} = \sum_{i=1}^{N} \alpha_i \phi(\mathbf{x}^{(i)})

then the prediction becomes

f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{x}^{(i)}, \mathbf{x}) + b

This means we can build nonlinear predictors from pairwise similarities only.
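One standard way to obtain the coefficients α is kernel ridge regression, where α = (K + λI)⁻¹ y for the Gram matrix K of the training set. A minimal sketch, with the bias term omitted for simplicity and toy data and hyperparameters chosen only for illustration:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)), computed for all pairs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0])

# Kernel ridge: alpha = (K + lambda I)^(-1) y on the training Gram matrix.
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), y)

# Prediction uses pairwise similarities only: f(x) = sum_i alpha_i K(x_i, x).
X_test = np.array([[0.0], [1.5]])
pred = rbf_kernel(X_test, X) @ alpha
print(pred)
```

Note that ϕ is never constructed: every quantity the model touches is a kernel evaluation.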

Common Kernels

Linear Kernel

K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'

This recovers the ordinary linear model.

Polynomial Kernel

K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + c)^p

This corresponds to a feature space containing polynomial interactions up to degree p.
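For example, with p = 2 and c = 1 in two dimensions, the kernel equals the inner product of an explicit six-dimensional quadratic feature map, which is easy to check numerically:

```python
import numpy as np

def poly_kernel(x, z, c=1.0, p=2):
    """Polynomial kernel (x^T z + c)^p."""
    return (x @ z + c) ** p

def phi(x):
    """Explicit feature map whose inner product equals (x^T z + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(poly_kernel(x, z), phi(x) @ phi(z))  # both equal 0.25
```

The kernel evaluation costs one dot product in 2-D, while the explicit map already needs 6 coordinates; the gap widens rapidly with dimension and degree.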

Radial Basis Function (RBF) Kernel

K(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)

This kernel measures local similarity. Nearby points have kernel value close to 1; distant points have kernel value close to 0.

The bandwidth σ controls how local the notion of similarity is:

  • small σ: highly local, flexible, higher variance
  • large σ: smoother, less flexible, higher bias
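A quick numerical check of how σ changes the similarity assigned to the same pair of points (the points themselves are arbitrary):

```python
import numpy as np

def rbf(x, z, sigma):
    """RBF kernel value for a single pair of points."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

x = np.array([0.0, 0.0])
z = np.array([2.0, 0.0])     # Euclidean distance 2 from x

# Small sigma: the pair already counts as dissimilar.
print(rbf(x, z, sigma=0.5))  # exp(-8)    ≈ 0.00034
# Large sigma: the same pair still counts as similar.
print(rbf(x, z, sigma=5.0))  # exp(-0.08) ≈ 0.923
```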

Prototype Methods vs. Kernel Methods

These two families are related, but they are not identical.

Prototype Methods

  • represent data through explicit reference points
  • often use distances directly
  • can be simple and interpretable
  • may store either all examples or only a few summaries

Kernel Methods

  • represent similarity implicitly through inner products in feature space
  • often support powerful convex optimization formulations
  • can produce highly nonlinear decision boundaries
  • may become expensive because they compare against many training examples

Prototype methods are usually easier to visualize. Kernel methods are often more mathematically elegant.

Choosing a Distance or Kernel

The method is only as good as its notion of similarity.

Questions to ask:

  • Should two inputs be close under Euclidean distance?
  • Do I need invariances that raw distance does not capture?
  • Are features standardized so that one coordinate does not dominate?
  • Is local similarity more important than global linear trends?

Bad features can make a sophisticated kernel method perform worse than a simple linear baseline.
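The standardization question is easy to illustrate: when one feature lives on a much larger scale, it dominates Euclidean distance until the columns are standardized. The toy values below are arbitrary:

```python
import numpy as np

# Two features on very different scales: one ~1 (meters), one ~1000 (grams).
X = np.array([[1.0, 1000.0],
              [1.2, 5000.0],
              [9.0, 1001.0]])

# Raw distances from the first row: the gram-scale column dominates,
# so the row that moved 8 units in feature 0 looks "closest".
raw = np.linalg.norm(X[1:] - X[0], axis=1)

# Standardize each column to zero mean and unit variance, then recompute.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
std = np.linalg.norm(Xs[1:] - Xs[0], axis=1)

print(raw, std)  # the nearest neighbor changes after standardization
```

The same prototype or kernel method can give entirely different neighborhoods before and after this one preprocessing step.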

Computational Tradeoffs

Prototype and kernel methods can be expensive when the training set is large.

  • k-NN requires comparing a test point with many training examples.
  • Kernel methods often require storing many support examples or a Gram matrix.
  • Prototype compression helps by reducing the number of reference points.

This creates a familiar tradeoff:

  • more stored local structure can improve flexibility
  • less stored structure improves speed and memory

When These Methods Shine

They are especially useful when:

  • the boundary is not well modeled by a single hyperplane
  • local neighborhoods contain meaningful information
  • the dataset is medium-sized and similarity is informative
  • interpretability through reference examples is helpful

They are weaker when:

  • the feature space is poorly scaled
  • the dataset is extremely large and latency matters
  • similarity in raw input space is not meaningful

You can think of kernels and prototypes as an answer to the limitations of strict linearity:

  • GLMs keep the model linear in a transformed output space.
  • Kernel and prototype methods keep the prediction rule tied to similarity structure in the input space or feature space.

Both are useful extensions of the basic linear-model toolkit, but they solve different problems.

Recap

1. What is the main idea behind prototype methods?

2. What does a kernel function compute?

3. How does decreasing the RBF bandwidth typically change the model?