Kernel Methods and Prototype Representations

In this chapter, you'll learn about:

  • Prototype Representations: Describing a prediction problem using representative examples or class centers.
  • Nearest-Neighbor Style Methods: Making predictions from local similarity in the input space.
  • Feature Maps and Kernels: Extending linear models by measuring similarity in richer spaces.
  • The Kernel Trick: Computing inner products in high-dimensional feature spaces without constructing those features explicitly.
  • Tradeoffs: When prototype and kernel methods are powerful, and when they become expensive or fragile.

So far, most models in this section have the form

f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b

which means their decision boundaries are linear in the chosen features. That is often a strong baseline, but sometimes the geometry of the problem is more naturally expressed in terms of similarity:

  • Which training examples look most like the new input?
  • Which class prototype is closest?
  • How similar are two inputs after a nonlinear feature transformation?

Prototype methods and kernel methods answer those questions directly.

Prototype-Based Thinking

A prototype is a representative point or reference example used to summarize a region of the input space.

Examples:

  • a class mean in nearest-centroid classification
  • a stored training example in nearest-neighbor methods
  • a learned reference point in radial basis function models

The central idea is that prediction can be based on closeness to prototypes rather than only on a global linear rule.

Nearest Neighbor

The simplest prototype method is nearest neighbor.

Given a training set \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N} and a distance function d, the 1-nearest-neighbor prediction is:

\hat{y}(\mathbf{x}) = y^{(i^*)} \quad \text{where} \quad i^* = \arg\min_i d(\mathbf{x}, \mathbf{x}^{(i)})

This method:

  • has essentially no explicit training phase
  • can represent highly nonlinear decision boundaries
  • depends strongly on the metric and feature scaling
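As a concrete sketch, the 1-nearest-neighbor rule above fits in a few lines of NumPy. Euclidean distance and the toy dataset are illustrative choices:

```python
import numpy as np

def nn_predict(X_train, y_train, x):
    """1-nearest-neighbor prediction under Euclidean distance."""
    # Distance from x to every stored training example.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Return the label of the single closest example.
    return y_train[np.argmin(dists)]

# Toy 2-D dataset: two well-separated classes.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])

print(nn_predict(X_train, y_train, np.array([0.1, 0.2])))  # closest examples are class 0
```

Note that all the work happens at prediction time; "training" is just storing the data.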

k-Nearest Neighbors

Instead of using only one neighbor, k-NN uses the k closest examples.

  • In classification, it usually predicts by majority vote.
  • In regression, it usually predicts by averaging the neighbors' targets.

Increasing k usually:

  • reduces variance
  • increases bias

This makes k itself a hyperparameter that should be selected using validation data.
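Both prediction modes can be sketched in one small function; the dataset, the `mode` flag, and the tie-breaking behavior of the sort are all illustrative choices:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, mode="classify"):
    """k-NN: majority vote for classification, mean of targets for regression."""
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k]          # indices of the k closest examples
    if mode == "classify":
        return Counter(y_train[idx]).most_common(1)[0][0]
    return float(np.mean(y_train[idx]))  # regression: average the targets

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0], [3.1, 3.0]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.2]), k=3))  # 3 nearest are class 0
```

Wrapping this in a validation loop over several values of k is the usual way to pick the hyperparameter.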

Class Prototypes

Sometimes we do not want to store every training example. A cheaper approximation is to summarize each class with a small number of prototypes.

Nearest Centroid

For a class c, define its centroid:

\boldsymbol{\mu}_c = \frac{1}{N_c}\sum_{i : y^{(i)}=c}\mathbf{x}^{(i)}

Then predict the class whose centroid is closest:

\hat{y}(\mathbf{x}) = \arg\min_c d(\mathbf{x}, \boldsymbol{\mu}_c)

This is fast and interpretable, but it only works well when each class is reasonably compact around a center.
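A minimal sketch of both steps, computing one mean per class and then predicting by closest centroid (toy data and names are illustrative):

```python
import numpy as np

def fit_centroids(X, y):
    """One centroid per class: the mean of that class's examples."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(centroids, x):
    """Predict the class whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
centroids = fit_centroids(X, y)          # {0: [0.5, 0.0], 1: [5.5, 5.0]}
print(nearest_centroid_predict(centroids, np.array([1.0, 1.0])))
```

Storage is one vector per class regardless of training-set size, which is exactly the compression the text describes.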

From Prototypes to Feature Maps

Another way to use prototypes is to convert distances to features.

Suppose we choose prototypes \mathbf{p}_1, \dots, \mathbf{p}_K. We can define new features such as:

\phi_j(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{p}_j\|^2}{2\sigma^2}\right)

These are radial basis features. A linear model in the transformed feature space,

f(\mathbf{x}) = \sum_{j=1}^{K} w_j \phi_j(\mathbf{x}) + b

can represent nonlinear behavior in the original input space.

This is an important bridge:

  • prototypes define local regions of influence
  • a linear model combines those local responses globally
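This bridge can be sketched end to end: compute radial basis features against a fixed set of prototypes, then fit an ordinary least-squares linear model on those features. The prototype locations, bandwidth, and toy regression target below are all illustrative choices:

```python
import numpy as np

def rbf_features(X, prototypes, sigma):
    """phi_j(x) = exp(-||x - p_j||^2 / (2 sigma^2)) for each prototype p_j."""
    # Squared distances between every input and every prototype (broadcasting).
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0])                            # a nonlinear target
P = np.linspace(-3, 3, 8)[:, None]             # 8 evenly spaced prototypes
Phi = rbf_features(X, P, sigma=1.0)
Phi1 = np.hstack([Phi, np.ones((len(X), 1))])  # append a bias column
w, *_ = np.linalg.lstsq(Phi1, y, rcond=None)   # linear fit in feature space
pred = Phi1 @ w
print(np.max(np.abs(pred - y)))                # small residual on the training set
```

The model is still linear in w; all the nonlinearity lives in the prototype features.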

Kernel Methods

Kernel methods start from a related observation: sometimes we want a linear model in a richer feature space ϕ(x), but we do not want to form ϕ(x) explicitly.

Instead, we work with a kernel function

K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}')

which computes the inner product in that feature space directly.

Why Kernels Help

Suppose a classifier is linear in feature space:

f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) + b

If the learned weight vector can be written as a combination of training features,

\mathbf{w} = \sum_{i=1}^{N} \alpha_i \phi(\mathbf{x}^{(i)})

then the prediction becomes

f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{x}^{(i)}, \mathbf{x}) + b

This means we can build nonlinear predictors from pairwise similarities only.
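One standard way to obtain the coefficients α is kernel ridge regression, where α = (K + λI)⁻¹ y for the Gram matrix K of the training set. A minimal sketch, with the bias term omitted for simplicity and toy data and hyperparameters chosen only for illustration:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)), computed for all pairs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0])

# Kernel ridge: alpha = (K + lambda I)^(-1) y on the training Gram matrix.
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), y)

# Prediction uses pairwise similarities only: f(x) = sum_i alpha_i K(x_i, x).
X_test = np.array([[0.0], [1.5]])
pred = rbf_kernel(X_test, X) @ alpha
print(pred)
```

Note that ϕ is never constructed: every quantity the model touches is a kernel evaluation.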

Common Kernels

Linear Kernel

K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'

This recovers the ordinary linear model.

Polynomial Kernel

K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + c)^p

This corresponds to a feature space containing polynomial interactions up to degree p.
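For example, with p = 2 and c = 1 in two dimensions, the kernel equals the inner product of an explicit six-dimensional quadratic feature map, which is easy to check numerically:

```python
import numpy as np

def poly_kernel(x, z, c=1.0, p=2):
    """Polynomial kernel (x^T z + c)^p."""
    return (x @ z + c) ** p

def phi(x):
    """Explicit feature map whose inner product equals (x^T z + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(poly_kernel(x, z), phi(x) @ phi(z))  # both equal 0.25
```

The kernel evaluation costs one dot product in 2-D, while the explicit map already needs 6 coordinates; the gap widens rapidly with dimension and degree.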

Radial Basis Function (RBF) Kernel

K(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)

This kernel measures local similarity. Nearby points have kernel value close to 1; distant points have kernel value close to 0.

The bandwidth σ controls how local the notion of similarity is:

  • small σ: highly local, flexible, higher variance
  • large σ: smoother, less flexible, higher bias
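A quick numerical check of how σ changes the similarity assigned to the same pair of points (the points themselves are arbitrary):

```python
import numpy as np

def rbf(x, z, sigma):
    """RBF kernel value for a single pair of points."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

x = np.array([0.0, 0.0])
z = np.array([2.0, 0.0])     # Euclidean distance 2 from x

# Small sigma: the pair already counts as dissimilar.
print(rbf(x, z, sigma=0.5))  # exp(-8)    ≈ 0.00034
# Large sigma: the same pair still counts as similar.
print(rbf(x, z, sigma=5.0))  # exp(-0.08) ≈ 0.923
```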

Prototype Methods vs. Kernel Methods

These two families are related, but they are not identical.

Prototype Methods

  • represent data through explicit reference points
  • often use distances directly
  • can be simple and interpretable
  • may store either all examples or only a few summaries

Kernel Methods

  • represent similarity implicitly through inner products in feature space
  • often support powerful convex optimization formulations
  • can produce highly nonlinear decision boundaries
  • may become expensive because they compare against many training examples

Prototype methods are usually easier to visualize. Kernel methods are often more mathematically elegant.

Choosing a Distance or Kernel

The method is only as good as its notion of similarity.

Questions to ask:

  • Should two inputs be close under Euclidean distance?
  • Do I need invariances that raw distance does not capture?
  • Are features standardized so that one coordinate does not dominate?
  • Is local similarity more important than global linear trends?

Bad features can make a sophisticated kernel method perform worse than a simple linear baseline.
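The standardization question is easy to illustrate: when one feature lives on a much larger scale, it dominates Euclidean distance until the columns are standardized. The toy values below are arbitrary:

```python
import numpy as np

# Two features on very different scales: one ~1 (meters), one ~1000 (grams).
X = np.array([[1.0, 1000.0],
              [1.2, 5000.0],
              [9.0, 1001.0]])

# Raw distances from the first row: the gram-scale column dominates,
# so the row that moved 8 units in feature 0 looks "closest".
raw = np.linalg.norm(X[1:] - X[0], axis=1)

# Standardize each column to zero mean and unit variance, then recompute.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
std = np.linalg.norm(Xs[1:] - Xs[0], axis=1)

print(raw, std)  # the nearest neighbor changes after standardization
```

The same prototype or kernel method can give entirely different neighborhoods before and after this one preprocessing step.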

Computational Tradeoffs

Prototype and kernel methods can be expensive when the training set is large.

  • k-NN requires comparing a test point with many training examples.
  • Kernel methods often require storing many support examples or a Gram matrix.
  • Prototype compression helps by reducing the number of reference points.

This creates a familiar tradeoff:

  • more stored local structure can improve flexibility
  • less stored structure improves speed and memory

When These Methods Shine

They are especially useful when:

  • the boundary is not well modeled by a single hyperplane
  • local neighborhoods contain meaningful information
  • the dataset is medium-sized and similarity is informative
  • interpretability through reference examples is helpful

They are weaker when:

  • the feature space is poorly scaled
  • the dataset is extremely large and latency matters
  • similarity in raw input space is not meaningful

You can think of kernels and prototypes as an answer to the limitations of strict linearity:

  • GLMs keep the model linear in a transformed output space.
  • Kernel and prototype methods keep the prediction rule tied to similarity structure in the input space or feature space.

Both are useful extensions of the basic linear-model toolkit, but they solve different problems.

Recap

1. What is the main idea behind prototype methods?

2. What does a kernel function compute?

3. How does decreasing the RBF bandwidth typically change the model?