Kernel Methods and Prototype Representations
In this chapter, you'll learn about:
- Prototype Representations: Describing a prediction problem using representative examples or class centers.
- Nearest-Neighbor Style Methods: Making predictions from local similarity in the input space.
- Feature Maps and Kernels: Extending linear models by measuring similarity in richer spaces.
- The Kernel Trick: Computing inner products in high-dimensional feature spaces without constructing those features explicitly.
- Tradeoffs: When prototype and kernel methods are powerful, and when they become expensive or fragile.
So far, most models in this section have the form

$$\hat{y} = f(w^\top x + b),$$

which means their decision boundaries are linear in the chosen features. That is often a strong baseline, but sometimes the geometry of the problem is more naturally expressed in terms of similarity:
- Which training examples look most like the new input?
- Which class prototype is closest?
- How similar are two inputs after a nonlinear feature transformation?
Prototype methods and kernel methods answer those questions directly.
Prototype-Based Thinking
A prototype is a representative point or reference example used to summarize a region of the input space.
Examples:
- a class mean in nearest-centroid classification
- a stored training example in nearest-neighbor methods
- a learned reference point in radial basis function models
The central idea is that prediction can be based on closeness to prototypes rather than only on a global linear rule.
Nearest Neighbor
The simplest prototype method is nearest neighbor.
Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$ and a distance function $d(\cdot, \cdot)$, the 1-nearest-neighbor prediction is:

$$\hat{y}(x) = y_{i^*}, \qquad i^* = \arg\min_{i} d(x, x_i)$$
This method:
- has essentially no explicit training phase
- can represent highly nonlinear decision boundaries
- depends strongly on the metric and feature scaling
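The 1-NN rule above fits in a few lines. Here is a minimal NumPy sketch, assuming Euclidean distance (the function name is illustrative, not from any particular library):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Predict the label of x as the label of its closest training point.

    Uses Euclidean distance, so feature scaling matters (see below).
    """
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
    return y_train[np.argmin(dists)]             # label of the single closest one

# Tiny example: two well-separated clusters.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
print(nearest_neighbor_predict(X, y, np.array([0.2, 0.1])))  # → 0
print(nearest_neighbor_predict(X, y, np.array([4.9, 5.2])))  # → 1
```

Note that all the work happens at prediction time; "training" is just storing the data.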
k-Nearest Neighbors
Instead of using only one neighbor, k-NN uses the $k$ closest examples.
- In classification, it usually predicts by majority vote.
- In regression, it usually predicts by averaging the neighbors' targets.
Increasing $k$ usually:
- reduces variance
- increases bias
This makes $k$ itself a hyperparameter that should be selected using validation data.
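A k-NN classifier only adds a vote over the $k$ closest labels. A minimal sketch, again assuming Euclidean distance (for regression, replace the vote with a mean of the neighbors' targets):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote over the labels of the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]          # most frequent neighbor label

X = np.array([[0.0], [0.2], [0.4], [5.0], [5.2]])
y = np.array([0, 0, 1, 1, 1])
# With k=3 the query 0.3 sees neighbor labels {0, 0, 1} and the vote returns 0.
print(knn_predict(X, y, np.array([0.3]), k=3))  # → 0
```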
Class Prototypes
Sometimes we do not want to store every training example. A cheaper approximation is to summarize each class with a small number of prototypes.
Nearest Centroid
For a class $c$, let $S_c$ be the indices of its training examples and define its centroid:

$$\mu_c = \frac{1}{|S_c|} \sum_{i \in S_c} x_i$$

Then predict the class whose centroid is closest:

$$\hat{y}(x) = \arg\min_{c} d(x, \mu_c)$$
This is fast and interpretable, but it only works well when each class is reasonably compact around a center.
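The centroid rule is short enough to write out directly; a sketch assuming Euclidean distance:

```python
import numpy as np

def fit_centroids(X, y):
    """Return one mean vector per class, keyed by class label."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

X = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 9.0], [10.0, 9.0]])
y = np.array([0, 0, 1, 1])
centroids = fit_centroids(X, y)  # class 0 -> (0.5, 0.0), class 1 -> (9.5, 9.0)
print(nearest_centroid_predict(centroids, np.array([2.0, 1.0])))  # → 0
```

Only one stored vector per class is needed at prediction time, which is what makes this method cheap.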
From Prototypes to Feature Maps
Another way to use prototypes is to convert distances to features.
Suppose we choose prototypes $\mu_1, \dots, \mu_M$. We can define new features such as:

$$\phi_j(x) = \exp\!\left(-\frac{\|x - \mu_j\|^2}{2\sigma^2}\right)$$

These are radial basis features. A linear model in the transformed feature space,

$$f(x) = \sum_{j=1}^{M} w_j \phi_j(x) + b,$$

can represent nonlinear behavior in the original input space.
This is an important bridge:
- prototypes define local regions of influence
- a linear model combines those local responses globally
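This bridge can be made concrete: compute radial basis features against a handful of prototypes, then fit an ordinary least-squares model on them. A sketch, with prototype locations and the bandwidth chosen by hand purely for illustration:

```python
import numpy as np

def rbf_features(X, prototypes, sigma=1.0):
    """phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2)) for each prototype mu_j."""
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

# A 1-D target no single line can fit: y = sin(x).
X = np.linspace(0, 2 * np.pi, 40)[:, None]
y = np.sin(X[:, 0])

prototypes = np.linspace(0, 2 * np.pi, 8)[:, None]  # evenly spaced reference points
Phi = rbf_features(X, prototypes, sigma=0.8)

# Linear least squares in the transformed space (bias absorbed via a ones column).
Phi1 = np.hstack([Phi, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Phi1, y, rcond=None)
pred = Phi1 @ w
print(float(np.abs(pred - y).max()))  # small residual despite a "linear" model
```

The model is still linear in its weights; all the nonlinearity lives in the prototype-based features.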
Kernel Methods
Kernel methods start from a related observation: sometimes we want a linear model in a richer feature space $\phi(x)$, but we do not want to form $\phi(x)$ explicitly.

Instead, we work with a kernel function

$$k(x, x') = \langle \phi(x), \phi(x') \rangle,$$

which computes the inner product in that feature space directly.
Why Kernels Help
Suppose a classifier is linear in feature space:

$$f(x) = w^\top \phi(x)$$

If the learned weight vector can be written as a combination of training features,

$$w = \sum_{i=1}^{n} \alpha_i \phi(x_i),$$

then the prediction becomes

$$f(x) = \sum_{i=1}^{n} \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^{n} \alpha_i k(x_i, x).$$
This means we can build nonlinear predictors from pairwise similarities only.
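One standard way to obtain such coefficients $\alpha_i$ is kernel ridge regression, where $\alpha = (K + \lambda I)^{-1} y$ for the Gram matrix $K_{ij} = k(x_i, x_j)$. That method is not derived here; the sketch below just uses it to show a predictor built entirely from pairwise similarities:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

# Kernel ridge regression: solve (K + lam I) alpha = y,
# then predict f(x) = sum_i alpha_i k(x_i, x).
X = np.linspace(-3, 3, 30)[:, None]
y = np.tanh(X[:, 0])

K = rbf_kernel(X, X, sigma=1.0)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), y)

X_test = np.array([[0.0], [2.0]])
f = rbf_kernel(X_test, X, sigma=1.0) @ alpha  # uses only pairwise similarities
print(f)  # close to tanh(0) = 0 and tanh(2) ≈ 0.96
```

Note that $\phi$ never appears: the feature space is implicit in the kernel.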
Common Kernels
Linear Kernel

$$k(x, x') = x^\top x'$$

This recovers the ordinary linear model.
Polynomial Kernel

$$k(x, x') = (x^\top x' + c)^d$$

This corresponds to a feature space containing polynomial interactions up to degree $d$.
Radial Basis Function (RBF) Kernel

$$k(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$

This kernel measures local similarity. Nearby points have kernel value close to 1; distant points have kernel value close to 0.
The bandwidth $\sigma$ controls how local the notion of similarity is:
- small $\sigma$: highly local, flexible, higher variance
- large $\sigma$: smoother, less flexible, higher bias
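The effect of the bandwidth is easy to see numerically: the same pair of points can look similar or dissimilar depending on $\sigma$ (the values below are illustrative):

```python
import numpy as np

def rbf(a, b, sigma):
    """RBF kernel value for a single pair of points."""
    return float(np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2)))

a, b = np.array([0.0, 0.0]), np.array([1.0, 0.0])  # points at distance 1
print(rbf(a, b, sigma=0.2))  # exp(-12.5) ≈ 4e-6: essentially dissimilar
print(rbf(a, b, sigma=5.0))  # exp(-0.02) ≈ 0.98: nearly identical
```

With a tiny bandwidth almost every point is "far", so predictions depend on only a few neighbors; with a huge bandwidth almost every point is "near", so predictions are smoothed across the whole dataset.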
Prototype Methods vs. Kernel Methods
These two families are related, but they are not identical.
Prototype Methods
- represent data through explicit reference points
- often use distances directly
- can be simple and interpretable
- may store either all examples or only a few summaries
Kernel Methods
- represent similarity implicitly through inner products in feature space
- often support powerful convex optimization formulations
- can produce highly nonlinear decision boundaries
- may become expensive because they compare against many training examples
Prototype methods are usually easier to visualize. Kernel methods are often more mathematically elegant.
Choosing a Distance or Kernel
The method is only as good as its notion of similarity.
Questions to ask:
- Should two inputs be close under Euclidean distance?
- Do I need invariances that raw distance does not capture?
- Are features standardized so that one coordinate does not dominate?
- Is local similarity more important than global linear trends?
Bad features can make a sophisticated kernel method perform worse than a simple linear baseline.
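The standardization point is easy to demonstrate: when one feature's scale dwarfs another's, Euclidean distance effectively ignores the small one. The numbers below are contrived for illustration:

```python
import numpy as np

# Feature 0 is a weight in grams (~ thousands), feature 1 is a ratio in [0, 1].
a = np.array([1000.0, 0.1])
b = np.array([1010.0, 0.9])  # very different ratio, similar weight
c = np.array([1200.0, 0.1])  # identical ratio, different weight

print(np.linalg.norm(a - b))  # ≈ 10.03: the 0.8 ratio gap barely registers
print(np.linalg.norm(a - c))  # 200.0: the weight axis dominates entirely

# After per-feature standardization, both dimensions contribute comparably.
X = np.stack([a, b, c])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.linalg.norm(Z[0] - Z[1]), np.linalg.norm(Z[0] - Z[2]))
```

Without standardization, a distance-based method here would treat a and b as near-duplicates purely because of the units chosen for feature 0.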
Computational Tradeoffs
Prototype and kernel methods can be expensive when the training set is large.
- k-NN requires comparing each test point with every stored training example at prediction time, which is linear in the training-set size per query unless an index structure is used.
- Kernel methods often require storing many support examples or a Gram matrix.
- Prototype compression helps by reducing the number of reference points.
This creates a familiar tradeoff:
- more stored local structure can improve flexibility
- less stored structure improves speed and memory
When These Methods Shine
They are especially useful when:
- the boundary is not well modeled by a single hyperplane
- local neighborhoods contain meaningful information
- the dataset is medium-sized and similarity is informative
- interpretability through reference examples is helpful
They are weaker when:
- the feature space is poorly scaled
- the dataset is extremely large and latency matters
- similarity in raw input space is not meaningful
You can think of kernels and prototypes as an answer to the limitations of strict linearity:
- GLMs keep the model linear in a transformed output space.
- Kernel and prototype methods keep the prediction rule tied to similarity structure in the input space or feature space.
Both are useful extensions of the basic linear-model toolkit, but they solve different problems.