
Generalization Error and Hyperparameter Selection


In this chapter, you'll learn about:

  • Generalization: The gap between fitting the training set and performing well on new data.
  • Train / Validation / Test Splits: What each split is for and why mixing them is dangerous.
  • Cross-Validation: How to estimate out-of-sample performance when data is limited.
  • Hyperparameter Selection: Choosing model complexity, regularization strength, and other design choices.
  • Practical Evaluation Workflow: A disciplined way to compare models without leaking information.

Up to this point, most of our objectives were defined on the training data. That is necessary for learning, but it is not the real goal. The real goal is to perform well on new data drawn from the same underlying process.

This distinction is the core of machine learning:

  • We train on one finite dataset.
  • We care about performance on future examples.

The bridge between those two is called generalization.

Training Error vs. Generalization Error

Let $h$ be a trained model and let $\ell(h(\mathbf{x}), y)$ be a loss function.

The training error is the average loss on the data we optimized on:

$$\hat{R}_{\text{train}}(h) = \frac{1}{N}\sum_{i=1}^{N} \ell(h(\mathbf{x}^{(i)}), y^{(i)})$$

The generalization error is the expected loss on new samples from the data distribution:

$$R(h) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\left[\ell(h(\mathbf{x}), y)\right]$$

We can compute the training error exactly, but we usually cannot compute the true generalization error because the data distribution $\mathcal{D}$ is unknown.

This creates the main evaluation problem in ML: we need to estimate future performance from finite data.
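To make this concrete, here is a minimal sketch (using NumPy and scikit-learn on synthetic data; the model and noise level are illustrative choices) that computes the training error exactly and estimates $R(h)$ with a held-out split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: a noisy linear relationship (illustrative only).
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Training error: average loss on the data we optimized on.
train_mse = np.mean((model.predict(X_train) - y_train) ** 2)

# Held-out error: an *estimate* of the generalization error R(h),
# since the test points were never seen during fitting.
test_mse = np.mean((model.predict(X_test) - y_test) ** 2)

print(f"train MSE: {train_mse:.3f}, held-out MSE: {test_mse:.3f}")
```

The held-out figure is itself a random quantity; it approximates $R(h)$ but fluctuates with the particular split.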

Why Training Error Is Not Enough

A flexible model can drive training error very low by fitting idiosyncrasies of the dataset, including noise. This is overfitting.

Two models can therefore have very different behavior:

  • Model A has slightly higher training error but performs well on new data.
  • Model B has near-zero training error but poor test-time behavior.

The second model looks better if you only inspect the training objective, but it is worse for the actual task.

Bias, Variance, and Irreducible Noise

One useful mental model is the bias-variance decomposition for squared loss:

$$\mathbb{E}\left[(h(\mathbf{x}) - y)^2\right] = \text{Bias}(\mathbf{x})^2 + \text{Variance}(\mathbf{x}) + \sigma^2$$

where:

  • Bias measures how far the average prediction is from the true target function.
  • Variance measures how much the learned predictor changes when the training set changes.
  • $\sigma^2$ is irreducible noise in the data-generating process.

This decomposition is most precise in regression with squared loss, but the intuition is broader:

  • Very simple models often have high bias and low variance.
  • Very flexible models often have low bias and high variance.

Good model selection is largely about finding the right balance.
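The tradeoff can be simulated directly. The sketch below (synthetic data; the target function, degrees, and noise level are all arbitrary choices) refits polynomials of several degrees on many fresh training sets and measures squared bias and variance at a single query point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # The (normally unknown) target function, chosen for illustration.
    return np.sin(2 * np.pi * x)

def predictions_at(degree, x0, n=30, trials=200, noise=0.3):
    """Fit a degree-`degree` polynomial on `trials` fresh training sets
    and collect each fitted model's prediction at the query point x0."""
    preds = []
    for _ in range(trials):
        x = rng.uniform(0, 1, n)
        y = true_f(x) + rng.normal(scale=noise, size=n)
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x0))
    return np.array(preds)

x0 = 0.25
results = {}
for degree in (1, 3, 9):
    preds = predictions_at(degree, x0)
    results[degree] = {
        "bias2": (preds.mean() - true_f(x0)) ** 2,  # squared bias at x0
        "var": preds.var(),                          # variance across training sets
    }
    print(f"degree {degree}: bias^2 = {results[degree]['bias2']:.4f}, "
          f"variance = {results[degree]['var']:.4f}")
```

Typically the linear fit shows large squared bias and small variance, while the degree-9 fit shows the reverse.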

Data Splits

The standard workflow uses three roles for data.

Training Set

Used to fit parameters:

  • weights in linear regression
  • coefficients in logistic regression
  • prototype locations or kernel weights in richer models

Validation Set

Used to make modeling decisions such as:

  • polynomial degree
  • regularization strength $\lambda$
  • learning rate or batch size
  • which model family to use

Test Set

Used once, at the end, to estimate final performance after all design choices are fixed.

The test set should act like a proxy for future deployment data. If you repeatedly tune against it, it stops being a fair estimate.

A Safe Evaluation Workflow

  1. Split data into train / validation / test.
  2. Train candidate models on the training split.
  3. Compare them on the validation split.
  4. Select the best configuration.
  5. Retrain on train + validation if appropriate.
  6. Evaluate once on the test set.

This keeps model selection separate from final reporting.
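The six steps above can be sketched with scikit-learn; the dataset, the ridge model, and the candidate $\alpha$ values are all illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# 1. Split into train / validation / test (here 60 / 20 / 20).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# 2-4. Train candidates on the training split, compare on validation, select the best.
best_alpha, best_score = None, float("inf")
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    score = mean_squared_error(y_val, model.predict(X_val))
    if score < best_score:
        best_alpha, best_score = alpha, score

# 5. Retrain the chosen configuration on train + validation.
final = Ridge(alpha=best_alpha).fit(np.vstack([X_train, X_val]),
                                    np.concatenate([y_train, y_val]))

# 6. Evaluate once on the test set.
test_mse = mean_squared_error(y_test, final.predict(X_test))
print(f"selected alpha = {best_alpha}, test MSE = {test_mse:.2f}")
```

Note that the test split is touched exactly once, after `best_alpha` has been fixed.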

Cross-Validation

When the dataset is too small to afford a large hold-out validation set, cross-validation gives a better estimate of generalization performance.

K-Fold Cross-Validation

In $K$-fold cross-validation:

  1. Split the data into $K$ folds.
  2. For each fold:
    • train on the other $K-1$ folds
    • evaluate on the held-out fold
  3. Average the validation scores

If $s_1, \dots, s_K$ are the fold scores, the cross-validation estimate is:

$$\hat{R}_{\text{CV}} = \frac{1}{K}\sum_{k=1}^{K} s_k$$

Typical choices are $K=5$ or $K=10$.
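A minimal implementation of this loop, using scikit-learn's `KFold` splitter with an illustrative ridge model and synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

K = 5
fold_scores = []
for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    # Train on the other K-1 folds...
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    # ...and score on the held-out fold.
    s_k = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
    fold_scores.append(s_k)

# The CV estimate is the average of the K fold scores.
cv_estimate = np.mean(fold_scores)
print(f"fold scores: {[round(s, 1) for s in fold_scores]}")
print(f"CV estimate: {cv_estimate:.2f}")
```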

When Cross-Validation Helps

  • Small or medium datasets where every example matters.
  • Model comparison when the variance of a single split is too high.
  • Hyperparameter tuning for classical ML models.

Limitations

  • It is computationally more expensive than a single hold-out split.
  • It can still be noisy for unstable models.
  • For time series or grouped data, standard random folds may be invalid.

Hyperparameters

A parameter is learned from data during training. A hyperparameter is chosen outside the inner training loop.

Examples:

  • regularization coefficient $\lambda$
  • polynomial degree
  • number of neighbors in k-NN
  • kernel bandwidth in an RBF kernel
  • learning rate

Hyperparameters control model capacity, optimization behavior, or inductive bias. They are selected using validation performance, not training performance.

Validation Curves and Model Complexity

A useful pattern is to plot training and validation error as a function of model complexity.

Typical behavior:

  • Low complexity: both training and validation error are high.
  • Moderate complexity: validation error improves.
  • High complexity: training error keeps dropping, but validation error eventually rises.

This curve is a concrete way to see the bias-variance tradeoff.
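One way to generate such a curve, using synthetic data and polynomial features (the specific degrees, noise level, and split sizes are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(80, 1))
y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.2, size=80)

# Simple train / validation split for the curve.
x_tr, y_tr, x_va, y_va = x[:60], y[:60], x[60:], y[60:]

train_err, val_err = {}, {}
for degree in (1, 3, 5, 9, 15):
    model = make_pipeline(PolynomialFeatures(degree),
                          LinearRegression()).fit(x_tr, y_tr)
    train_err[degree] = mean_squared_error(y_tr, model.predict(x_tr))
    val_err[degree] = mean_squared_error(y_va, model.predict(x_va))
    print(f"degree {degree:2d}: train MSE = {train_err[degree]:.3f}, "
          f"val MSE = {val_err[degree]:.3f}")
```

Training error can only go down as the degree grows (each model class contains the previous one); the validation column is the one that reveals where added complexity stops paying off.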

Common Hyperparameter Search Strategies

Grid Search

Define a small discrete grid of hyperparameter values and evaluate every combination.

Pros:

  • easy to implement
  • easy to reproduce

Cons:

  • expensive when many hyperparameters matter
  • wastes trials on unimportant dimensions

Random Search

Sample hyperparameter values from distributions instead of enumerating a fixed grid.

Pros:

  • often more efficient than grid search
  • explores wide ranges better

Cons:

  • results vary unless the seed is fixed

For expensive models, people often use:

  • Bayesian optimization
  • bandit-style schedulers
  • successive halving / Hyperband

Those methods are useful, but the main conceptual point is unchanged: hyperparameters must be chosen using data that was not used to fit the parameters.
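As one illustration, here is a small random-search sketch over the ridge penalty $\alpha$; the log-uniform range, trial budget, and dataset are all arbitrary choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
rng = np.random.default_rng(0)  # fixed seed so the search is reproducible

best_alpha, best_score = None, -np.inf
for _ in range(20):
    # Sample alpha log-uniformly in [1e-3, 1e3] rather than from a fixed grid.
    alpha = 10 ** rng.uniform(-3, 3)
    # Score each candidate with 5-fold CV on the training data only.
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best alpha ~= {best_alpha:.4g} (CV MSE {-best_score:.1f})")
```

The inner cross-validation plays the role of the validation split; no test data enters the search.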

Data Leakage

One of the fastest ways to get misleadingly good results is data leakage.

Leakage happens when information from validation or test data influences training.

Common causes:

  • normalizing features using the full dataset before splitting
  • selecting features using the full dataset
  • tuning hyperparameters on the test set
  • duplicating near-identical examples across train and test

The safe rule is:

  • fit preprocessing steps on the training split only
  • apply those fitted transformations to validation and test
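A scikit-learn `Pipeline` enforces this rule automatically: the scaler below is fit only on the training split, and is merely applied (not refit) when scoring on the test split. The dataset and model are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# fit() estimates the scaler's mean/std from X_train only,
# then trains the classifier on the scaled training data.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to X_test;
# the test data never influences the preprocessing statistics.
acc = pipe.score(X_test, y_test)
print(f"test accuracy: {acc:.3f}")
```

Calling `StandardScaler().fit(X)` on the full dataset before splitting would be exactly the leakage described above.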

Nested Model Selection

If you want an unbiased estimate of the full model-selection procedure itself, use nested cross-validation:

  • outer loop estimates final performance
  • inner loop chooses hyperparameters

This is more expensive, but it avoids optimistic bias when datasets are small and model selection is substantial.
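This can be sketched by wrapping an inner `GridSearchCV` inside an outer `cross_val_score`; the dataset and grid here are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=150, n_features=10, noise=5.0, random_state=0)

# Inner loop: chooses alpha by 3-fold CV on each outer training split.
inner = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                     cv=KFold(3, shuffle=True, random_state=0))

# Outer loop: estimates the performance of the *whole* procedure,
# hyperparameter search included.
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(5, shuffle=True, random_state=1),
                               scoring="neg_mean_squared_error")
print(f"nested CV MSE: {-outer_scores.mean():.2f} "
      f"(+/- {outer_scores.std():.2f})")
```

Each outer fold sees an alpha chosen without access to its own held-out data, which is what removes the optimistic bias.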

Practical Heuristics

  • Use a single validation split when data is abundant.
  • Use cross-validation when data is limited and models are not too expensive.
  • Report the test result once after all design choices are locked.
  • Keep track of the generalization gap:
$$\text{Gap} = \hat{R}_{\text{val}} - \hat{R}_{\text{train}}$$

A large positive gap is a warning sign for overfitting.

Early Stopping

For iterative methods such as gradient descent, the number of training iterations behaves like a hyperparameter.

If validation loss starts increasing while training loss keeps decreasing, stopping early can improve generalization. This is especially common in larger or more flexible models.
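A bare-bones version for gradient descent on linear regression (synthetic data; the step size, iteration budget, and checkpointing scheme are illustrative) keeps the weights with the best validation loss seen so far:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task (illustrative): only 3 of 20 features matter,
# so a fully converged fit on 70 points tends to overfit the noise.
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(scale=1.0, size=100)
X_tr, y_tr, X_va, y_va = X[:70], y[:70], X[70:], y[70:]

w = np.zeros(20)
lr = 0.01
best_val, best_w, best_step = np.inf, w.copy(), 0
for step in range(2000):
    # Gradient of the training MSE (up to a constant factor absorbed in lr).
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= lr * grad
    val = np.mean((X_va @ w - y_va) ** 2)
    if val < best_val:
        # Checkpoint the weights at the best validation loss so far.
        best_val, best_w, best_step = val, w.copy(), step

final_val = np.mean((X_va @ w - y_va) ** 2)
print(f"best validation MSE {best_val:.3f} at step {best_step}; "
      f"after all 2000 steps: {final_val:.3f}")
```

Returning `best_w` instead of the final `w` is the early-stopping decision: the iteration count has been tuned on the validation split, just like any other hyperparameter.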

What Good Reporting Looks Like

When comparing models, document:

  • the data split strategy
  • the evaluation metric
  • which hyperparameters were tuned
  • how they were selected
  • the final test result

Without that, strong-looking results are hard to trust.

Recap

1. Which split should be used to choose hyperparameters?

2. What is the main danger of repeatedly checking the test set during model development?

3. When is k-fold cross-validation especially useful?

What's Next

In the next chapter, we relax the purely linear view and look at prototype-based and kernel-based methods, which can represent more flexible decision boundaries while still preserving useful geometric intuition.