
Generalization Error and Hyperparameter Selection


In this chapter, you'll learn about:

  • Generalization: The gap between fitting the training set and performing well on new data.
  • Train / Validation / Test Splits: What each split is for and why mixing them is dangerous.
  • Cross-Validation: How to estimate out-of-sample performance when data is limited.
  • Hyperparameter Selection: Choosing model complexity, regularization strength, and other design choices.
  • Practical Evaluation Workflow: A disciplined way to compare models without leaking information.

Up to this point, most of our objectives were defined on the training data. That is necessary for learning, but it is not the real goal. The real goal is to perform well on new data drawn from the same underlying process.

This distinction is the core of machine learning:

  • We train on one finite dataset.
  • We care about performance on future examples.

The bridge between those two is called generalization.

Training Error vs. Generalization Error

Let $h$ be a trained model and let $\ell(h(\mathbf{x}), y)$ be a loss function.

The training error is the average loss on the data we optimized on:

$$\hat{R}_{\text{train}}(h) = \frac{1}{N}\sum_{i=1}^{N} \ell(h(\mathbf{x}^{(i)}), y^{(i)})$$

The generalization error is the expected loss on new samples from the data distribution:

$$R(h) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\left[\ell(h(\mathbf{x}), y)\right]$$

We can compute the training error exactly, but we usually cannot compute the true generalization error because the data distribution $\mathcal{D}$ is unknown.

This creates the main evaluation problem in ML: we need to estimate future performance from finite data.
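To make this concrete, here is a minimal sketch (using NumPy and scikit-learn on synthetic data; the model and noise level are illustrative choices) that computes the training error exactly and estimates $R(h)$ with a held-out split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: a noisy linear relationship (illustrative only).
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Training error: average loss on the data we optimized on.
train_mse = np.mean((model.predict(X_train) - y_train) ** 2)

# Held-out error: an *estimate* of the generalization error R(h),
# since the test points were never seen during fitting.
test_mse = np.mean((model.predict(X_test) - y_test) ** 2)

print(f"train MSE: {train_mse:.3f}, held-out MSE: {test_mse:.3f}")
```

The held-out figure is itself a random quantity; it approximates $R(h)$ but fluctuates with the particular split.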

Why Training Error Is Not Enough

A flexible model can drive training error very low by fitting idiosyncrasies of the dataset, including noise. This is overfitting.

Two models can therefore have very different behavior:

  • Model A has slightly higher training error but performs well on new data.
  • Model B has near-zero training error but poor test-time behavior.

The second model looks better if you only inspect the training objective, but it is worse for the actual task.

Bias, Variance, and Irreducible Noise

One useful mental model is the bias-variance decomposition for squared loss:

$$\mathbb{E}\left[(h(\mathbf{x}) - y)^2\right] = \text{Bias}(\mathbf{x})^2 + \text{Variance}(\mathbf{x}) + \sigma^2$$

where:

  • Bias measures how far the average prediction is from the true target function.
  • Variance measures how much the learned predictor changes when the training set changes.
  • $\sigma^2$ is irreducible noise in the data-generating process.

This decomposition is most precise in regression with squared loss, but the intuition is broader:

  • Very simple models often have high bias and low variance.
  • Very flexible models often have low bias and high variance.

Good model selection is largely about finding the right balance.
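The tradeoff can be simulated directly. The sketch below (synthetic data; the target function, degrees, and noise level are all arbitrary choices) refits polynomials of several degrees on many fresh training sets and measures squared bias and variance at a single query point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # The (normally unknown) target function, chosen for illustration.
    return np.sin(2 * np.pi * x)

def predictions_at(degree, x0, n=30, trials=200, noise=0.3):
    """Fit a degree-`degree` polynomial on `trials` fresh training sets
    and collect each fitted model's prediction at the query point x0."""
    preds = []
    for _ in range(trials):
        x = rng.uniform(0, 1, n)
        y = true_f(x) + rng.normal(scale=noise, size=n)
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x0))
    return np.array(preds)

x0 = 0.25
results = {}
for degree in (1, 3, 9):
    preds = predictions_at(degree, x0)
    results[degree] = {
        "bias2": (preds.mean() - true_f(x0)) ** 2,  # squared bias at x0
        "var": preds.var(),                          # variance across training sets
    }
    print(f"degree {degree}: bias^2 = {results[degree]['bias2']:.4f}, "
          f"variance = {results[degree]['var']:.4f}")
```

Typically the linear fit shows large squared bias and small variance, while the degree-9 fit shows the reverse.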

Data Splits

The standard workflow uses three roles for data.

Training Set

Used to fit parameters:

  • weights in linear regression
  • coefficients in logistic regression
  • prototype locations or kernel weights in richer models

Validation Set

Used to make modeling decisions such as:

  • polynomial degree
  • regularization strength $\lambda$
  • learning rate or batch size
  • which model family to use

Test Set

Used once, at the end, to estimate final performance after all design choices are fixed.

The test set should act like a proxy for future deployment data. If you repeatedly tune against it, it stops being a fair estimate.

A Safe Evaluation Workflow

  1. Split data into train / validation / test.
  2. Train candidate models on the training split.
  3. Compare them on the validation split.
  4. Select the best configuration.
  5. Retrain on train + validation if appropriate.
  6. Evaluate once on the test set.

This keeps model selection separate from final reporting.
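The six steps above can be sketched with scikit-learn; the dataset, the ridge model, and the candidate $\alpha$ values are all illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# 1. Split into train / validation / test (here 60 / 20 / 20).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# 2-4. Train candidates on the training split, compare on validation, select the best.
best_alpha, best_score = None, float("inf")
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    score = mean_squared_error(y_val, model.predict(X_val))
    if score < best_score:
        best_alpha, best_score = alpha, score

# 5. Retrain the chosen configuration on train + validation.
final = Ridge(alpha=best_alpha).fit(np.vstack([X_train, X_val]),
                                    np.concatenate([y_train, y_val]))

# 6. Evaluate once on the test set.
test_mse = mean_squared_error(y_test, final.predict(X_test))
print(f"selected alpha = {best_alpha}, test MSE = {test_mse:.2f}")
```

Note that the test split is touched exactly once, after `best_alpha` has been fixed.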

Cross-Validation

When the dataset is too small to afford a large hold-out validation set, cross-validation gives a better estimate of generalization performance.

K-Fold Cross-Validation

In $K$-fold cross-validation:

  1. Split the data into $K$ folds.
  2. For each fold:
    • train on the other $K-1$ folds
    • evaluate on the held-out fold
  3. Average the validation scores

If $s_1, \dots, s_K$ are the fold scores, the cross-validation estimate is:

$$\hat{R}_{\text{CV}} = \frac{1}{K}\sum_{k=1}^{K} s_k$$

Typical choices are $K=5$ or $K=10$.
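A minimal implementation of this loop, using scikit-learn's `KFold` splitter with an illustrative ridge model and synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

K = 5
fold_scores = []
for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    # Train on the other K-1 folds...
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    # ...and score on the held-out fold.
    s_k = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
    fold_scores.append(s_k)

# The CV estimate is the average of the K fold scores.
cv_estimate = np.mean(fold_scores)
print(f"fold scores: {[round(s, 1) for s in fold_scores]}")
print(f"CV estimate: {cv_estimate:.2f}")
```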

When Cross-Validation Helps

  • Small or medium datasets where every example matters.
  • Model comparison when the variance of a single split is too high.
  • Hyperparameter tuning for classical ML models.

Limitations

  • It is computationally more expensive than a single hold-out split.
  • It can still be noisy for unstable models.
  • For time series or grouped data, standard random folds may be invalid.

Hyperparameters

A parameter is learned from data during training. A hyperparameter is chosen outside the inner training loop.

Examples:

  • regularization coefficient $\lambda$
  • polynomial degree
  • number of neighbors in k-NN
  • kernel bandwidth in an RBF kernel
  • learning rate

Hyperparameters control model capacity, optimization behavior, or inductive bias. They are selected using validation performance, not training performance.

Validation Curves and Model Complexity

A useful pattern is to plot training and validation error as a function of model complexity.

Typical behavior:

  • Low complexity: both training and validation error are high.
  • Moderate complexity: validation error improves.
  • High complexity: training error keeps dropping, but validation error eventually rises.

This curve is a concrete way to see the bias-variance tradeoff.
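One way to generate such a curve, using synthetic data and polynomial features (the specific degrees, noise level, and split sizes are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(80, 1))
y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.2, size=80)

# Simple train / validation split for the curve.
x_tr, y_tr, x_va, y_va = x[:60], y[:60], x[60:], y[60:]

train_err, val_err = {}, {}
for degree in (1, 3, 5, 9, 15):
    model = make_pipeline(PolynomialFeatures(degree),
                          LinearRegression()).fit(x_tr, y_tr)
    train_err[degree] = mean_squared_error(y_tr, model.predict(x_tr))
    val_err[degree] = mean_squared_error(y_va, model.predict(x_va))
    print(f"degree {degree:2d}: train MSE = {train_err[degree]:.3f}, "
          f"val MSE = {val_err[degree]:.3f}")
```

Training error can only go down as the degree grows (each model class contains the previous one); the validation column is the one that reveals where added complexity stops paying off.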

Common Hyperparameter Search Strategies

Grid Search

Define a small discrete grid of hyperparameter values and evaluate every combination.

Pros:

  • easy to implement
  • easy to reproduce

Cons:

  • expensive when many hyperparameters matter
  • wastes trials on unimportant dimensions

Random Search

Sample hyperparameter values from distributions instead of enumerating a fixed grid.

Pros:

  • often more efficient than grid search
  • explores wide ranges better

Cons:

  • results vary unless the seed is fixed

For expensive models, people often use:

  • Bayesian optimization
  • bandit-style schedulers
  • successive halving / Hyperband

Those methods are useful, but the main conceptual point is unchanged: hyperparameters must be chosen using data that was not used to fit the parameters.
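As one illustration, here is a small random-search sketch over the ridge penalty $\alpha$; the log-uniform range, trial budget, and dataset are all arbitrary choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
rng = np.random.default_rng(0)  # fixed seed so the search is reproducible

best_alpha, best_score = None, -np.inf
for _ in range(20):
    # Sample alpha log-uniformly in [1e-3, 1e3] rather than from a fixed grid.
    alpha = 10 ** rng.uniform(-3, 3)
    # Score each candidate with 5-fold CV on the training data only.
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best alpha ~= {best_alpha:.4g} (CV MSE {-best_score:.1f})")
```

The inner cross-validation plays the role of the validation split; no test data enters the search.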

Data Leakage

One of the fastest ways to get misleadingly good results is data leakage.

Leakage happens when information from validation or test data influences training.

Common causes:

  • normalizing features using the full dataset before splitting
  • selecting features using the full dataset
  • tuning hyperparameters on the test set
  • duplicating near-identical examples across train and test

The safe rule is:

  • fit preprocessing steps on the training split only
  • apply those fitted transformations to validation and test
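A scikit-learn `Pipeline` enforces this rule automatically: the scaler below is fit only on the training split, and is merely applied (not refit) when scoring on the test split. The dataset and model are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# fit() estimates the scaler's mean/std from X_train only,
# then trains the classifier on the scaled training data.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to X_test;
# the test data never influences the preprocessing statistics.
acc = pipe.score(X_test, y_test)
print(f"test accuracy: {acc:.3f}")
```

Calling `StandardScaler().fit(X)` on the full dataset before splitting would be exactly the leakage described above.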

Nested Model Selection

If you want an unbiased estimate of the full model-selection procedure itself, use nested cross-validation:

  • outer loop estimates final performance
  • inner loop chooses hyperparameters

This is more expensive, but it avoids optimistic bias when datasets are small and model selection is substantial.
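This can be sketched by wrapping an inner `GridSearchCV` inside an outer `cross_val_score`; the dataset and grid here are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=150, n_features=10, noise=5.0, random_state=0)

# Inner loop: chooses alpha by 3-fold CV on each outer training split.
inner = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                     cv=KFold(3, shuffle=True, random_state=0))

# Outer loop: estimates the performance of the *whole* procedure,
# hyperparameter search included.
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(5, shuffle=True, random_state=1),
                               scoring="neg_mean_squared_error")
print(f"nested CV MSE: {-outer_scores.mean():.2f} "
      f"(+/- {outer_scores.std():.2f})")
```

Each outer fold sees an alpha chosen without access to its own held-out data, which is what removes the optimistic bias.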

Practical Heuristics

  • Use a single validation split when data is abundant.
  • Use cross-validation when data is limited and models are not too expensive.
  • Report the test result once after all design choices are locked.
  • Keep track of the generalization gap:
$$\text{Gap} = \hat{R}_{\text{val}} - \hat{R}_{\text{train}}$$

A large positive gap is a warning sign for overfitting.

Early Stopping

For iterative methods such as gradient descent, the number of training iterations behaves like a hyperparameter.

If validation loss starts increasing while training loss keeps decreasing, stopping early can improve generalization. This is especially common in larger or more flexible models.
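A bare-bones version for gradient descent on linear regression (synthetic data; the step size, iteration budget, and checkpointing scheme are illustrative) keeps the weights with the best validation loss seen so far:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task (illustrative): only 3 of 20 features matter,
# so a fully converged fit on 70 points tends to overfit the noise.
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(scale=1.0, size=100)
X_tr, y_tr, X_va, y_va = X[:70], y[:70], X[70:], y[70:]

w = np.zeros(20)
lr = 0.01
best_val, best_w, best_step = np.inf, w.copy(), 0
for step in range(2000):
    # Gradient of the training MSE (up to a constant factor absorbed in lr).
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= lr * grad
    val = np.mean((X_va @ w - y_va) ** 2)
    if val < best_val:
        # Checkpoint the weights at the best validation loss so far.
        best_val, best_w, best_step = val, w.copy(), step

final_val = np.mean((X_va @ w - y_va) ** 2)
print(f"best validation MSE {best_val:.3f} at step {best_step}; "
      f"after all 2000 steps: {final_val:.3f}")
```

Returning `best_w` instead of the final `w` is the early-stopping decision: the iteration count has been tuned on the validation split, just like any other hyperparameter.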

What Good Reporting Looks Like

When comparing models, document:

  • the data split strategy
  • the evaluation metric
  • which hyperparameters were tuned
  • how they were selected
  • the final test result

Without that, strong-looking results are hard to trust.

Recap

1. Which split should be used to choose hyperparameters?

2. What is the main danger of repeatedly checking the test set during model development?

3. When is k-fold cross-validation especially useful?

What's Next

In the next chapter, we relax the purely linear view and look at prototype-based and kernel-based methods, which can represent more flexible decision boundaries while still preserving useful geometric intuition.