Generalization Error and Hyperparameter Selection
In this chapter, you'll learn about:
- Generalization: The gap between fitting the training set and performing well on new data.
- Train / Validation / Test Splits: What each split is for and why mixing them is dangerous.
- Cross-Validation: How to estimate out-of-sample performance when data is limited.
- Hyperparameter Selection: Choosing model complexity, regularization strength, and other design choices.
- Practical Evaluation Workflow: A disciplined way to compare models without leaking information.
Up to this point, most of our objectives were defined on the training data. That is necessary for learning, but it is not the real goal. The real goal is to perform well on new data drawn from the same underlying process.
This distinction is the core of machine learning:
- We train on one finite dataset.
- We care about performance on future examples.
The bridge between those two is called generalization.
Training Error vs. Generalization Error
Let $f$ be a trained model and let $\ell(f(x), y)$ be a loss function.
The training error is the average loss on the data we optimized on:
$$\hat{R}_{\text{train}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$
The generalization error is the expected loss on new samples from the data distribution $\mathcal{D}$:
$$R(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\ell(f(x), y)\right]$$
We can compute the training error exactly, but we usually cannot compute the true generalization error because the data distribution is unknown.
This creates the main evaluation problem in ML: we need to estimate future performance from finite data.
Why Training Error Is Not Enough
A flexible model can drive training error very low by fitting idiosyncrasies of the dataset, including noise. This is overfitting.
Two models can therefore have very different behavior:
- Model A has slightly higher training error but performs well on new data.
- Model B has near-zero training error but poor test-time behavior.
The second model looks better if you only inspect the training objective, but it is worse for the actual task.
Bias, Variance, and Irreducible Noise
One useful mental model is the bias-variance decomposition for squared loss:
$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \mathrm{Bias}\left[\hat{f}(x)\right]^2 + \mathrm{Var}\left[\hat{f}(x)\right] + \sigma^2$$
where:
- Bias measures how far the average prediction is from the true target function.
- Variance measures how much the learned predictor changes when the training set changes.
- $\sigma^2$ is irreducible noise in the data-generating process.
This decomposition is most precise in regression with squared loss, but the intuition is broader:
- Very simple models often have high bias and low variance.
- Very flexible models often have low bias and high variance.
Good model selection is largely about finding the right balance.
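The tradeoff can be seen empirically. The sketch below (purely illustrative: a toy quadratic target and two made-up predictors) resamples many training sets and measures bias and variance at a single query point. The constant predictor ignores the input entirely; the 1-nearest-neighbor predictor fits every training set closely.

```python
import random

rng = random.Random(0)

def sample_dataset(n=20):
    """Draw a fresh training set from y = x^2 + noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [x ** 2 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

def predict_constant(xs, ys, x):
    # Very simple model: always predict the training mean.
    return sum(ys) / len(ys)

def predict_1nn(xs, ys, x):
    # Very flexible model: copy the target of the nearest training point.
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
    return ys[i]

x0, true_y0 = 0.5, 0.25  # query point and its noise-free target
results = {}
for name, predictor in [("constant", predict_constant), ("1-NN", predict_1nn)]:
    preds = [predictor(*sample_dataset(), x0) for _ in range(2000)]
    mean_pred = sum(preds) / len(preds)
    bias_sq = (mean_pred - true_y0) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
    results[name] = (bias_sq, variance)
    print(f"{name}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

Running this shows the constant model with higher bias but lower variance, and 1-NN with the reverse, matching the two bullet points above.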
Data Splits
The standard workflow uses three roles for data.
Training Set
Used to fit parameters:
- weights in linear regression
- coefficients in logistic regression
- prototype locations or kernel weights in richer models
Validation Set
Used to make modeling decisions such as:
- polynomial degree
- regularization strength
- learning rate or batch size
- which model family to use
Test Set
Used once, at the end, to estimate final performance after all design choices are fixed.
The test set should act like a proxy for future deployment data. If you repeatedly tune against it, it stops being a fair estimate.
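As a concrete sketch (the function name and the 70/15/15 fractions are illustrative defaults, not a standard), a three-way split can be as simple as shuffling indices once and slicing:

```python
import random

def three_way_split(n, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle indices 0..n-1 once, then carve off test and validation portions."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15
```

Fixing the seed makes the split reproducible, so the same test set stays untouched across experiments.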
A Safe Evaluation Workflow
- Split data into train / validation / test.
- Train candidate models on the training split.
- Compare them on the validation split.
- Select the best configuration.
- Retrain on train + validation if appropriate.
- Evaluate once on the test set.
This keeps model selection separate from final reporting.
Cross-Validation
When the dataset is too small to afford a large hold-out validation set, cross-validation gives a better estimate of generalization performance.
K-Fold Cross-Validation
In $K$-fold cross-validation:
- Split the data into $K$ folds of roughly equal size.
- For each fold:
  - train on the other $K - 1$ folds
  - evaluate on the held-out fold
- Average the validation scores.
If $s_1, \dots, s_K$ are the fold scores, the cross-validation estimate is:
$$\hat{s}_{\text{CV}} = \frac{1}{K} \sum_{k=1}^{K} s_k$$
Typical choices are $K = 5$ or $K = 10$.
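A minimal hand-rolled version of this loop (the function names are our own; libraries such as scikit-learn provide polished equivalents) looks like:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_score(xs, ys, fit, score, k=5):
    """Average the held-out score over k train/evaluate rounds."""
    folds = k_fold_indices(len(xs), k)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in held_out],
                            [ys[i] for i in held_out]))
    return sum(scores) / k

# Toy usage: the "model" is just the training mean, scored by mean squared error.
fit_mean = lambda xs, ys: sum(ys) / len(ys)
mse = lambda m, xs, ys: sum((y - m) ** 2 for y in ys) / len(ys)
xs = list(range(20))
ys = [2.0 * x for x in xs]
estimate = cross_val_score(xs, ys, fit_mean, mse, k=5)
```

Every example is used for validation exactly once, which is what makes the averaged estimate less wasteful than a single hold-out split.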
When Cross-Validation Helps
- Small or medium datasets where every example matters.
- Model comparison when the variance of a single split is too high.
- Hyperparameter tuning for classical ML models.
Limitations
- It is computationally more expensive than a single hold-out split.
- It can still be noisy for unstable models.
- For time series or grouped data, standard random folds may be invalid.
Hyperparameters
A parameter is learned from data during training. A hyperparameter is chosen outside the inner training loop.
Examples:
- regularization coefficient
- polynomial degree
- number of neighbors in k-NN
- kernel bandwidth in an RBF kernel
- learning rate
Hyperparameters control model capacity, optimization behavior, or inductive bias. They are selected using validation performance, not training performance.
Validation Curves and Model Complexity
A useful pattern is to plot training and validation error as a function of model complexity.
Typical behavior:
- Low complexity: both training and validation error are high.
- Moderate complexity: validation error improves.
- High complexity: training error keeps dropping, but validation error eventually rises.
This curve is a concrete way to see the bias-variance tradeoff.
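As an illustrative sketch (all names are our own), the same pattern appears if we vary the flexibility of a 1-D k-NN regressor on noisy data: smaller k means a more flexible model, so reading the loop from k = 1 upward traces the complexity axis from right to left.

```python
import math
import random

rng = random.Random(0)
xs = [rng.uniform(0, 6) for _ in range(80)]
ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
train_x, train_y = xs[:60], ys[:60]
val_x, val_y = xs[60:], ys[60:]

def knn_predict(x, k):
    """Average the targets of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(pts_x, pts_y, k):
    return sum((knn_predict(x, k) - y) ** 2
               for x, y in zip(pts_x, pts_y)) / len(pts_x)

train_curve, val_curve = {}, {}
for k in [1, 3, 5, 10, 30, 60]:  # most flexible first
    train_curve[k] = mse(train_x, train_y, k)
    val_curve[k] = mse(val_x, val_y, k)
    print(k, round(train_curve[k], 3), round(val_curve[k], 3))
```

At k = 1 the training error is exactly zero (each point is its own nearest neighbor) while validation error is not, and at k = 60 both errors are high: the two ends of the validation curve.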
Common Hyperparameter Search Strategies
Grid Search
Define a small discrete grid and evaluate every combination.
Pros:
- easy to implement
- easy to reproduce
Cons:
- expensive when many hyperparameters matter
- wastes trials on unimportant dimensions
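Enumerating a grid is a one-liner with `itertools.product`; the hyperparameter names and values below are made up for illustration:

```python
from itertools import product

grid = {"degree": [1, 2, 3], "alpha": [0.01, 0.1, 1.0]}

# Cartesian product of all value lists -> one dict per configuration.
configs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(configs))  # 3 x 3 = 9 configurations to score on the validation split
```

Note how the cost multiplies: adding a third hyperparameter with five values would already mean 45 trials.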
Random Search
Sample hyperparameters from distributions instead of enumerating a fixed grid.
Pros:
- often more efficient than grid search
- explores wide ranges better
Cons:
- results vary unless the seed is fixed
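A sketch with made-up hyperparameters; note the log-uniform sampling, which is a common choice for scale parameters such as regularization strength:

```python
import random

rng = random.Random(42)  # fix the seed so the trial list is reproducible

def sample_config():
    return {
        "alpha": 10 ** rng.uniform(-4, 0),  # log-uniform over [1e-4, 1]
        "degree": rng.randint(1, 8),        # uniform over the integers 1..8
    }

trials = [sample_config() for _ in range(20)]
```

Each trial draws every hyperparameter independently, so 20 trials probe 20 distinct values of `alpha` rather than the 3 a small grid would allow.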
Smarter Search
For expensive models, people often use:
- Bayesian optimization
- bandit-style schedulers
- successive halving / Hyperband
Those methods are useful, but the main conceptual point is unchanged: hyperparameters must be chosen using data that was not used to fit the parameters.
Data Leakage
One of the fastest ways to get misleadingly good results is data leakage.
Leakage happens when information from validation or test data influences training.
Common causes:
- normalizing features using the full dataset before splitting
- selecting features using the full dataset
- tuning hyperparameters on the test set
- duplicating near-identical examples across train and test
The safe rule is:
- fit preprocessing steps on the training split only
- apply those fitted transformations to validation and test
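For example, standardization statistics should come from the training split alone (a minimal sketch; the helper names are ours):

```python
def fit_standardizer(values):
    """Compute mean and standard deviation from the TRAINING split only."""
    mu = sum(values) / len(values)
    sigma = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    return mu, sigma

def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

train_feature = [2.0, 4.0, 6.0, 8.0]
val_feature = [5.0, 10.0]

mu, sigma = fit_standardizer(train_feature)      # statistics from train only
train_z = standardize(train_feature, mu, sigma)
val_z = standardize(val_feature, mu, sigma)      # reuse the train statistics
```

The leaky version computes `mu` and `sigma` over all the data before splitting, letting validation and test values shift the training features.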
Nested Model Selection
If you want an unbiased estimate of the full model-selection procedure itself, use nested cross-validation:
- outer loop estimates final performance
- inner loop chooses hyperparameters
This is more expensive, but it avoids optimistic bias when datasets are small and model selection is substantial.
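The structure is two nested fold loops. In the sketch below (not a library API; the toy "model" is just a mean shrunk toward zero, with `alpha` as its hyperparameter), the inner loop never sees the outer held-out fold:

```python
import random

def make_folds(n, k, seed=0):
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit(ys, alpha):
    """Toy model: a mean shrunk toward zero; alpha is the hyperparameter."""
    return sum(ys) / (len(ys) + alpha)

def mse(pred, ys):
    return sum((y - pred) ** 2 for y in ys) / len(ys)

rng = random.Random(1)
data = [rng.gauss(3.0, 1.0) for _ in range(60)]
alphas = [0.0, 1.0, 10.0]

outer = make_folds(len(data), 5, seed=0)
outer_scores = []
for i, test_fold in enumerate(outer):
    dev = [data[j] for f in (outer[:i] + outer[i + 1:]) for j in f]
    # Inner loop: choose alpha by 4-fold CV on the development data only.
    best_alpha, best_cv = None, float("inf")
    for alpha in alphas:
        inner = make_folds(len(dev), 4, seed=1)
        cv = sum(
            mse(fit([dev[t] for f in (inner[:j] + inner[j + 1:]) for t in f],
                    alpha),
                [dev[t] for t in val_fold])
            for j, val_fold in enumerate(inner)
        ) / 4
        if cv < best_cv:
            best_alpha, best_cv = alpha, cv
    # Outer loop: score the chosen configuration on the untouched fold.
    outer_scores.append(mse(fit(dev, best_alpha), [data[j] for j in test_fold]))

estimate = sum(outer_scores) / len(outer_scores)
```

Because each outer fold is scored with an `alpha` chosen without ever touching it, `estimate` measures the whole selection procedure, not just one fitted model.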
Practical Heuristics
- Use a single validation split when data is abundant.
- Use cross-validation when data is limited and models are not too expensive.
- Report the test result once after all design choices are locked.
- Keep track of the generalization gap, the difference between estimated out-of-sample error and training error:
$$\text{gap} = \hat{R}_{\text{val}} - \hat{R}_{\text{train}}$$
A large positive gap is a warning sign of overfitting.
Early Stopping
For iterative methods such as gradient descent, the number of training iterations behaves like a hyperparameter.
If validation loss starts increasing while training loss keeps decreasing, stopping early can improve generalization. This is especially common in larger or more flexible models.
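The stopping rule itself is simple. Here is a sketch using a "patience" window, run on a synthetic validation-loss trace (the numbers are made up to show the falling-then-rising shape):

```python
def early_stop_index(val_losses, patience=3):
    """Return the step with the best validation loss, stopping once it
    has failed to improve for `patience` consecutive steps."""
    best, best_step, waited = float("inf"), 0, 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best, best_step, waited = loss, step, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_step

# Validation loss falls, then rises as the model starts to overfit.
trace = [1.00, 0.70, 0.52, 0.45, 0.44, 0.47, 0.50, 0.55, 0.61]
print(early_stop_index(trace))  # 4
```

In practice one also saves the model checkpoint at `best_step`, since training has already moved past it by the time the rule fires.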
What Good Reporting Looks Like
When comparing models, document:
- the data split strategy
- the evaluation metric
- which hyperparameters were tuned
- how they were selected
- the final test result
Without that, strong-looking results are hard to trust.
Recap
1. Which split should be used to choose hyperparameters?
2. What is the main danger of repeatedly checking the test set during model development?
3. When is k-fold cross-validation especially useful?
What's Next
In the next chapter, we relax the purely linear view and look at prototype-based and kernel-based methods, which can represent more flexible decision boundaries while still preserving useful geometric intuition.