Supervised Learning

info

In this chapter you'll be introduced to:

  • Supervised Learning: Understanding how to formulate supervised learning tasks and the components involved.
  • Regression Problems: Focusing on regression where the target variable is continuous.
  • Machine Learning Systems: The steps involved in building a machine learning model, including defining hypotheses and selection criteria.
  • Hypothesis Class and Overfitting vs. Underfitting: Why specifying a hypothesis class is essential, and how trade-offs in model complexity affect performance.

What is Supervised Learning?

Supervised Learning is a type of machine learning where the model is trained using labeled data. The goal is to learn a function that maps inputs to desired outputs based on example input-output pairs. It involves two main stages:

  1. Training (Learning): Learning a model from a set of input-output pairs.
  2. Prediction (Inference): Using the learned model to predict outputs for new, unseen inputs.

In supervised learning, the task is to predict a target variable based on some input variables. The target variable is known during training, allowing the model to learn the relationship between inputs and outputs.

Formulating a Supervised Learning Task

To formulate a supervised learning task, we need to:

  • Define the Input Variables ($\mathbf{x}$): Features or attributes used to make predictions.
  • Define the Target Variable ($t$): The outcome we want to predict.
  • Specify the Task: For example, predicting the weight of an animal based on its size and fur texture.

Each task requires a specific target variable and input features. The formulation depends on the problem we aim to solve.

Input Features

The input in supervised learning is denoted as $\mathbf{x}$, a vector of features known during both training and prediction. These features can be:

  • Real Numbers: Continuous values like size, weight, or temperature.
  • Categorical Variables: Discrete categories like animal type or color.

If we have $d$ features, the input vector is:

$$\mathbf{x} = [x_1, x_2, \dots, x_d]^T$$

Handling Categorical Features

Categorical features cannot be directly used in mathematical models. We need to convert them into numerical representations.

One-Hot Encoding

One common method is one-hot encoding, where each category is represented by a binary vector.

  • Example: Animal Type (Cat, Dog, Elephant)
    • Cat: [1, 0, 0]
    • Dog: [0, 1, 0]
    • Elephant: [0, 0, 1]

This avoids assigning arbitrary numerical values that could imply an unintended order or magnitude between categories.
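As a small sketch, one-hot encoding can be implemented directly with NumPy (the `one_hot` helper and the category list below are illustrative, not part of any particular library):

```python
import numpy as np

# A minimal one-hot encoder for a list of categorical values (illustrative sketch).
def one_hot(values, categories):
    """Map each value to a binary vector with a 1 at its category's index."""
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, v in enumerate(values):
        encoded[row, index[v]] = 1
    return encoded

animals = ["Cat", "Dog", "Elephant", "Dog"]
X = one_hot(animals, categories=["Cat", "Dog", "Elephant"])
# X[0] is [1, 0, 0] (Cat), X[1] is [0, 1, 0] (Dog), and so on.
```

In practice, libraries such as scikit-learn provide ready-made encoders, but the underlying idea is exactly this index-to-binary-vector mapping.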

We'll see other representations in later chapters.

Output Targets

The target variable is denoted as $t$. Depending on its nature, supervised learning tasks are classified as:

  • Regression: If $t$ is a real number (continuous).
  • Classification: If $t$ is a category from a finite set.

In the first few chapters of this course, we focus on regression problems.

Regression Problems

In regression, the goal is to learn a function that maps input features to a continuous output. The regression model is denoted as $h$, where:

$$h: \mathbb{R}^d \rightarrow \mathbb{R}$$

  • $d$: The number of input features.
  • $h(\mathbf{x})$: The predicted output for input $\mathbf{x}$.

Training Data

The training data consists of $M$ examples:

$$\{ (\mathbf{x}^{(m)}, t^{(m)}) \}_{m=1}^M$$

  • $\mathbf{x}^{(m)}$: Input vector for the $m$-th example.
  • $t^{(m)}$: Target output for the $m$-th example.

Prediction

Given a new input $\mathbf{x}^*$, the model predicts the target as:

$$\hat{t} = h(\mathbf{x}^*)$$
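The setup above can be made concrete with a small sketch: a hand-picked linear hypothesis $h(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$ applied to a toy training set (the weights and data here are arbitrary, chosen only for illustration, not learned):

```python
import numpy as np

# Illustrative sketch: a fixed linear hypothesis h(x) = w·x + b
# (the weights are hand-picked, not the result of training).
w = np.array([2.0, 0.5])   # one weight per feature (d = 2)
b = 1.0

def h(x):
    """Predict a continuous target from a d-dimensional input vector."""
    return float(w @ x + b)

# Training data: M = 3 examples of (input vector, target).
X_train = np.array([[1.0, 2.0], [0.0, 4.0], [3.0, 1.0]])
t_train = np.array([4.0, 3.0, 7.5])

# Prediction for a new, unseen input x*.
x_new = np.array([2.0, 2.0])
t_hat = h(x_new)   # 2*2 + 0.5*2 + 1 = 6.0
```

How to choose $\mathbf{w}$ and $b$ from the training data is exactly what the training step, described next, is responsible for.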

Components of a Machine Learning System

Building a supervised learning model involves several steps:

1. Defining the Hypothesis Class

Humans define a set of possible models (functions) that the machine learning algorithm can choose from. This set is called the hypothesis class, denoted as $\mathcal{H}$.

$$\mathcal{H} \subseteq \{ h \mid h: \mathbb{R}^d \rightarrow \mathbb{R} \}$$

Why Specify a Hypothesis Class?
  • Limiting the hypothesis class prevents the model from fitting noise in the training data, i.e., overfitting.
  • Without restrictions, the model could choose any function, making learning from finite data impossible.

Overfitting

When a model is too complex, it fits the training data very well but performs poorly on new data.

  • Cause: Hypothesis class is too powerful (too many functions).

Underfitting

When a model is too simple, it cannot capture the underlying pattern of the data.

  • Cause: Hypothesis class is too limited.

In this step, we need to find a balance: the hypothesis class should be complex enough to capture the patterns in the data, yet simple enough to generalize well to unseen data.

Example

A comparison of underfitting, balanced fitting, and overfitting, from left to right1.

2. Defining the Selection Criterion

Humans specify a selection criterion (objective function) that evaluates how well a model fits the training data. Denoted as $J$, it is a function of the hypothesis $h$ and the training data.

$$J(h) = \text{Evaluation of } h \text{ on training data}$$

Our goal is to find the hypothesis $h$ that minimizes $J(h)$.

3. Training the Model

The machine learning algorithm selects the best hypothesis from $\mathcal{H}$ by minimizing the selection criterion:

$$h^* = \arg\min_{h \in \mathcal{H}} J(h)$$
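For a tiny finite hypothesis class, this argmin can be computed literally by exhaustive search (an illustrative sketch, not a practical training algorithm — real classes are usually infinite and require optimization methods):

```python
import numpy as np

# Sketch: a finite hypothesis class of lines h_w(x) = w * x, and
# J(h) = mean squared error on the training data. "Training" here is
# an exhaustive argmin over the class (illustrative only).
X_train = np.array([0.0, 1.0, 2.0, 3.0])
t_train = np.array([0.1, 2.1, 3.9, 6.2])   # roughly t = 2x

# 41 candidate slopes between 0 and 4 define the hypothesis class.
hypothesis_class = [lambda x, w=w: w * x for w in np.linspace(0.0, 4.0, 41)]

def J(h):
    """Selection criterion: average squared error on the training set."""
    preds = np.array([h(x) for x in X_train])
    return np.mean((preds - t_train) ** 2)

# h* = argmin over the hypothesis class of J(h).
h_star = min(hypothesis_class, key=J)
# h_star is close to the line t = 2x, matching the data.
```

The `key=J` argument makes `min` return the hypothesis with the smallest criterion value, which is exactly the $\arg\min$ in the formula above.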

4. Making Predictions

With the trained model $h^*$, predictions are made on new, unseen data $\mathbf{x}$:

$$\hat{t} = h^*(\mathbf{x})$$

5. Evaluating the Model

The model's performance is evaluated using a measure of success or error function $E(h)$ on unseen data:

$$E(h) = \text{Evaluation of } h \text{ on unseen data}$$

The objective is to assess how well the model generalizes to new data.

  • Note: $E(h)$ is often different from the training criterion $J(h)$.
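A common way to estimate $E(h)$ is to hold out part of the data and evaluate on it. The sketch below (an assumed toy setup) trains a line by least squares, so $J$ is squared error on the training portion, while $E$ is measured here as mean absolute error on the held-out portion — illustrating that the two criteria need not coincide:

```python
import numpy as np

# Toy data: a noisy line t = 3x + noise (assumed setup for illustration).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
t = 3.0 * x + rng.normal(0, 0.1, size=100)

# Hold out the last 20 examples as "unseen" data for evaluation.
x_train, t_train = x[:80], t[:80]
x_test, t_test = x[80:], t[80:]

# Fit a line by least squares (this minimizes the training criterion J).
w, b = np.polyfit(x_train, t_train, 1)

def E(pred, target):
    """Measure of success on unseen data: mean absolute error."""
    return np.mean(np.abs(pred - target))

test_error = E(w * x_test + b, t_test)   # should be near the noise level
```

Because the held-out examples played no role in fitting, `test_error` estimates how well the model generalizes, which is precisely what $E(h)$ is meant to capture.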

Recap

In this chapter, we've covered:

  • Formulating Supervised Learning Tasks: Understanding how to define inputs and targets.
  • Input and Output Representation: Handling real and categorical variables.
  • Regression Problems: Focusing on predicting continuous target variables.
  • Components of Machine Learning Systems: Defining hypothesis classes, selection criteria, and the training process.

Footnotes

  1. Portions of this page are reproduced from work created and shared by Scikit-learn and used according to terms described in the BSD 3-Clause License.