
Softmax Regression

info

In this chapter, you'll learn about:

  • Limitations of Linear Regression for Classification: Understanding why mean squared error is not ideal for classification tasks.
  • Extension from Binary to Multiclass Classification: Generalizing logistic regression to handle multiple classes.
  • Softmax Function: Introducing the softmax function to model multiclass probabilities.
  • Cross-Entropy Loss for Multiclass Classification: Formulating the loss function suitable for multiclass settings.
  • Deriving the Gradient for Optimization: Computing gradients for parameter updates in softmax regression.
  • Connection Between Softmax and Logistic Regression: Showing that logistic regression is a special case of softmax regression.
  • Inference in Multiclass Classification: Making predictions using the trained softmax regression model.
  • Decision Principles and Expected Risk: Justifying the use of maximum a posteriori (MAP) inference in classification.

In previous chapters, we discussed logistic regression for binary classification, focusing on modeling the probability of binary outcomes and optimizing the cross-entropy loss function. However, many real-world classification problems involve more than two classes. In this chapter, we extend the principles of logistic regression to multiclass classification using softmax regression.

Softmax regression, also known as multinomial logistic regression, generalizes logistic regression by modeling the probability distribution over multiple classes. It employs the softmax function to ensure that the predicted probabilities are positive and sum to one.

Limitations of Linear Regression for Classification

Mean Squared Error (MSE) Issues

Using linear regression with mean squared error for classification tasks is problematic for several reasons:

  • Inappropriate Loss Penalization: MSE does not penalize misclassifications adequately, especially when the predictions are confidently wrong.
  • Prediction Range: Linear regression can produce predictions outside the [0, 1] interval, which are not valid probabilities.
  • Symmetric Penalty: MSE applies the same penalty regardless of whether the prediction is overconfident or underconfident.

Comparison with Cross-Entropy Loss

  • Cross-Entropy Loss:
    • Penalizes confident wrong predictions more heavily.
    • Ensures that the loss goes to infinity as the predicted probability of the incorrect class approaches one.
  • Gradient Behavior:
    • The gradient of the cross-entropy loss with respect to the logits remains bounded, preventing numerical instability during optimization.
    • Through the chain rule, the derivative of the logarithm cancels the saturation of the softmax, so the drastic increase in loss does not translate into an unbounded gradient magnitude. A small numerical comparison of the two losses follows this list.
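
To make the comparison concrete, here is a small NumPy sketch (an illustrative example, not taken from the original text) contrasting the two losses for a confidently wrong binary prediction:

```python
import numpy as np

# True label is 0, but the model assigns probability 0.99 to class 1.
t = 0.0
y = 0.99

mse = (y - t) ** 2                                # squared error: 0.9801 (bounded by 1)
ce = -(t * np.log(y) + (1 - t) * np.log(1 - y))   # cross-entropy: ~4.61

print(f"MSE: {mse:.4f}  Cross-entropy: {ce:.4f}")
# As y -> 1 with t = 0, MSE saturates at 1 while cross-entropy diverges to infinity.
```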

Multiclass Classification

Problem Setup

  • Objective: Assign an input $\mathbf{x}$ to one of $K$ classes, labeled $C_1, C_2, \dots, C_K$.
  • Targets:
    • Represented using one-hot encoding: $\mathbf{t} = [t_1, t_2, \dots, t_K], \quad t_i \in \{0, 1\}, \quad \sum_{i=1}^K t_i = 1$
    • Alternatively, use an index representation where $t \in \{1, 2, \dots, K\}$. (A short NumPy sketch of the two representations follows this list.)
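
As a concrete illustration of the two target representations, here is a minimal NumPy sketch (assuming zero-based class indices, which is the usual convention in code):

```python
import numpy as np

K = 4                             # number of classes
labels = np.array([2, 0, 3])      # index representation (zero-based in code)

# One-hot encoding: each row contains a single 1 at the position of the true class.
T = np.eye(K)[labels]
print(T)
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]]
```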

Issues with Linear Regression

  • Ordinal Encoding Problems: Assigning numerical values to classes (e.g., 1, 2, 3) introduces artificial ordering and distance relationships that do not exist between categories.
  • Mean Squared Error Limitations: The same issues as in binary classification arise, and they are exacerbated by the presence of multiple classes.

Softmax Function

Definition

The softmax function converts raw scores (logits) from a linear model into probabilities that sum to one.

  • Logits (Scores): $z_i = \mathbf{w}_i^\top \mathbf{x} + b_i, \quad i = 1, 2, \dots, K$
  • Softmax Function (see the sketch after this list): $y_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$
  • Properties:
    • $y_i > 0$ for all $i$.
    • $\sum_{i=1}^K y_i = 1$.
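
The following is a minimal NumPy sketch of the softmax function for a single vector of logits (the logit values are illustrative):

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits z into probabilities that sum to one."""
    exp_z = np.exp(z)           # exponential transformation: all outputs positive
    return exp_z / exp_z.sum()  # normalization: outputs sum to one

z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
print(y, y.sum())               # approx. [0.659 0.242 0.099] 1.0
```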

Interpretation

  • Probabilities: Each $y_i$ represents the model's estimated probability that the input $\mathbf{x}$ belongs to class $C_i$.
  • Exponential Transformation: Ensures all outputs are positive.
  • Normalization: Dividing by the sum ensures the outputs sum to one.

Connection to Logistic Function

  • Binary Classification: The logistic (sigmoid) function is a special case of the softmax function when $K = 2$.
  • Derivation:
    • For $K = 2$, the softmax probabilities reduce to: $y_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}}$, which is the sigmoid function applied to $z_1 - z_2$.

Softmax Regression Model

Model Formulation

  • Scores: $z_i = \mathbf{w}_i^\top \mathbf{x} + b_i$
  • Predicted Probabilities: $y_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$
  • Parameterization:
    • Weights: $\mathbf{w}_i$ for each class $C_i$.
    • Biases: $b_i$ for each class $C_i$.

Decision Rule

  • Predicted Class: $\hat{t} = \underset{i}{\arg\max} \; y_i$
  • Inference: Choose the class with the highest predicted probability.

Training Softmax Regression via Maximum Likelihood Estimation

Training Data

  • Dataset: $D = \{ (\mathbf{x}^{(n)}, \mathbf{t}^{(n)}) \}_{n=1}^N$, where $\mathbf{t}^{(n)}$ is a one-hot encoded vector.

Likelihood Function

  • Assumption: Observations are independent.
  • Likelihood: $L(\Theta) = \prod_{n=1}^N \prod_{i=1}^K [y_i^{(n)}]^{t_i^{(n)}}$, where $\Theta$ represents all model parameters.

Log-Likelihood Function

  • Log-Likelihood: $\ell(\Theta) = \sum_{n=1}^N \sum_{i=1}^K t_i^{(n)} \log y_i^{(n)}$
  • Objective: Maximize $\ell(\Theta)$.

Cross-Entropy Loss

  • Loss Function: $J(\Theta) = -\ell(\Theta) = -\sum_{n=1}^N \sum_{i=1}^K t_i^{(n)} \log y_i^{(n)}$
  • Per-Sample Loss: $J^{(n)} = -\sum_{i=1}^K t_i^{(n)} \log y_i^{(n)} = -\log y_{t^{(n)}}^{(n)}$, where $t^{(n)}$ is the index of the true class. (A short NumPy sketch follows this list.)
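
Here is a minimal NumPy sketch of this loss, summed over samples to match the formula above; `Y` and `T` are assumed to be (N, K) arrays of predicted probabilities and one-hot targets:

```python
import numpy as np

def cross_entropy(Y, T, eps=1e-12):
    """Summed cross-entropy between predicted probabilities Y and one-hot targets T."""
    Y = np.clip(Y, eps, 1.0)        # avoid log(0) for numerical safety
    return -np.sum(T * np.log(Y))   # only the true-class terms contribute
```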

Interpretation

  • The cross-entropy loss measures the difference between the true distribution $\mathbf{t}$ and the predicted distribution $\mathbf{y}$.
  • Penalizes incorrect predictions more heavily when the model is confident but wrong.

Gradient Computation for Optimization

Need for Gradient

  • Purpose: Use gradient-based optimization methods (e.g., gradient descent) to minimize the loss function.
  • Challenge: No closed-form solution for the optimal parameters.

Computing the Gradient

  • Gradient w.r.t. Weights $\mathbf{w}_i$: $\nabla_{\mathbf{w}_i} J = \sum_{n=1}^N (y_i^{(n)} - t_i^{(n)}) \mathbf{x}^{(n)}$
  • Gradient w.r.t. Biases $b_i$: $\frac{\partial J}{\partial b_i} = \sum_{n=1}^N (y_i^{(n)} - t_i^{(n)})$
  • Derivation Highlights:
    • Softmax Derivative: $\frac{\partial y_i}{\partial z_j} = y_i (\delta_{ij} - y_j)$, where $\delta_{ij}$ is the Kronecker delta.
    • Chain Rule: Applied to compute derivatives of composite functions.

Matrix Notation

  • Gradient Compact Form: $\nabla_{\mathbf{W}} J = \mathbf{X}^\top (\mathbf{Y} - \mathbf{T})$ (implemented in the sketch after this list), where:
    • $\mathbf{W}$: Matrix of weights with columns $\mathbf{w}_i$.
    • $\mathbf{X}$: Design matrix (stacked input vectors).
    • $\mathbf{Y}$: Matrix of predicted probabilities.
    • $\mathbf{T}$: Matrix of true labels (one-hot encoded).
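
A minimal NumPy sketch of these formulas, assuming a design matrix `X` of shape (N, D), one-hot targets `T` of shape (N, K), a weight matrix `W` of shape (D, K) whose columns are the $\mathbf{w}_i$, and a bias vector `b` of length K:

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax with the max-subtraction trick for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    exp_Z = np.exp(Z)
    return exp_Z / exp_Z.sum(axis=1, keepdims=True)

def gradients(X, T, W, b):
    """Gradients of the summed cross-entropy loss w.r.t. W and b."""
    Y = softmax_rows(X @ W + b)    # predicted probabilities, shape (N, K)
    grad_W = X.T @ (Y - T)         # matches nabla_W J = X^T (Y - T)
    grad_b = (Y - T).sum(axis=0)   # column sums give the bias gradients
    return grad_W, grad_b
```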

Optimization Methods

  • Gradient Descent Variants:
    • Batch Gradient Descent.
    • Stochastic Gradient Descent (SGD).
    • Mini-Batch Gradient Descent.
  • Regularization: Add penalty terms to prevent overfitting (e.g., L2 regularization). A minimal training-loop sketch follows this list.
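
Below is a minimal batch gradient descent loop, reusing `softmax_rows` and `gradients` from the previous sketch; the learning rate, epoch count, and weight decay are illustrative choices, not prescribed values:

```python
import numpy as np

def train(X, T, lr=0.1, epochs=200, weight_decay=0.0):
    """Batch gradient descent for softmax regression (illustrative hyperparameters)."""
    N, D = X.shape
    K = T.shape[1]
    W = np.zeros((D, K))
    b = np.zeros(K)
    for _ in range(epochs):
        grad_W, grad_b = gradients(X, T, W, b)
        grad_W += weight_decay * W      # optional L2 penalty on the weights
        W -= (lr / N) * grad_W          # dividing by N averages the summed gradient
        b -= (lr / N) * grad_b
    return W, b
```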

Connection Between Softmax and Logistic Regression

Logistic Regression as a Special Case

  • Binary Classification: When $K = 2$, softmax regression reduces to logistic regression.
  • Derivation:
    • Softmax Probabilities: $y_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}, \quad y_2 = \frac{e^{z_2}}{e^{z_1} + e^{z_2}}$
    • Simplify $y_1$: $y_1 = \frac{1}{1 + e^{-(z_1 - z_2)}}$, which is the sigmoid function applied to $z_1 - z_2$.
  • Parameter Equivalence:
    • Let $\mathbf{w} = \mathbf{w}_1 - \mathbf{w}_2$.
    • Let $b = b_1 - b_2$. (A quick numerical check follows this list.)
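
A quick numerical check (with arbitrary illustrative scores) that the two-class softmax output equals the sigmoid of the score difference:

```python
import numpy as np

z1, z2 = 1.3, -0.4                                   # arbitrary two-class scores

softmax_y1 = np.exp(z1) / (np.exp(z1) + np.exp(z2))  # two-class softmax, class 1
sigmoid_diff = 1.0 / (1.0 + np.exp(-(z1 - z2)))      # sigmoid applied to z1 - z2

print(softmax_y1, sigmoid_diff)                      # both approx. 0.8455
np.testing.assert_allclose(softmax_y1, sigmoid_diff)
```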

Implications

  • The softmax function generalizes the logistic (sigmoid) function to multiple classes.
  • Understanding this connection helps in grasping the underlying principles of classification models.

Inference in Multiclass Classification

Making Predictions

  • Compute Scores: $z_i = \mathbf{w}_i^\top \mathbf{x} + b_i$
  • Compute Probabilities: $y_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$
  • Predict Class: $\hat{t} = \underset{i}{\arg\max} \; y_i$ (see the sketch after this list)
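
These three steps amount to a short prediction routine; the sketch below reuses `softmax_rows` from the gradient section and assumes the same `W`, `b` parameterization:

```python
def predict(X, W, b):
    """Return the index of the most probable class for each row of X."""
    Y = softmax_rows(X @ W + b)   # scores -> probabilities
    return Y.argmax(axis=1)       # MAP decision: class with the highest probability
```

Since the softmax is monotonic in each score, taking the argmax of the raw scores $z_i$ gives the same prediction without computing the probabilities explicitly.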

Decision Principle

  • Maximum A Posteriori (MAP) Inference: Choose the class with the highest posterior probability given the input.
  • Justification:
    • Minimizes the expected error under the zero-one loss function.
    • Equivalent to maximizing accuracy.

Expected Risk Minimization

  • Expected Error for a Given $\mathbf{x}$: $\mathbb{E}_{t \sim p(t \mid \mathbf{x})} [\text{Error}] = 1 - p(\hat{t} \mid \mathbf{x})$
  • Minimizing Expected Error:
    • By choosing $\hat{t} = \underset{i}{\arg\max} \; p(t = i \mid \mathbf{x})$, we minimize the expected error for each $\mathbf{x}$.
    • Taking the expectation over $\mathbf{x}$ confirms that this decision rule minimizes the overall expected error. (The derivation is spelled out after this list.)
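
Spelling out the argument behind the formula above, with the error measured by the zero-one loss and the expectation taken over the class posterior:

$$
\mathbb{E}_{t \sim p(t \mid \mathbf{x})}\big[\mathbb{1}[\hat{t} \neq t]\big]
  = \sum_{i=1}^{K} p(t = i \mid \mathbf{x})\,\mathbb{1}[\hat{t} \neq i]
  = 1 - p(\hat{t} \mid \mathbf{x}),
$$

so the expectation is minimized exactly when $p(\hat{t} \mid \mathbf{x})$ is as large as possible, i.e. when $\hat{t} = \underset{i}{\arg\max}\; p(t = i \mid \mathbf{x})$.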

Practical Considerations

Numerical Stability

  • Computing Exponentials: Large values of $z_i$ can cause numerical overflow.
  • Stabilization Technique:
    • Subtract the maximum $z_{\text{max}}$ from all $z_i$: $y_i = \frac{e^{z_i - z_{\text{max}}}}{\sum_{j=1}^K e^{z_j - z_{\text{max}}}}$ (see the sketch after this list)
    • This does not change the probabilities but prevents numerical issues.
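
A minimal sketch of the stabilized computation (the large logit values are illustrative; without the shift, `np.exp` would overflow):

```python
import numpy as np

def stable_softmax(z):
    """Softmax with the max-subtraction trick to avoid overflow."""
    z = z - z.max()              # shifting by a constant does not change the result
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = np.array([1000.0, 1001.0, 1002.0])   # naive np.exp(z) overflows to infinity
print(stable_softmax(z))                 # approx. [0.090 0.245 0.665]
```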

Implementation Tips

  • Cross-Entropy Loss Functions: Use built-in functions in machine learning libraries that handle numerical stability.
  • Avoiding Log of Zero: Clip predicted probabilities to at least a small positive value before taking the logarithm.

Regularization

  • L2 Regularization:
    • Add $\frac{\lambda}{2} \sum_{i=1}^K \| \mathbf{w}_i \|_2^2$ to the loss function (the regularized gradient is written out after this list).
    • Helps prevent overfitting, especially in high-dimensional spaces.
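
In symbols, consistent with the gradient derived earlier (and typically applied to the weights only, not the biases), the regularized objective and its weight gradient become:

$$
J_{\text{reg}}(\Theta) = J(\Theta) + \frac{\lambda}{2} \sum_{i=1}^{K} \|\mathbf{w}_i\|_2^2,
\qquad
\nabla_{\mathbf{w}_i} J_{\text{reg}} = \sum_{n=1}^{N} \big(y_i^{(n)} - t_i^{(n)}\big)\,\mathbf{x}^{(n)} + \lambda\,\mathbf{w}_i
$$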

Conclusion

Softmax regression extends logistic regression to multiclass classification by modeling a probability distribution over multiple classes using the softmax function. The cross-entropy loss function is used to train the model via maximum likelihood estimation. Gradient-based optimization methods are employed to find the optimal parameters.

Understanding softmax regression is crucial for tackling multiclass classification problems and serves as a foundation for more advanced models, such as neural networks and deep learning architectures.

Recap

  • Limitations of Linear Regression for Classification: Mean squared error is not suitable for classification tasks.
  • Softmax Function: Converts logits into probabilities that sum to one.
  • Softmax Regression Model: Generalizes logistic regression to multiple classes.
  • Cross-Entropy Loss: Used for training softmax regression via maximum likelihood estimation.
  • Gradient Computation: Essential for optimization using gradient descent methods.
  • Connection with Logistic Regression: Logistic regression is a special case of softmax regression when $K = 2$.
  • Inference: Use the maximum posterior probability to make predictions.
  • Decision Principle: MAP inference minimizes expected error under the zero-one loss function.