
Softmax Regression

info

In this chapter, you'll learn about:

  • Limitations of Linear Regression for Classification: Understanding why mean squared error is not ideal for classification tasks.
  • Extension from Binary to Multiclass Classification: Generalizing logistic regression to handle multiple classes.
  • Softmax Function: Introducing the softmax function to model multiclass probabilities.
  • Cross-Entropy Loss for Multiclass Classification: Formulating the loss function suitable for multiclass settings.
  • Deriving the Gradient for Optimization: Computing gradients for parameter updates in softmax regression.
  • Connection Between Softmax and Logistic Regression: Showing that logistic regression is a special case of softmax regression.
  • Inference in Multiclass Classification: Making predictions using the trained softmax regression model.
  • Decision Principles and Expected Risk: Justifying the use of maximum a posteriori (MAP) inference in classification.

In previous chapters, we discussed logistic regression for binary classification, focusing on modeling the probability of binary outcomes and optimizing the cross-entropy loss function. However, many real-world classification problems involve more than two classes. In this chapter, we extend the principles of logistic regression to multiclass classification using softmax regression.

Softmax regression, also known as multinomial logistic regression, generalizes logistic regression by modeling the probability distribution over multiple classes. It employs the softmax function to ensure that the predicted probabilities are positive and sum to one.

Limitations of Linear Regression for Classification

Mean Squared Error (MSE) Issues

Using linear regression with mean squared error for classification tasks is problematic for several reasons:

  • Inappropriate Loss Penalization: MSE does not penalize misclassifications adequately, especially when the predictions are confidently wrong.
  • Prediction Range: Linear regression can produce predictions outside the [0, 1] interval, which are not valid probabilities.
  • Symmetric Penalty: MSE applies the same penalty regardless of whether the prediction is overconfident or underconfident.

Comparison with Cross-Entropy Loss

  • Cross-Entropy Loss:
    • Penalizes confident wrong predictions more heavily.
    • Ensures that the loss goes to infinity as the predicted probability of the incorrect class approaches one.
  • Gradient Behavior:
    • The gradient of the cross-entropy loss with respect to the logits remains bounded, preventing numerical instability during optimization.
    • Through the chain rule, the derivative of the logarithm cancels the saturation of the softmax, so the drastic increase in loss does not translate into an unbounded gradient magnitude. A small numerical comparison of the two losses follows this list.
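
To make the comparison concrete, here is a small NumPy sketch (an illustrative example, not taken from the original text) contrasting the two losses for a confidently wrong binary prediction:

```python
import numpy as np

# True label is 0, but the model assigns probability 0.99 to class 1.
t = 0.0
y = 0.99

mse = (y - t) ** 2                                # squared error: 0.9801 (bounded by 1)
ce = -(t * np.log(y) + (1 - t) * np.log(1 - y))   # cross-entropy: ~4.61

print(f"MSE: {mse:.4f}  Cross-entropy: {ce:.4f}")
# As y -> 1 with t = 0, MSE saturates at 1 while cross-entropy diverges to infinity.
```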

Multiclass Classification

Problem Setup

  • Objective: Assign an input $\mathbf{x}$ to one of $K$ classes, labeled $C_1, C_2, \dots, C_K$.
  • Targets:
    • Represented using one-hot encoding: $\mathbf{t} = [t_1, t_2, \dots, t_K], \quad t_i \in \{0, 1\}, \quad \sum_{i=1}^K t_i = 1$
    • Alternatively, use an index representation where $t \in \{1, 2, \dots, K\}$. (A short NumPy sketch of the two representations follows this list.)
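
As a concrete illustration of the two target representations, here is a minimal NumPy sketch (assuming zero-based class indices, which is the usual convention in code):

```python
import numpy as np

K = 4                             # number of classes
labels = np.array([2, 0, 3])      # index representation (zero-based in code)

# One-hot encoding: each row contains a single 1 at the position of the true class.
T = np.eye(K)[labels]
print(T)
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]]
```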

Issues with Linear Regression

  • Ordinal Encoding Problems: Assigning numerical values to classes (e.g., 1, 2, 3) introduces artificial ordering and distance relationships that do not exist between categories.
  • Mean Squared Error Limitations: The same issues as in binary classification arise, and they are exacerbated by the presence of multiple classes.

Softmax Function

Definition

The softmax function converts raw scores (logits) from a linear model into probabilities that sum to one.

  • Logits (Scores): $z_i = \mathbf{w}_i^\top \mathbf{x} + b_i, \quad i = 1, 2, \dots, K$
  • Softmax Function (see the sketch after this list): $y_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$
  • Properties:
    • $y_i > 0$ for all $i$.
    • $\sum_{i=1}^K y_i = 1$.
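
The following is a minimal NumPy sketch of the softmax function for a single vector of logits (the logit values are illustrative):

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits z into probabilities that sum to one."""
    exp_z = np.exp(z)           # exponential transformation: all outputs positive
    return exp_z / exp_z.sum()  # normalization: outputs sum to one

z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
print(y, y.sum())               # approx. [0.659 0.242 0.099] 1.0
```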

Interpretation

  • Probabilities: Each $y_i$ represents the model's estimated probability that the input $\mathbf{x}$ belongs to class $C_i$.
  • Exponential Transformation: Ensures all outputs are positive.
  • Normalization: Dividing by the sum ensures the outputs sum to one.

Connection to Logistic Function

  • Binary Classification: The logistic (sigmoid) function is a special case of the softmax function when $K = 2$.
  • Derivation:
    • For $K = 2$, the softmax probabilities reduce to: $y_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}}$, which is the sigmoid function applied to $z_1 - z_2$.

Softmax Regression Model

Model Formulation

  • Scores: $z_i = \mathbf{w}_i^\top \mathbf{x} + b_i$
  • Predicted Probabilities: $y_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$
  • Parameterization:
    • Weights: $\mathbf{w}_i$ for each class $C_i$.
    • Biases: $b_i$ for each class $C_i$.

Decision Rule

  • Predicted Class: $\hat{t} = \underset{i}{\arg\max} \; y_i$
  • Inference: Choose the class with the highest predicted probability.

Training Softmax Regression via Maximum Likelihood Estimation

Training Data

  • Dataset: $D = \{ (\mathbf{x}^{(n)}, \mathbf{t}^{(n)}) \}_{n=1}^N$, where $\mathbf{t}^{(n)}$ is a one-hot encoded vector.

Likelihood Function

  • Assumption: Observations are independent.
  • Likelihood: $L(\Theta) = \prod_{n=1}^N \prod_{i=1}^K [y_i^{(n)}]^{t_i^{(n)}}$, where $\Theta$ represents all model parameters.

Log-Likelihood Function

  • Log-Likelihood: $\ell(\Theta) = \sum_{n=1}^N \sum_{i=1}^K t_i^{(n)} \log y_i^{(n)}$
  • Objective: Maximize $\ell(\Theta)$.

Cross-Entropy Loss

  • Loss Function: $J(\Theta) = -\ell(\Theta) = -\sum_{n=1}^N \sum_{i=1}^K t_i^{(n)} \log y_i^{(n)}$
  • Per-Sample Loss: $J^{(n)} = -\sum_{i=1}^K t_i^{(n)} \log y_i^{(n)} = -\log y_{t^{(n)}}^{(n)}$, where $t^{(n)}$ is the index of the true class. (A short NumPy sketch follows this list.)
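
Here is a minimal NumPy sketch of this loss, summed over samples to match the formula above; `Y` and `T` are assumed to be (N, K) arrays of predicted probabilities and one-hot targets:

```python
import numpy as np

def cross_entropy(Y, T, eps=1e-12):
    """Summed cross-entropy between predicted probabilities Y and one-hot targets T."""
    Y = np.clip(Y, eps, 1.0)        # avoid log(0) for numerical safety
    return -np.sum(T * np.log(Y))   # only the true-class terms contribute
```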

Interpretation

  • The cross-entropy loss measures the difference between the true distribution $\mathbf{t}$ and the predicted distribution $\mathbf{y}$.
  • Penalizes incorrect predictions more heavily when the model is confident but wrong.

Gradient Computation for Optimization

Need for Gradient

  • Purpose: Use gradient-based optimization methods (e.g., gradient descent) to minimize the loss function.
  • Challenge: No closed-form solution for the optimal parameters.

Computing the Gradient

  • Gradient w.r.t. Weights $\mathbf{w}_i$: $\nabla_{\mathbf{w}_i} J = \sum_{n=1}^N (y_i^{(n)} - t_i^{(n)}) \mathbf{x}^{(n)}$
  • Gradient w.r.t. Biases $b_i$: $\frac{\partial J}{\partial b_i} = \sum_{n=1}^N (y_i^{(n)} - t_i^{(n)})$
  • Derivation Highlights:
    • Softmax Derivative: $\frac{\partial y_i}{\partial z_j} = y_i (\delta_{ij} - y_j)$, where $\delta_{ij}$ is the Kronecker delta.
    • Chain Rule: Applied to compute derivatives of composite functions.

Matrix Notation

  • Gradient Compact Form: $\nabla_{\mathbf{W}} J = \mathbf{X}^\top (\mathbf{Y} - \mathbf{T})$ (implemented in the sketch after this list), where:
    • $\mathbf{W}$: Matrix of weights with columns $\mathbf{w}_i$.
    • $\mathbf{X}$: Design matrix (stacked input vectors).
    • $\mathbf{Y}$: Matrix of predicted probabilities.
    • $\mathbf{T}$: Matrix of true labels (one-hot encoded).
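
A minimal NumPy sketch of these formulas, assuming a design matrix `X` of shape (N, D), one-hot targets `T` of shape (N, K), a weight matrix `W` of shape (D, K) whose columns are the $\mathbf{w}_i$, and a bias vector `b` of length K:

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax with the max-subtraction trick for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    exp_Z = np.exp(Z)
    return exp_Z / exp_Z.sum(axis=1, keepdims=True)

def gradients(X, T, W, b):
    """Gradients of the summed cross-entropy loss w.r.t. W and b."""
    Y = softmax_rows(X @ W + b)    # predicted probabilities, shape (N, K)
    grad_W = X.T @ (Y - T)         # matches nabla_W J = X^T (Y - T)
    grad_b = (Y - T).sum(axis=0)   # column sums give the bias gradients
    return grad_W, grad_b
```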

Optimization Methods

  • Gradient Descent Variants:
    • Batch Gradient Descent.
    • Stochastic Gradient Descent (SGD).
    • Mini-Batch Gradient Descent.
  • Regularization: Add penalty terms to prevent overfitting (e.g., L2 regularization). A minimal training-loop sketch follows this list.
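
Below is a minimal batch gradient descent loop, reusing `softmax_rows` and `gradients` from the previous sketch; the learning rate, epoch count, and weight decay are illustrative choices, not prescribed values:

```python
import numpy as np

def train(X, T, lr=0.1, epochs=200, weight_decay=0.0):
    """Batch gradient descent for softmax regression (illustrative hyperparameters)."""
    N, D = X.shape
    K = T.shape[1]
    W = np.zeros((D, K))
    b = np.zeros(K)
    for _ in range(epochs):
        grad_W, grad_b = gradients(X, T, W, b)
        grad_W += weight_decay * W      # optional L2 penalty on the weights
        W -= (lr / N) * grad_W          # dividing by N averages the summed gradient
        b -= (lr / N) * grad_b
    return W, b
```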

Connection Between Softmax and Logistic Regression

Logistic Regression as a Special Case

  • Binary Classification: When $K = 2$, softmax regression reduces to logistic regression.
  • Derivation:
    • Softmax Probabilities: $y_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}, \quad y_2 = \frac{e^{z_2}}{e^{z_1} + e^{z_2}}$
    • Simplify $y_1$: $y_1 = \frac{1}{1 + e^{-(z_1 - z_2)}}$, which is the sigmoid function applied to $z_1 - z_2$.
  • Parameter Equivalence:
    • Let $\mathbf{w} = \mathbf{w}_1 - \mathbf{w}_2$.
    • Let $b = b_1 - b_2$. (A quick numerical check follows this list.)
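
A quick numerical check (with arbitrary illustrative scores) that the two-class softmax output equals the sigmoid of the score difference:

```python
import numpy as np

z1, z2 = 1.3, -0.4                                   # arbitrary two-class scores

softmax_y1 = np.exp(z1) / (np.exp(z1) + np.exp(z2))  # two-class softmax, class 1
sigmoid_diff = 1.0 / (1.0 + np.exp(-(z1 - z2)))      # sigmoid applied to z1 - z2

print(softmax_y1, sigmoid_diff)                      # both approx. 0.8455
np.testing.assert_allclose(softmax_y1, sigmoid_diff)
```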

Implications

  • The softmax function generalizes the logistic (sigmoid) function to multiple classes.
  • Understanding this connection helps in grasping the underlying principles of classification models.

Inference in Multiclass Classification

Making Predictions

  • Compute Scores: $z_i = \mathbf{w}_i^\top \mathbf{x} + b_i$
  • Compute Probabilities: $y_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$
  • Predict Class: $\hat{t} = \underset{i}{\arg\max} \; y_i$ (see the sketch after this list)
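
These three steps amount to a short prediction routine; the sketch below reuses `softmax_rows` from the gradient section and assumes the same `W`, `b` parameterization:

```python
def predict(X, W, b):
    """Return the index of the most probable class for each row of X."""
    Y = softmax_rows(X @ W + b)   # scores -> probabilities
    return Y.argmax(axis=1)       # MAP decision: class with the highest probability
```

Since the softmax is monotonic in each score, taking the argmax of the raw scores $z_i$ gives the same prediction without computing the probabilities explicitly.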

Decision Principle

  • Maximum A Posteriori (MAP) Inference: Choose the class with the highest posterior probability given the input.
  • Justification:
    • Minimizes the expected error under the zero-one loss function.
    • Equivalent to maximizing accuracy.

Expected Risk Minimization

  • Expected Error for a Given $\mathbf{x}$: $\mathbb{E}_{t \sim p(t \mid \mathbf{x})} [\text{Error}] = 1 - p(\hat{t} \mid \mathbf{x})$
  • Minimizing Expected Error:
    • By choosing $\hat{t} = \underset{i}{\arg\max} \; p(t = i \mid \mathbf{x})$, we minimize the expected error for each $\mathbf{x}$.
    • Taking the expectation over $\mathbf{x}$ confirms that this decision rule minimizes the overall expected error. (The derivation is spelled out after this list.)
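
Spelling out the argument behind the formula above, with the error measured by the zero-one loss and the expectation taken over the class posterior:

$$
\mathbb{E}_{t \sim p(t \mid \mathbf{x})}\big[\mathbb{1}[\hat{t} \neq t]\big]
  = \sum_{i=1}^{K} p(t = i \mid \mathbf{x})\,\mathbb{1}[\hat{t} \neq i]
  = 1 - p(\hat{t} \mid \mathbf{x}),
$$

so the expectation is minimized exactly when $p(\hat{t} \mid \mathbf{x})$ is as large as possible, i.e. when $\hat{t} = \underset{i}{\arg\max}\; p(t = i \mid \mathbf{x})$.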

Practical Considerations

Numerical Stability

  • Computing Exponentials: Large values of $z_i$ can cause numerical overflow.
  • Stabilization Technique:
    • Subtract the maximum $z_{\text{max}}$ from all $z_i$: $y_i = \frac{e^{z_i - z_{\text{max}}}}{\sum_{j=1}^K e^{z_j - z_{\text{max}}}}$ (see the sketch after this list)
    • This does not change the probabilities but prevents numerical issues.
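
A minimal sketch of the stabilized computation (the large logit values are illustrative; without the shift, `np.exp` would overflow):

```python
import numpy as np

def stable_softmax(z):
    """Softmax with the max-subtraction trick to avoid overflow."""
    z = z - z.max()              # shifting by a constant does not change the result
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = np.array([1000.0, 1001.0, 1002.0])   # naive np.exp(z) overflows to infinity
print(stable_softmax(z))                 # approx. [0.090 0.245 0.665]
```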

Implementation Tips

  • Cross-Entropy Loss Functions: Use built-in functions in machine learning libraries that handle numerical stability.
  • Avoiding Log of Zero: Clip predicted probabilities to at least a small positive value before taking the logarithm.

Regularization

  • L2 Regularization:
    • Add $\frac{\lambda}{2} \sum_{i=1}^K \| \mathbf{w}_i \|_2^2$ to the loss function (the regularized gradient is written out after this list).
    • Helps prevent overfitting, especially in high-dimensional spaces.
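
In symbols, consistent with the gradient derived earlier (and typically applied to the weights only, not the biases), the regularized objective and its weight gradient become:

$$
J_{\text{reg}}(\Theta) = J(\Theta) + \frac{\lambda}{2} \sum_{i=1}^{K} \|\mathbf{w}_i\|_2^2,
\qquad
\nabla_{\mathbf{w}_i} J_{\text{reg}} = \sum_{n=1}^{N} \big(y_i^{(n)} - t_i^{(n)}\big)\,\mathbf{x}^{(n)} + \lambda\,\mathbf{w}_i
$$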

Conclusion

Softmax regression extends logistic regression to multiclass classification by modeling a probability distribution over multiple classes using the softmax function. The cross-entropy loss function is used to train the model via maximum likelihood estimation. Gradient-based optimization methods are employed to find the optimal parameters.

Understanding softmax regression is crucial for tackling multiclass classification problems and serves as a foundation for more advanced models, such as neural networks and deep learning architectures.

Recap

  • Limitations of Linear Regression for Classification: Mean squared error is not suitable for classification tasks.
  • Softmax Function: Converts logits into probabilities that sum to one.
  • Softmax Regression Model: Generalizes logistic regression to multiple classes.
  • Cross-Entropy Loss: Used for training softmax regression via maximum likelihood estimation.
  • Gradient Computation: Essential for optimization using gradient descent methods.
  • Connection with Logistic Regression: Logistic regression is a special case of softmax regression when $K = 2$.
  • Inference: Use the maximum posterior probability to make predictions.
  • Decision Principle: MAP inference minimizes expected error under the zero-one loss function.