Logistic Regression

info

In this chapter, you'll learn about:

  • Binary Classification with Probabilistic Models: Modeling binary outcomes using probabilities.
  • Bernoulli Distribution: Understanding the distribution for binary random variables.
  • Logistic Function (Sigmoid Function): Introducing the squashing function to map linear combinations to probabilities.
  • Logistic Regression Model: Formulating the logistic regression for binary classification.
  • Maximum Likelihood Estimation (MLE): Deriving the loss function for logistic regression.
  • Cross-Entropy Loss: Connecting the logistic regression loss to cross-entropy and KL divergence.
  • Gradient Computation: Calculating gradients for optimization.
  • Convexity and Optimization: Discussing the convex nature of logistic regression and optimization methods.

In previous chapters, we introduced classification problems and explored linear classifiers. We discussed the limitations of using linear regression for classification and the need for models specifically designed for categorical outcomes.

In this chapter, we delve into logistic regression, a fundamental algorithm for binary classification tasks. Logistic regression models the probability that a given input belongs to a particular category, allowing for probabilistic interpretation of predictions. It is widely used due to its simplicity, interpretability, and effectiveness.

Binary Classification and the Bernoulli Distribution

Binary Classification Recap

  • Objective: Assign an input $\mathbf{x}$ to one of two classes, labeled as 0 or 1.
  • Examples: Spam detection (spam or not spam), disease diagnosis (disease or healthy).

Bernoulli Distribution

  • Definition: A discrete probability distribution for a random variable that has two possible outcomes, 1 (success) and 0 (failure).
  • Parameter: $\pi$, where $0 \leq \pi \leq 1$.
  • Probability Mass Function: $p(t) = \begin{cases} \pi & \text{if } t = 1 \\ 1 - \pi & \text{if } t = 0 \end{cases}$
  • Use in Classification: Models the probability that the target variable $t$ equals 1.

Modeling the Probability with Inputs

  • Goal: Model $\pi$ as a function of the input features $\mathbf{x}$.
  • Linear Combination: Compute a linear combination $z = \mathbf{w}^\top \mathbf{x} + b$.
  • Issue: The linear combination $z$ can take any real value, but $\pi$ must lie between 0 and 1.

The Logistic Function (Sigmoid Function)

Need for a Squashing Function

  • Purpose: Map the linear combination zz to a value between 0 and 1.
  • Requirements:
    • Monotonically increasing function.
    • Outputs values strictly between 0 and 1.

Logistic (Sigmoid) Function

  • Definition:

    $\sigma(z) = \frac{1}{1 + e^{-z}}$
  • Properties:

    • Range: $0 < \sigma(z) < 1$ for all real $z$.
    • S-shaped Curve: As $z \rightarrow \infty$, $\sigma(z) \rightarrow 1$; as $z \rightarrow -\infty$, $\sigma(z) \rightarrow 0$.
    • Symmetry: $\sigma(-z) = 1 - \sigma(z)$.
  • Visualization:

    [Figure: the S-shaped curve of the sigmoid function; see the code sketch below.]
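
To make this concrete, here is a minimal NumPy sketch of the sigmoid (the function and variable names are illustrative, not from the chapter); it also checks the symmetry property numerically:

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic (sigmoid) function."""
    # For large |z|, exp(-z) or exp(z) can overflow; handle the two signs separately.
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

z = np.linspace(-10, 10, 5)
print(sigmoid(z))                                 # values strictly between 0 and 1
print(np.allclose(sigmoid(-z), 1 - sigmoid(z)))   # symmetry property -> True
```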

Alternative Functions

  • Probit Function: Based on the cumulative distribution function (CDF) of the normal distribution.
    • Definition: $\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt$
    • Used in: Probit regression.
  • Why Logistic Function?
    • Mathematical Convenience: The logistic function leads to a convex loss function and simplifies computation.
    • Interpretability: Provides a probabilistic interpretation of the output.

Logistic Regression Model

Model Formulation

  • Probability of Class 1: $p(t = 1 \mid \mathbf{x}) = \sigma(z) = \dfrac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$
  • Probability of Class 0: $p(t = 0 \mid \mathbf{x}) = 1 - \sigma(z)$
  • Interpretation: The logistic function maps the linear combination of inputs to a probability.

Decision Rule

  • Predicted Class: $\hat{t} = \begin{cases} 1 & \text{if } \sigma(z) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$
  • Equivalently: $\hat{t} = \begin{cases} 1 & \text{if } \mathbf{w}^\top \mathbf{x} + b \geq 0 \\ 0 & \text{otherwise} \end{cases}$

Terminology

  • Goodness Score (Logit): $z = \mathbf{w}^\top \mathbf{x} + b$.
  • Soft Output: $y = \sigma(z)$, the predicted probability.
  • Hard Output: $\hat{t}$, the predicted class label (see the sketch below).
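
As a quick illustration of these three quantities, the following sketch (with made-up weights and inputs) computes the logit, the soft output, and the hard output for a small batch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy parameters and inputs (illustrative values only).
w = np.array([1.5, -2.0])
b = 0.5
X = np.array([[0.2, 0.1],
              [1.0, 1.2]])

z = X @ w + b                    # goodness score (logit) for each example
y = sigmoid(z)                   # soft output: predicted probability of class 1
t_hat = (y >= 0.5).astype(int)   # hard output: predicted class label
print(z, y, t_hat)
```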

Naming Convention

  • Despite being called "regression," logistic regression is used for classification tasks. The name originates from its historical development in statistics.

Training Logistic Regression via Maximum Likelihood Estimation

Training Data

  • Dataset: $D = \{ (\mathbf{x}^{(i)}, t^{(i)}) \}_{i=1}^N$, where $t^{(i)} \in \{0, 1\}$.

Likelihood Function

  • Assumption: Observations are independent.
  • Likelihood: $L(\mathbf{w}, b) = \prod_{i=1}^N p(t^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w}, b)$
  • Using the Model: $p(t^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w}, b) = [\sigma(z^{(i)})]^{t^{(i)}} [1 - \sigma(z^{(i)})]^{1 - t^{(i)}}$, where $z^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b$.

Log-Likelihood Function

  • Log-Likelihood: $\ell(\mathbf{w}, b) = \sum_{i=1}^N \left[ t^{(i)} \log \sigma(z^{(i)}) + (1 - t^{(i)}) \log \big(1 - \sigma(z^{(i)})\big) \right]$
  • Objective: Maximize $\ell(\mathbf{w}, b)$.

Loss Function

  • Negative Log-Likelihood (Cross-Entropy Loss): $J(\mathbf{w}, b) = -\ell(\mathbf{w}, b) = -\sum_{i=1}^N \left[ t^{(i)} \log \sigma(z^{(i)}) + (1 - t^{(i)}) \log \big(1 - \sigma(z^{(i)})\big) \right]$
  • Purpose: Converts the maximization problem into a minimization problem.
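
A minimal NumPy sketch of this loss (the small clipping constant is an implementation convenience to avoid $\log 0$, not part of the derivation):

```python
import numpy as np

def cross_entropy_loss(y, t, eps=1e-12):
    """Negative log-likelihood of Bernoulli targets t given predicted probabilities y."""
    y = np.clip(y, eps, 1 - eps)   # avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

t = np.array([1, 0, 1])
y = np.array([0.9, 0.2, 0.6])
print(cross_entropy_loss(y, t))
```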

Interpretation as Cross-Entropy and KL Divergence

  • Cross-Entropy Loss: Measures the difference between two probability distributions.
  • KL Divergence: $D_{\text{KL}}(T \parallel Y) = \sum_{i} \left[ t^{(i)} \log \frac{t^{(i)}}{y^{(i)}} + (1 - t^{(i)}) \log \frac{1 - t^{(i)}}{1 - y^{(i)}} \right]$
  • Relation: The logistic regression loss differs from the KL divergence between the target distribution $T$ and the predicted distribution $Y$ only by the entropy of $T$, which does not depend on the parameters; minimizing the cross-entropy loss therefore minimizes the KL divergence.
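
For reference, the decomposition behind this statement (using the convention $0 \log 0 = 0$, so the entropy term vanishes for hard 0/1 labels) is:

$$
-\sum_{i=1}^N \left[ t^{(i)} \log y^{(i)} + (1 - t^{(i)}) \log \big(1 - y^{(i)}\big) \right]
= D_{\text{KL}}(T \parallel Y) \;-\; \sum_{i=1}^N \left[ t^{(i)} \log t^{(i)} + (1 - t^{(i)}) \log \big(1 - t^{(i)}\big) \right]
$$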

Gradient Computation for Optimization

Need for Gradient

  • Purpose: Use gradient-based optimization methods (e.g., gradient descent) to minimize the loss function.
  • Challenge: The loss function is convex but does not have a closed-form solution for $\mathbf{w}$ and $b$.

Computing the Gradient

  • Gradient w.r.t. Weights $\mathbf{w}$: $\nabla_{\mathbf{w}} J = \sum_{i=1}^N (y^{(i)} - t^{(i)}) \mathbf{x}^{(i)}$, where $y^{(i)} = \sigma(z^{(i)})$.
  • Gradient w.r.t. Bias $b$: $\dfrac{\partial J}{\partial b} = \sum_{i=1}^N (y^{(i)} - t^{(i)})$
  • Derivation Highlights:
    • Chain Rule: Used to compute derivatives of composite functions.
    • Sigmoid Derivative: $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$
    • Simplification: The gradients simplify to expressions involving $y^{(i)} - t^{(i)}$.

Matrix Notation

  • Gradient Compact Form: $\nabla_{\mathbf{w}} J = \mathbf{X}^\top (\mathbf{y} - \mathbf{t})$, where:
    • $\mathbf{X}$: Design matrix (stacked input vectors).
    • $\mathbf{y}$: Vector of predicted probabilities.
    • $\mathbf{t}$: Vector of true labels.
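
In code, the compact form is a one-liner; the sketch below (illustrative names, with the bias handled as a separate scalar) computes both gradients:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(X, t, w, b):
    """Gradients of the cross-entropy loss w.r.t. w and b."""
    y = sigmoid(X @ w + b)       # predicted probabilities
    grad_w = X.T @ (y - t)       # matches  X^T (y - t)
    grad_b = np.sum(y - t)
    return grad_w, grad_b

X = np.array([[0.5, 1.0], [1.5, -0.5], [0.0, 2.0]])
t = np.array([1, 0, 1])
print(gradients(X, t, w=np.zeros(2), b=0.0))
```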

Similarity to Linear Regression

  • Linear Regression Gradient: $\nabla_{\mathbf{w}} J_{\text{LR}} = \mathbf{X}^\top (\mathbf{X} \mathbf{w} - \mathbf{t})$
  • Observation: The logistic regression gradient resembles that of linear regression, but with $\mathbf{y}$ in place of $\mathbf{X} \mathbf{w}$.

Optimization Methods

No Closed-Form Solution

  • Unlike linear regression, logistic regression does not have a closed-form solution for $\mathbf{w}$ and $b$.
  • Reason: The sigmoid function introduces nonlinearity.

Gradient-Based Optimization

  • Methods:
    • Batch Gradient Descent: Updates parameters using the entire dataset.
    • Stochastic Gradient Descent (SGD): Updates parameters using one sample at a time.
    • Mini-Batch Gradient Descent: Updates parameters using subsets of the data.
  • Algorithm:
    1. Initialize $\mathbf{w}$ and $b$.
    2. Compute the gradients $\nabla_{\mathbf{w}} J$ and $\dfrac{\partial J}{\partial b}$.
    3. Update the parameters: $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} J$ and $b \leftarrow b - \eta \dfrac{\partial J}{\partial b}$, where $\eta$ is the learning rate.
    4. Repeat until convergence.
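
Putting the pieces together, here is a sketch of batch gradient descent for logistic regression; the learning rate, iteration count, and toy data are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, t, lr=0.1, n_iters=1000):
    """Batch gradient descent on the cross-entropy loss."""
    w = np.zeros(X.shape[1])           # step 1: initialize parameters
    b = 0.0
    for _ in range(n_iters):
        y = sigmoid(X @ w + b)         # predicted probabilities
        grad_w = X.T @ (y - t)         # step 2: compute gradients
        grad_b = np.sum(y - t)
        w -= lr * grad_w               # step 3: update parameters
        b -= lr * grad_b
    return w, b

# Tiny toy problem.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([0, 0, 1, 1])
w, b = fit_logistic_regression(X, t)
print(w, b, sigmoid(X @ w + b).round(2))
```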

Convexity of the Loss Function

  • Property: The logistic regression loss function is convex.
  • Implication: Any local minimum is the global minimum.
  • Benefit: Guarantees that gradient-based methods will converge to the optimal solution (given appropriate learning rate and convergence criteria).

Regularization in Logistic Regression

Need for Regularization

  • Purpose: Prevent overfitting by penalizing large weights.
  • Approach: Add a regularization term to the loss function.

L2 Regularization (Ridge)

  • Regularized Loss Function: $J_{\text{reg}}(\mathbf{w}, b) = J(\mathbf{w}, b) + \dfrac{\lambda}{2} \| \mathbf{w} \|_2^2$
  • Interpretation: Encourages smaller weights.

L1 Regularization (Lasso)

  • Regularized Loss Function: $J_{\text{reg}}(\mathbf{w}, b) = J(\mathbf{w}, b) + \lambda \| \mathbf{w} \|_1$
  • Interpretation: Encourages sparsity in the weights.

Impact on Gradient

  • Modified Gradient:
    • For L2: $\nabla_{\mathbf{w}} J_{\text{reg}} = \nabla_{\mathbf{w}} J + \lambda \mathbf{w}$.
    • For L1: The gradient is less straightforward because $\| \mathbf{w} \|_1$ is not differentiable at zero; away from zero the penalty contributes $\lambda \, \mathrm{sign}(\mathbf{w})$ to the gradient.
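
A sketch of the L2-regularized gradient (here $\lambda$ is a hyperparameter to be tuned, and the bias is left unregularized by convention, which is a modeling choice rather than a requirement):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_regularized_gradients(X, t, w, b, lam=0.1):
    """Gradients of the cross-entropy loss plus (lam/2) * ||w||^2."""
    y = sigmoid(X @ w + b)
    grad_w = X.T @ (y - t) + lam * w   # the L2 penalty adds lam * w
    grad_b = np.sum(y - t)             # bias not penalized here
    return grad_w, grad_b

X = np.array([[0.5, 1.0], [1.5, -0.5]])
t = np.array([1, 0])
print(l2_regularized_gradients(X, t, w=np.zeros(2), b=0.0))
```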

Logistic Regression as a Generalized Linear Model (GLM)

Connection to GLMs

  • GLMs: Extend linear models to allow the dependent variable to have a non-normal distribution.
  • Components:
    • Random Component: Specifies the distribution of the response variable (e.g., Bernoulli).
    • Systematic Component: Linear predictor $\eta = \mathbf{w}^\top \mathbf{x} + b$.
    • Link Function: Connects the mean of the distribution to the linear predictor (for logistic regression, the logit $\eta = \log \frac{\pi}{1 - \pi}$).
  • Canonical Link Function: The link function that arises naturally from the distribution's exponential-family form and leads to desirable mathematical properties.
  • For Logistic Regression: The logit is the canonical link function for the Bernoulli distribution; its inverse, the logistic function, maps the linear predictor back to a probability.

Practical Considerations

Feature Scaling

  • Importance: Helps in faster convergence of gradient-based methods.
  • Methods:
    • Standardization (zero mean, unit variance).
    • Normalization (scaling features to a specific range).
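
For example, a minimal standardization sketch (the statistics must be computed on the training split only and then reused on validation or test data):

```python
import numpy as np

def standardize(X_train, X_test):
    """Zero-mean, unit-variance scaling using training statistics only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-12   # guard against zero-variance features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

X_tr = np.array([[1.0, 10.0], [3.0, 30.0]])
X_te = np.array([[2.0, 20.0]])
print(standardize(X_tr, X_te))
```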

Choice of Learning Rate

  • Trade-off:
    • Too Large: May cause the algorithm to diverge.
    • Too Small: Slow convergence.
  • Adaptive Methods: Algorithms such as Adam and RMSProp adjust the learning rate during training.

Handling Imbalanced Data

  • Issue: Class imbalance can bias the model toward the majority class.
  • Solutions:
    • Resampling techniques (oversampling minority class, undersampling majority class).
    • Using evaluation metrics suitable for imbalanced data (precision, recall, F1-score).
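
A closely related alternative to resampling is to reweight the loss itself; the sketch below uses the common heuristic of weighting each class inversely to its frequency (illustrative names and constants):

```python
import numpy as np

def class_weighted_cross_entropy(y, t, eps=1e-12):
    """Cross-entropy with per-class weights inversely proportional to class frequency."""
    y = np.clip(y, eps, 1 - eps)
    n = len(t)
    w_pos = n / (2 * max(t.sum(), 1))       # weight for class 1
    w_neg = n / (2 * max(n - t.sum(), 1))   # weight for class 0
    weights = np.where(t == 1, w_pos, w_neg)
    return -np.sum(weights * (t * np.log(y) + (1 - t) * np.log(1 - y)))

t = np.array([1, 0, 0, 0])             # imbalanced toy labels
y = np.array([0.7, 0.4, 0.2, 0.1])     # predicted probabilities
print(class_weighted_cross_entropy(y, t))
```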

Evaluation Metrics

  • Accuracy: May be misleading with imbalanced data.
  • Confusion Matrix: Provides detailed insights.
  • ROC Curve and AUC: Evaluate the trade-off between true positive rate and false positive rate.
  • Precision-Recall Curve: More informative with imbalanced datasets.
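
If scikit-learn is available, these metrics can be computed directly; the snippet below is a usage sketch with made-up arrays for the true labels, predicted probabilities, and thresholded predictions:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

t_true = np.array([1, 0, 1, 1, 0])             # ground-truth labels (illustrative)
y_prob = np.array([0.9, 0.3, 0.4, 0.8, 0.2])   # predicted probabilities (soft outputs)
t_pred = (y_prob >= 0.5).astype(int)           # hard predictions at a 0.5 threshold

print(confusion_matrix(t_true, t_pred))
print(precision_score(t_true, t_pred), recall_score(t_true, t_pred), f1_score(t_true, t_pred))
print(roc_auc_score(t_true, y_prob))           # AUC uses the soft outputs, not the threshold
```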

Conclusion

Logistic regression is a powerful and widely used algorithm for binary classification tasks. By modeling the probability of class membership using the logistic function, it provides both a probabilistic framework and a linear decision boundary.

Understanding logistic regression lays the foundation for more advanced classification algorithms and deep learning models. Its principles are fundamental in machine learning and are essential knowledge for any practitioner.

Recap

  • Bernoulli Distribution: Used for modeling binary outcomes.
  • Logistic Function: Maps linear combinations to probabilities between 0 and 1.
  • Logistic Regression Model: Predicts the probability of class membership.
  • Maximum Likelihood Estimation: Used to derive the loss function.
  • Cross-Entropy Loss: The negative log-likelihood function for logistic regression.
  • Gradient Computation: Necessary for optimizing the loss function.
  • Convexity: Ensures that gradient-based methods converge to the global minimum.
  • Regularization: Prevents overfitting by penalizing large weights.
  • GLMs: Logistic regression is a special case of generalized linear models.