Logistic Regression
In this chapter, you'll learn about:
- Binary Classification with Probabilistic Models: Modeling binary outcomes using probabilities.
- Bernoulli Distribution: Understanding the distribution for binary random variables.
- Logistic Function (Sigmoid Function): Introducing the squashing function to map linear combinations to probabilities.
- Logistic Regression Model: Formulating the logistic regression for binary classification.
- Maximum Likelihood Estimation (MLE): Deriving the loss function for logistic regression.
- Cross-Entropy Loss: Connecting the logistic regression loss to cross-entropy and KL divergence.
- Gradient Computation: Calculating gradients for optimization.
- Convexity and Optimization: Discussing the convex nature of logistic regression and optimization methods.
In previous chapters, we introduced classification problems and explored linear classifiers. We discussed the limitations of using linear regression for classification and the need for models specifically designed for categorical outcomes.
In this chapter, we delve into logistic regression, a fundamental algorithm for binary classification tasks. Logistic regression models the probability that a given input belongs to a particular category, allowing for probabilistic interpretation of predictions. It is widely used due to its simplicity, interpretability, and effectiveness.
Binary Classification and the Bernoulli Distribution
Binary Classification Recap
- Objective: Assign an input to one of two classes, labeled as 0 or 1.
- Examples: Spam detection (spam or not spam), disease diagnosis (disease or healthy).
Bernoulli Distribution
- Definition: A discrete probability distribution for a random variable that has two possible outcomes, 1 (success) and 0 (failure).
- Parameter: $\theta = P(y = 1)$, where $0 \le \theta \le 1$.
- Probability Mass Function: $P(y) = \theta^{y} (1 - \theta)^{1 - y}$ for $y \in \{0, 1\}$.
- Use in Classification: Models the probability that the target variable $y$ equals 1.
Modeling the Probability with Inputs
- Goal: Model $P(y = 1 \mid \mathbf{x})$ as a function of the input features $\mathbf{x}$.
- Linear Combination: Compute a linear combination $z = \mathbf{w}^\top \mathbf{x} + b$.
- Issue: The linear combination $z$ can take any real value, but a probability must lie between 0 and 1.
The Logistic Function (Sigmoid Function)
Need for a Squashing Function
- Purpose: Map the linear combination to a value between 0 and 1.
- Requirements:
- Monotonically increasing.
- Outputs values strictly between 0 and 1.
Logistic (Sigmoid) Function
- Definition: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
- Properties:
- Range: $0 < \sigma(z) < 1$ for all real $z$.
- S-shaped Curve: As $z \to +\infty$, $\sigma(z) \to 1$; as $z \to -\infty$, $\sigma(z) \to 0$.
- Symmetry: $\sigma(-z) = 1 - \sigma(z)$.
- Visualization: The sigmoid traces an S-shaped curve through $\sigma(0) = 0.5$ (a small code sketch follows below).
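As a quick illustration, here is a minimal NumPy sketch of the sigmoid (the function and variable names are ours, not from the text). It uses a numerically stable form so that large negative inputs do not overflow.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function sigma(z) = 1 / (1 + exp(-z)), elementwise."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # safe when z >= 0
    ez = np.exp(z[~pos])                        # e^z is small when z < 0
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid(np.array([-10.0, 0.0, 10.0])))    # approx. [4.5e-05, 0.5, 0.99995]
```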
Alternative Functions
- Probit Function: Based on the cumulative distribution function (CDF) of the normal distribution.
- Definition: $\Phi(z) = \displaystyle\int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt$ (the standard normal CDF).
- Used in: Probit regression.
- Why Logistic Function?
- Mathematical Convenience: The logistic function leads to a convex loss function and simplifies computation.
- Interpretability: Provides a probabilistic interpretation of the output.
Logistic Regression Model
Model Formulation
- Probability of Class 1: $P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \dfrac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$
- Probability of Class 0: $P(y = 0 \mid \mathbf{x}) = 1 - \sigma(\mathbf{w}^\top \mathbf{x} + b)$
- Interpretation: The logistic function maps the linear combination of inputs to a probability.
Decision Rule
- Predicted Class: $\hat{y} = 1$ if $\sigma(\mathbf{w}^\top \mathbf{x} + b) \ge 0.5$, else $\hat{y} = 0$.
- Equivalently: $\hat{y} = 1$ if $\mathbf{w}^\top \mathbf{x} + b \ge 0$, else $\hat{y} = 0$, since $\sigma(0) = 0.5$.
Terminology
- Goodness Score (Logit): $z = \mathbf{w}^\top \mathbf{x} + b$.
- Soft Output: $\sigma(z)$, the predicted probability.
- Hard Output: $\hat{y}$, the predicted class label (see the sketch after this list).
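Putting the pieces together, a minimal prediction sketch, assuming the `sigmoid` helper from earlier; `w` and `b` stand for hypothetical trained parameters.

```python
import numpy as np

def predict_proba(X, w, b):
    """Soft output: predicted probability P(y=1 | x) for each row of X."""
    z = X @ w + b                      # logit (goodness score)
    return sigmoid(z)

def predict(X, w, b, threshold=0.5):
    """Hard output: predicted class label, 1 if the probability >= threshold."""
    return (predict_proba(X, w, b) >= threshold).astype(int)
```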
Naming Convention
- Despite being called "regression," logistic regression is used for classification tasks. The name comes from its origin in statistics: the model fits a linear regression to the log-odds (logit) of the outcome.
Training Logistic Regression via Maximum Likelihood Estimation
Training Data
- Dataset: $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{0, 1\}$.
Likelihood Function
- Assumption: Observations are independent.
- Likelihood: $L(\mathbf{w}, b) = \prod_{i=1}^{N} P(y_i \mid \mathbf{x}_i) = \prod_{i=1}^{N} p_i^{\,y_i} (1 - p_i)^{1 - y_i}$
- Using the Model: $p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$, where $\sigma$ is the logistic function.
Log-Likelihood Function
- Log-Likelihood: $\ell(\mathbf{w}, b) = \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$
- Objective: Maximize $\ell(\mathbf{w}, b)$ with respect to $\mathbf{w}$ and $b$.
Loss Function
- Negative Log-Likelihood (Cross-Entropy Loss): $\mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$
- Purpose: Converts the maximization of the log-likelihood into an equivalent minimization problem (a code sketch follows below).
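A minimal sketch of this loss in NumPy (names are ours); the probabilities are clipped away from 0 and 1 so the logarithms stay finite.

```python
import numpy as np

def cross_entropy_loss(y, p, eps=1e-12):
    """Negative log-likelihood of labels y in {0, 1} under predicted probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)    # avoid log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```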
Interpretation as Cross-Entropy and KL Divergence
- Cross-Entropy: $H(p, q) = -\sum_{k} p(k) \log q(k)$, which measures how well a predicted distribution $q$ matches a true distribution $p$.
- KL Divergence: $D_{\mathrm{KL}}(p \,\|\, q) = \sum_{k} p(k) \log \dfrac{p(k)}{q(k)}$
- Relation: The logistic regression loss is the cross-entropy between the empirical label distribution and the predicted distribution; it differs from $D_{\mathrm{KL}}(p \,\|\, q)$ only by the entropy of $p$, which does not depend on the model parameters, as the expansion below shows.
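To make the relation explicit, expand the cross-entropy $H(p, q)$ between the true distribution $p$ and the predicted distribution $q$:

$$
\begin{aligned}
H(p, q) &= -\sum_{k} p(k) \log q(k) \\
        &= -\sum_{k} p(k) \log p(k) + \sum_{k} p(k) \log \frac{p(k)}{q(k)} \\
        &= H(p) + D_{\mathrm{KL}}(p \,\|\, q).
\end{aligned}
$$

Since the entropy $H(p)$ of the true labels is fixed, minimizing the cross-entropy is equivalent to minimizing the KL divergence.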
Gradient Computation for Optimization
Need for Gradient
- Purpose: Use gradient-based optimization methods (e.g., gradient descent) to minimize the loss function.
- Challenge: The loss function is convex but has no closed-form solution for $\mathbf{w}$ and $b$.
Computing the Gradient
- Gradient w.r.t. Weights $\mathbf{w}$: $\nabla_{\mathbf{w}} \mathcal{L} = \sum_{i=1}^{N} (p_i - y_i)\, \mathbf{x}_i$, where $p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$.
- Gradient w.r.t. Bias $b$: $\dfrac{\partial \mathcal{L}}{\partial b} = \sum_{i=1}^{N} (p_i - y_i)$
- Derivation Highlights:
- Chain Rule: Used to compute derivatives of composite functions.
- Sigmoid Derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$ (derived below).
- Simplification: The gradients reduce to expressions involving the residuals $(p_i - y_i)$.
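For completeness, the sigmoid derivative follows directly from the chain rule:

$$
\frac{d\sigma}{dz} = \frac{d}{dz}\left(1 + e^{-z}\right)^{-1} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} = \sigma(z) \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr).
$$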
Matrix Notation
- Gradient Compact Form: $\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{X}^\top (\mathbf{p} - \mathbf{y})$ (see the sketch after this list),
where:
- $\mathbf{X}$: Design matrix (stacked input vectors, one row per example).
- $\mathbf{p}$: Vector of predicted probabilities, with $p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$.
- $\mathbf{y}$: Vector of true labels.
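A minimal NumPy sketch of the compact gradient, assuming the `sigmoid` helper from earlier; `X`, `y`, `w`, and `b` are hypothetical arrays.

```python
import numpy as np

def gradients(X, y, w, b):
    """Gradient of the cross-entropy loss w.r.t. the weights w and the bias b."""
    p = sigmoid(X @ w + b)        # vector of predicted probabilities
    grad_w = X.T @ (p - y)        # X^T (p - y)
    grad_b = np.sum(p - y)        # sum of residuals
    return grad_w, grad_b
```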
Similarity to Linear Regression
- Linear Regression Gradient: $\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y})$ (for the squared-error loss).
- Observation: The logistic regression gradient has the same form, with the nonlinear predictions $\mathbf{p} = \sigma(\mathbf{X}\mathbf{w} + b)$ in place of the linear predictions $\mathbf{X}\mathbf{w}$.
Optimization Methods
No Closed-Form Solution
- Unlike linear regression, logistic regression has no closed-form solution for $\mathbf{w}$ and $b$.
- Reason: The sigmoid makes the gradient equations nonlinear in the parameters, so setting the gradients to zero cannot be solved analytically.
Gradient-Based Optimization
- Methods:
- Batch Gradient Descent: Updates parameters using the entire dataset.
- Stochastic Gradient Descent (SGD): Updates parameters using one sample at a time.
- Mini-Batch Gradient Descent: Updates parameters using subsets of the data.
- Algorithm:
- Initialize $\mathbf{w}$ and $b$ (e.g., to zeros or small random values).
- Compute the gradients $\nabla_{\mathbf{w}} \mathcal{L}$ and $\partial \mathcal{L} / \partial b$.
- Update parameters: $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla_{\mathbf{w}} \mathcal{L}$ and $b \leftarrow b - \eta\, \dfrac{\partial \mathcal{L}}{\partial b}$, where $\eta$ is the learning rate (a full loop is sketched after this list).
- Repeat until convergence.
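A minimal batch gradient descent loop under these update rules, building on the `gradients` sketch above; the learning rate, iteration count, and the per-sample averaging of the gradient are illustrative choices, not prescriptions.

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Train w, b by batch gradient descent on the cross-entropy loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        grad_w, grad_b = gradients(X, y, w, b)
        w -= lr * grad_w / n_samples   # averaging keeps the step size independent of N
        b -= lr * grad_b / n_samples
    return w, b

# Hypothetical usage on a toy dataset:
# X = np.array([[0.5, 1.2], [1.0, -0.3], [-1.5, 0.8], [-0.7, -1.1]])
# y = np.array([1, 1, 0, 0])
# w, b = fit_logistic_regression(X, y)
```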
Convexity of the Loss Function
- Property: The logistic regression loss function is convex.
- Implication: Any local minimum is the global minimum.
- Benefit: Guarantees that gradient-based methods will converge to the optimal solution (given appropriate learning rate and convergence criteria).
Regularization in Logistic Regression
Need for Regularization
- Purpose: Prevent overfitting by penalizing large weights.
- Approach: Add a regularization term to the loss function.
L2 Regularization (Ridge)
- Regularized Loss Function: $\mathcal{L}_{\text{reg}}(\mathbf{w}, b) = \mathcal{L}(\mathbf{w}, b) + \dfrac{\lambda}{2} \|\mathbf{w}\|_2^2$
- Interpretation: Encourages smaller weights.
L1 Regularization (Lasso)
- Regularized Loss Function: $\mathcal{L}_{\text{reg}}(\mathbf{w}, b) = \mathcal{L}(\mathbf{w}, b) + \lambda \|\mathbf{w}\|_1$
- Interpretation: Encourages sparsity in the weights.
Impact on Gradient
- Modified Gradient:
- For L2: $\nabla_{\mathbf{w}} \mathcal{L}_{\text{reg}} = \nabla_{\mathbf{w}} \mathcal{L} + \lambda \mathbf{w}$.
- For L1: The penalty $\lambda \|\mathbf{w}\|_1$ is not differentiable at zero, so a subgradient (e.g., $\lambda\, \mathrm{sign}(\mathbf{w})$ away from zero) or a proximal method is used (an L2 sketch follows this list).
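A sketch of the L2-modified gradient, building on the `gradients` helper above; leaving the bias unregularized is a common convention and an assumption on our part.

```python
def gradients_l2(X, y, w, b, lam=0.01):
    """Gradient of the cross-entropy loss plus (lam / 2) * ||w||^2."""
    grad_w, grad_b = gradients(X, y, w, b)
    grad_w = grad_w + lam * w    # L2 penalty adds lam * w; the bias b is not penalized
    return grad_w, grad_b
```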
Logistic Regression as a Generalized Linear Model (GLM)
Connection to GLMs
- GLMs: Extend linear models to allow the dependent variable to have a non-normal distribution.
- Components:
- Random Component: Specifies the distribution of the response variable (e.g., Bernoulli).
- Systematic Component: Linear predictor $z = \mathbf{w}^\top \mathbf{x} + b$.
- Link Function: Connects the mean of the distribution to the linear predictor; for logistic regression the link is the logit, $\log \frac{p}{1 - p} = z$, and its inverse is the logistic function.
Canonical Link Function
- Definition: The link function for which the linear predictor equals the distribution's natural parameter; it leads to desirable mathematical properties, such as the simple gradient form above.
- For Logistic Regression: The logit is the canonical link for the Bernoulli distribution; its inverse, the logistic function, maps the linear predictor back to the probability.
Practical Considerations
Feature Scaling
- Importance: Helps in faster convergence of gradient-based methods.
- Methods:
- Standardization (zero mean, unit variance).
- Normalization (scaling features to a specific range, e.g., $[0, 1]$); a standardization sketch follows below.
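A minimal standardization sketch (names are ours); the statistics are computed on the training set and reused for new data so that train and test features are scaled consistently.

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale features to zero mean and unit variance using training-set statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0          # guard against constant features
    return (X_train - mean) / std, (X_test - mean) / std
```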
Choice of Learning Rate
- Trade-off:
- Too Large: May cause the algorithm to diverge.
- Too Small: Slow convergence.
- Adaptive Methods: Algorithms such as Adam and RMSProp adjust the effective learning rate during training.
Handling Imbalanced Data
- Issue: Class imbalance can bias the model toward the majority class.
- Solutions:
- Resampling techniques (oversampling minority class, undersampling majority class).
- Using evaluation metrics suitable for imbalanced data (precision, recall, F1-score).
Evaluation Metrics
- Accuracy: May be misleading with imbalanced data.
- Confusion Matrix: Provides detailed insights.
- ROC Curve and AUC: Evaluate the trade-off between true positive rate and false positive rate.
- Precision-Recall Curve: More informative with imbalanced datasets.
Conclusion
Logistic regression is a powerful and widely used algorithm for binary classification tasks. By modeling the probability of class membership using the logistic function, it provides both a probabilistic framework and a linear decision boundary.
Understanding logistic regression lays the foundation for more advanced classification algorithms and deep learning models. Its principles are fundamental in machine learning and are essential knowledge for any practitioner.
Recap
- Bernoulli Distribution: Used for modeling binary outcomes.
- Logistic Function: Maps linear combinations to probabilities between 0 and 1.
- Logistic Regression Model: Predicts the probability of class membership.
- Maximum Likelihood Estimation: Used to derive the loss function.
- Cross-Entropy Loss: The negative log-likelihood function for logistic regression.
- Gradient Computation: Necessary for optimizing the loss function.
- Convexity: Ensures that gradient-based methods converge to the global minimum.
- Regularization: Prevents overfitting by penalizing large weights.
- GLMs: Logistic regression is a special case of generalized linear models.