Logistic regression


Logistic regression is a statistical method used to predict the probability of a binary outcome (0 or 1, yes or no, true or false) based on one or more predictor variables. Despite its name, it's a classification algorithm, not a regression algorithm, because the dependent variable is categorical. It's widely used in fields like Machine learning, Data science, medical diagnosis, credit risk assessment, and marketing. This article will provide a comprehensive introduction to logistic regression, covering its underlying principles, mathematical formulation, interpretation of results, and practical considerations.

Understanding the Need for Logistic Regression

Traditional Linear regression is suitable for predicting continuous outcomes. However, when the outcome variable is binary, linear regression can produce predictions outside the 0-1 range, which are not meaningful as probabilities. Consider trying to predict whether a customer will click on an ad (yes/no). A linear regression model might predict a value of -0.2 or 1.5, which cannot be interpreted as a probability.

Logistic regression addresses this limitation by applying a sigmoid function (also known as the logistic function) to the linear combination of predictor variables. This function maps any real-valued number to a value between 0 and 1, representing the probability of the outcome being 1.

The Sigmoid Function

The sigmoid function is the core of logistic regression. It's defined as:

σ(z) = 1 / (1 + e^(-z))

Where:

  • σ(z) is the sigmoid function.
  • z is the linear combination of predictor variables (z = β0 + β1x1 + β2x2 + ... + βnxn).
  • e is the base of the natural logarithm (approximately 2.71828).

The sigmoid function has a characteristic "S" shape. As z approaches positive infinity, σ(z) approaches 1. As z approaches negative infinity, σ(z) approaches 0. When z = 0, σ(z) = 0.5. This means that the sigmoid function provides a smooth transition between 0 and 1, making it ideal for representing probabilities.
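
As a minimal sketch of this behavior, the sigmoid function can be written in a few lines of Python (NumPy is assumed here only for vectorized evaluation):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

# The limiting behavior described above:
print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # ~0.99995, approaching 1
print(sigmoid(-10.0))  # ~0.00005, approaching 0
```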

Mathematical Formulation

Let’s define the following:

  • yi : The binary outcome for observation i (0 or 1).
  • xi1, xi2, ..., xin : The values of the predictor variables for observation i.
  • β0, β1, β2, ..., βn : The regression coefficients (parameters) to be estimated.

The probability of the outcome being 1 for observation i, denoted P(yi = 1 | xi1, xi2, ..., xin), is modeled as:

P(yi = 1 | xi1, xi2, ..., xin) = σ(zi) = 1 / (1 + e^(-(β0 + β1xi1 + β2xi2 + ... + βnxin)))

This equation gives the probability of success (y=1) given the predictor variables. The probability of failure (y=0) is simply:

P(yi = 0 | xi1, xi2, ..., xin) = 1 - P(yi = 1 | xi1, xi2, ..., xin)
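
As a small worked sketch, the following evaluates both probabilities for a single observation; the coefficient values and feature values below are invented purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted coefficients: beta_0 (intercept), beta_1, beta_2
beta = np.array([-1.5, 0.8, 0.3])
# One observation, with a leading 1 for the intercept term
x_i = np.array([1.0, 2.0, 1.0])

z_i = beta @ x_i             # z_i = beta_0 + beta_1*x_i1 + beta_2*x_i2
p_success = sigmoid(z_i)     # P(y_i = 1 | x_i)
p_failure = 1.0 - p_success  # P(y_i = 0 | x_i)
print(p_success, p_failure)
```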

Estimating the Regression Coefficients

The goal is to find the values of the regression coefficients (β0, β1, ..., βn) that best fit the observed data. This is typically done using the Maximum likelihood estimation (MLE) method.

The likelihood function measures how well the model fits the data. The MLE method seeks to find the coefficients that maximize the likelihood function. The log-likelihood function is often used instead of the likelihood function to simplify the calculations. The resulting equations do not have a closed-form solution and require iterative optimization algorithms such as:

  • Gradient Descent
  • Newton-Raphson
  • Iteratively Reweighted Least Squares (IRLS)

Statistical software packages like R, Python (with libraries like scikit-learn), and SPSS automatically handle the estimation process.
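
As a rough illustration of what these packages do under the hood, the sketch below fits the coefficients by plain gradient descent on the average negative log-likelihood. The synthetic data, learning rate, and iteration count are arbitrary choices for demonstration; production implementations use the more sophisticated optimizers listed above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Synthetic data: 200 observations, 2 predictors, known "true" coefficients
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([-0.5, 1.0, -2.0])
y = (rng.random(200) < sigmoid(X @ true_beta)).astype(float)

# Gradient descent on the average negative log-likelihood
beta = np.zeros(3)
lr = 0.1
for _ in range(5000):
    p = sigmoid(X @ beta)
    grad = X.T @ (p - y) / len(y)  # gradient of the average negative log-likelihood
    beta -= lr * grad

print(beta)  # should land near true_beta
```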

Interpreting the Results

Once the regression coefficients are estimated, they need to be interpreted. Unlike in linear regression, the coefficients in logistic regression do not directly represent the change in the outcome variable for a one-unit change in the predictor variable. Instead, they represent the change in the *log-odds* of the outcome being 1 for a one-unit change in the predictor variable.

The *odds* are defined as the ratio of the probability of success to the probability of failure:

Odds = P(y = 1) / P(y = 0) = P(y = 1) / (1 - P(y = 1))

The *log-odds* (also known as the logit) is the natural logarithm of the odds:

Log-odds = ln(Odds) = ln(P(y = 1) / (1 - P(y = 1)))

Therefore, a coefficient of β1 means that a one-unit increase in x1 changes the log-odds of y being 1 by β1.

To interpret the coefficients more intuitively, we can exponentiate them:

exp(β1) represents the *odds ratio*. This is the multiplicative change in the odds of y being 1 for a one-unit increase in x1.

  • If exp(β1) > 1, the odds of y being 1 increase with x1.
  • If exp(β1) < 1, the odds of y being 1 decrease with x1.
  • If exp(β1) = 1, x1 has no effect on the odds of y being 1.

The intercept term (β0) represents the log-odds of y being 1 when all predictor variables are equal to zero.
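
A minimal sketch of this interpretation, using a made-up coefficient value:

```python
import math

beta_1 = 0.7  # hypothetical fitted coefficient for x1
odds_ratio = math.exp(beta_1)
print(odds_ratio)  # ~2.01

# Interpretation: each one-unit increase in x1 multiplies the odds of y = 1
# by about 2.01 (i.e., roughly doubles them), holding other predictors fixed.
```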

Assessing Model Performance

Several metrics can be used to assess the performance of a logistic regression model (a worked sketch follows this list):

  • Confusion Matrix : A table that summarizes the model's predictions and actual outcomes, showing true positives, true negatives, false positives, and false negatives.
  • Accuracy : The proportion of correctly classified observations: (True Positives + True Negatives) / Total Observations.
  • Precision : The proportion of correctly predicted positive cases out of all predicted positive cases: True Positives / (True Positives + False Positives). Important in scenarios like Fraud detection.
  • Recall (Sensitivity) : The proportion of correctly predicted positive cases out of all actual positive cases: True Positives / (True Positives + False Negatives). Crucial in applications like Medical diagnosis.
  • F1-Score : The harmonic mean of precision and recall: 2 * (Precision * Recall) / (Precision + Recall).
  • AUC (Area Under the ROC Curve) : A measure of the model's ability to discriminate between positive and negative cases. An AUC of 0.5 indicates random guessing, while an AUC of 1 indicates perfect discrimination.
  • Log Loss (Cross-Entropy Loss) : Measures the performance of a classification model where the prediction input is a probability value between 0 and 1.
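
The sketch below computes these metrics with scikit-learn; the labels, predicted probabilities, and the 0.5 decision threshold are placeholders for illustration:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score, log_loss)

# Placeholder data: true labels and predicted probabilities from some fitted model
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.6, 0.9, 0.3, 0.2, 0.7])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5

print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))  # uses probabilities, not labels
print("log loss :", log_loss(y_true, y_prob))       # also uses probabilities
```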

Practical Considerations and Assumptions

Several assumptions underlie logistic regression:

  • Linearity in the Logit : The relationship between each predictor variable and the log-odds of the outcome is linear. This can be checked with, for example, the Box-Tidwell test or by plotting each continuous predictor against the estimated logit.
  • Independence of Errors : The errors (residuals) are independent of each other. This can be violated in time series data or clustered data.
  • No Multicollinearity : The predictor variables are not highly correlated with each other. Multicollinearity can make it difficult to interpret the coefficients. Tools like Variance Inflation Factor (VIF) can help detect multicollinearity; a sketch follows this list.
  • Large Sample Size : Logistic regression typically requires a relatively large sample size to ensure stable and reliable estimates. A general rule of thumb is to have at least 10 events (observations in the less frequent outcome class) per predictor variable.
  • Outliers : Outliers can disproportionately influence the estimates. It's important to identify and address outliers appropriately.
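
For the multicollinearity check mentioned in the list above, statsmodels provides a VIF helper; the sketch below uses synthetic predictors in which one column is deliberately a near-copy of another:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)

# Synthetic predictors where x3 is deliberately almost a copy of x1
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + 0.05 * rng.normal(size=100)
X = np.column_stack([np.ones(100), x1, x2, x3])  # include an intercept column

# A VIF well above roughly 5-10 flags a problematic predictor;
# here x1 and x3 should both show very large values
for i in range(1, X.shape[1]):  # skip the intercept itself
    print(f"x{i}: VIF = {variance_inflation_factor(X, i):.2f}")
```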

Extensions of Logistic Regression

Several extensions of logistic regression address limitations or extend its applicability:

  • Multinomial Logistic Regression : Used when the outcome variable has more than two categories (e.g., predicting customer segment). Useful in Market segmentation.
  • Ordinal Logistic Regression : Used when the outcome variable has ordered categories (e.g., rating a product on a scale of 1 to 5).
  • Regularized Logistic Regression (L1 and L2 Regularization) : Used to prevent overfitting, especially when dealing with high-dimensional data. These techniques add a penalty term to the log-likelihood function, discouraging large coefficients (see the sketch after this list). Related to Risk management.
  • Poisson Regression : A closely related generalized linear model, used when the outcome variable represents count data (e.g., the number of events occurring in a given time period). Relevant to Event study.
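
As a sketch of the regularized variant mentioned above, scikit-learn exposes L1 and L2 penalties through its LogisticRegression class; the data here are synthetic, with only two of twenty predictors actually informative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))     # 20 predictors, most of them irrelevant
z = 1.5 * X[:, 0] - 2.0 * X[:, 1]  # only the first two matter
y = (rng.random(300) < 1 / (1 + np.exp(-z))).astype(int)

# The L1 penalty drives irrelevant coefficients toward exactly zero;
# C is the inverse of the regularization strength
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, y)
print(np.round(model.coef_, 2))    # most entries should be (near) zero
```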

Logistic Regression in Trading and Finance

Logistic regression finds numerous applications in the financial markets:

  • Credit Risk Modeling : Predicting the probability of a borrower defaulting on a loan.
  • Fraud Detection : Identifying fraudulent transactions. Leveraging Anomaly detection techniques.
  • Stock Price Prediction : Predicting the probability of a stock price increasing or decreasing. Often combined with Technical indicators like Moving Averages and Relative Strength Index.
  • Algorithmic Trading : Developing automated trading strategies based on predicted probabilities. Using Backtesting to evaluate strategy performance.
  • Sentiment Analysis : Predicting market movements based on news articles and social media sentiment. Related to Behavioral finance.
  • Predicting Market Trends : Identifying the likelihood of an uptrend or downtrend using fundamental and Quantitative analysis.
  • Volatility Forecasting : Using logistic regression to model the probability of high or low volatility periods. Related to Options trading.
  • High-Frequency Trading : Making rapid trading decisions based on short-term probability predictions. Utilizing Order book analysis.
  • Predicting Breakouts : Forecasting the probability of a price breaking through a specific resistance or support level. Uses Chart patterns.
  • Identifying Reversal Points : Determining the likelihood of a trend reversal using indicators like Fibonacci retracements and MACD.
  • Forecasting Earnings Surprises : Predicting the probability of a company exceeding or falling short of earnings expectations.
  • Predicting IPO Success : Assessing the likelihood of a successful initial public offering.
  • Analyzing Economic Indicators : Using logistic regression to model the relationship between economic indicators and market movements. Involves Macroeconomic analysis.
  • Portfolio Optimization : Using predicted probabilities to allocate assets in a portfolio.
  • Currency Exchange Rate Prediction : Forecasting the probability of a currency appreciating or depreciating.
  • Commodity Price Forecasting : Predicting the probability of a commodity price increasing or decreasing.
  • Interest Rate Modeling : Modeling the probability of interest rate changes.
  • Bond Rating Prediction : Assessing the likelihood of a bond being upgraded or downgraded.
  • Default Prediction Models : Creating models to predict the probability of default for corporate bonds.
  • Credit Spread Analysis : Analyzing the relationship between credit spreads and economic factors.
  • Value at Risk (VaR) Calculation : Using logistic regression to estimate the probability of large losses.
  • Stress Testing : Using logistic regression to assess the impact of adverse scenarios on financial institutions.
  • Capital Adequacy Assessment : Using logistic regression to determine the appropriate level of capital reserves for financial institutions.

These applications demonstrate the versatility and power of logistic regression in the financial world. Careful feature selection, model validation, and ongoing monitoring are crucial for successful implementation. Understanding Correlation and Causation is also key.

