Introduction to Logistic Regression
Logistic regression is a type of regression that returns the probability of occurrence of an event by fitting the data to a mathematical function called the logistic function (the inverse of the ‘logit function’). It is essentially a classification algorithm, used mostly when the dependent variable is categorical; the independent variables can be discrete or continuous.
Generalized Linear Models
Before starting with the equation for logistic regression, let us first understand the parent class of algorithms called Generalized Linear Models (GLM or GLIM). GLM is a larger class of models in which the response variable y_i is assumed to follow an exponential family distribution with mean μ_i. The abbreviation GLIM became popular because GLIM was the first widely used software package for fitting these models. A GLM has three components:
- Random component – The probability distribution of the response (dependent) variable. It is also known as the noise model or error model.
- Link function – The mathematical function that relates the random and systematic components, i.e. it connects the expected value of the response to the linear combination of the independent variables.
- Systematic component – The independent variables in the model, combined linearly.
The table below summarizes the three components for all of the GLM models:
|Model||Random component||Link function||Systematic component|
|Multinomial response||Multinomial||Generalized Logit||Mixed|
Logistic Regression Equation
The basic equation for the generalized linear model is:
g(E(y)) = α + βx1 + γx2
In the equation,
g(): link function
E(y): expected value of the target variable
x1, x2: independent variables
α, β and γ: coefficients which are to be estimated from the data
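As a rough illustration of the equation above, the sketch below evaluates the right-hand side and then inverts two common link functions to recover E(y). The function names, coefficient values and inputs are made up for this example; they do not come from any fitted model.

```python
import math

def linear_predictor(alpha, beta, gamma, x1, x2):
    """Right-hand side of the GLM equation: alpha + beta*x1 + gamma*x2."""
    return alpha + beta * x1 + gamma * x2

def inverse_identity_link(eta):
    """Identity link, g(mu) = mu, as used in ordinary linear regression."""
    return eta

def inverse_logit_link(eta):
    """Logit link, g(mu) = log(mu / (1 - mu)); its inverse maps eta into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients and inputs, chosen only for illustration.
eta = linear_predictor(alpha=0.5, beta=1.0, gamma=-2.0, x1=1.5, x2=0.5)
mean_identity = inverse_identity_link(eta)  # E(y) under the identity link
mean_logit = inverse_logit_link(eta)        # E(y) under the logit link
print(eta, mean_identity, mean_logit)
```

The same linear predictor gives an unbounded mean under the identity link and a value strictly between 0 and 1 under the logit link, which is why the latter suits a probability.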
Formulating logistic regression equation with the help of a simple example
To understand this further and formulate an equation for the logistic regression, let’s take an example:
Suppose a bank wants to predict the probability that a customer will default on a credit card payment. Here the target variable is dichotomous, i.e. it has two values: the person will either default (1) or not default (0). So, we will use logistic regression. For simplicity, I’ll take only three independent variables:
- age (a) – continuous
- educational qualification (e) – categorical
- last month’s payment status (p) – categorical
We’ll start by modifying the above basic GLM equation:
g(E(y)) = α + β1 (a) + β2 (e) + β3 (p) ---------(1)
In the equation, I’ve used ‘a‘ for age, ‘e’ for educational qualification and ‘p’ for last month’s payment status.
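To make equation 1 concrete, here is a minimal sketch that evaluates its right-hand side for one customer. It assumes each categorical variable is coded as a single 0/1 indicator (for example, 1 for a graduate and 1 for a late payment last month), which is only one possible encoding. The coefficient values and the customer are hypothetical, not estimated from data.

```python
def linear_predictor(age, educated, paid_late, alpha, beta1, beta2, beta3):
    """Right-hand side of equation 1 for a single customer.

    educated and paid_late are assumed to be 0/1 indicators standing in for
    the categorical variables e and p.
    """
    return alpha + beta1 * age + beta2 * educated + beta3 * paid_late

# Hypothetical coefficients, picked by hand for illustration only.
coefficients = {"alpha": -2.0, "beta1": 0.02, "beta2": -0.5, "beta3": 1.5}

# Hypothetical customer: 40 years old, graduate (1), paid late last month (1).
score = linear_predictor(age=40, educated=1, paid_late=1, **coefficients)
print(score)
```

The result is an unbounded score rather than a probability; turning it into a value between 0 and 1 is what the rest of the derivation does.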
As discussed above, in logistic regression we are only concerned with the probability of the outcome for the dependent variable. The link function g() connects the linear equation to the probability of default (success, denoted p from here on, not to be confused with the payment status variable (p) in equation 1) and the probability of not defaulting (failure, q = 1 - p). The resulting function must meet the following criteria:
- p should be greater than or equal to 0
- p should be less than or equal to 1
As the probability must never be negative, we’ll use the exponential form of the linear equation, which is always positive.
g(y) = exp(α + β1 (a) + β2 (e) + β3 (p)) ---------(2)
     = e^(α + β1 (a) + β2 (e) + β3 (p)) ---------(3)
The exponential form takes care of the first condition (p >= 0). To satisfy the second condition (p <= 1), we must divide it by a number greater than itself; adding 1 to the exponential gives such a denominator. I’ll write ‘p’ instead of g(y), as we are trying to find a probability.
p = e^(α + β1 (a) + β2 (e) + β3 (p)) / (e^(α + β1 (a) + β2 (e) + β3 (p)) + 1) ---------(4)
Using equations 1 to 4, we get:
p = e^y / (1 + e^y) ---------(5)
To keep the equation simple, I’ve substituted y = α + β1 (a) + β2 (e) + β3 (p).
Equation 5 is the logistic (sigmoid) function, the inverse of the logit function, and p is the probability of success. The probability of failure is simply 1 - p:
q = 1 - p = 1 - e^y / (1 + e^y) = 1 / (1 + e^y) ---------(6)
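Equations 5 and 6 can be checked numerically with a short sketch like the one below. The function name `logistic` and the test values of y are my own choices. The branch on the sign of y is a common trick to avoid overflow for large inputs; it is algebraically the same as equation 5.

```python
import math

def logistic(y):
    """Equation 5: p = e^y / (1 + e^y), computed without overflow."""
    if y >= 0:
        # Same value as e^y / (1 + e^y), after dividing top and bottom by e^y.
        return 1.0 / (1.0 + math.exp(-y))
    e_y = math.exp(y)
    return e_y / (1.0 + e_y)

results = []
for y in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    p = logistic(y)   # probability of success
    q = 1.0 - p       # probability of failure, equation 6
    results.append((y, p, q))
    print(y, p, q)
```

For every input, p stays within the range 0 to 1, so both criteria listed earlier are met.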
Dividing equation 5 by equation 6, we get:
p / (1 - p) = e^y ---------(7)
Now, to remove the exponential, we take the natural log of both sides. As a result, we get:
log(p / (1 - p)) = y ---------(8)
Substituting the value of y, we get:
log(p / (1 - p)) = α + β1 (a) + β2 (e) + β3 (p) ---------(9)
Equation 9 is the actual logistic regression equation. The ratio p / (1 - p) is the odds of success (often loosely called the odds ratio), and its log is the logit, or log-odds. Whenever the log-odds is positive, the probability of success is greater than 50%.
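The relationship between probability, odds and log-odds can be sketched in a few lines. The helper names below are hypothetical and the probabilities are arbitrary example values. The loop also converts each log-odds value back to a probability to show that the two directions are inverses of each other.

```python
import math

def odds(p):
    """Odds of success: p / (1 - p), for 0 < p < 1."""
    return p / (1.0 - p)

def log_odds(p):
    """Left-hand side of equation 9: log(p / (1 - p))."""
    return math.log(odds(p))

def probability_from_log_odds(y):
    """Inverse of log_odds: p = 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

for p in (0.2, 0.5, 0.8):
    y = log_odds(p)
    # The sign of y tells us on which side of 50% the probability lies.
    print(p, odds(p), y, probability_from_log_odds(y))
```

A probability below 0.5 gives negative log-odds, exactly 0.5 gives zero, and anything above 0.5 gives positive log-odds.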
In the next post, I’ll try to explain logistic regression implementation using R or Python.
Edit: I’ve added a post which explains Logistic Regression Implementation in R. You can access it here: https://analyticsdefined.com/implementing-logistic-regression-using-titanic-dataset-r/