Logistic Regression Explained: Maximum Likelihood Estimation (MLE)
Logistic Regression is a classification algorithm in statistical learning, used for tasks like deciding whether an email is spam or not. It can handle both binary and multi-class responses, though it is primarily used for binary classification tasks because of its faster computation compared to tree-based models.
Note: In this article, ln and log are used interchangeably. Both represent the natural logarithm, i.e. log with base e.
Summary
While somewhat more complex than traditional Linear Regression, Logistic Regression follows a similar methodology for determining the best fit. The objective is to define a Cost/Loss Function whose minimization guides us toward the optimal fit. The Cost/Loss Function for Logistic Regression is given below —
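Written out with the notation used at the end of this article (ŷi for the predicted probability, yi for the observed class, m for the number of rows), it is the mean binary cross-entropy:

Cost = -(1/m) · Σ [ yi · ln(ŷi) + (1 - yi) · ln(1 - ŷi) ]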
In this article, I will explain the mathematical intuition behind Logistic Regression and how it uses Maximum Likelihood Estimation (MLE) to find the best fit. I’ll also explain how this Cost/Loss Function is formulated. So, without any further ado, let’s dive straight into it.
Explanation
Suppose we have a dataset with one predictor, physical_score, and one response, test_result. The data is collected from an experiment conducted on 5000 participants to study the effects of overall physical health on hearing loss, specifically the ability to hear high-pitched tones. The response variable test_result has two discrete categories, 1 and 0.
- test_result = 1 — the participant could hear the high-pitched tone.
- test_result = 0 — the participant couldn't hear the high-pitched tone.
Transition from Linear Regression to Logistic Regression
We previously discussed in Linear Regression that we could model the response variable as —
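Y = β0 + β1·X1 + β2·X2 + … + βp·Xp

where X1, …, Xp are the predictors and β0, β1, …, βp are the coefficients.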
As we know, Linear Regression is designed to estimate a continuous response variable. In this case, on the contrary, the response variable is discrete, taking only the values 0 and 1, and that is where probability comes into play.
Concept of Probability
Instead of modeling the response Y directly, Logistic Regression models the probability of Y belonging to a particular category. In our case, we'll model the probability of Y belonging to test_result = 1.
But there's a catch: if we try to model the probability of the response belonging to a particular category directly from the predictor(s), we'll realize that a linear regression model cannot ensure that the predicted probabilities stay between 0 and 1, as a probability must.
If we try to visualize the modeling, we’ll get a graph like this —
Here, we can clearly see that the fitted model can't ensure that Pr will end up between 0 and 1. Hence, we can't use this linear fit directly.
Setting up a cut-off
So, if we manage to squash the Y-axis into the range (0, 1) and set a cut-off at 0.5, we'll be able to separate the responses into 1s and 0s. We can set it as: if the probability of belonging to category 1 is at least 0.5, the response is set to 1; if it is less than 0.5, the response is set to 0.
Although many mathematical functions can do this squashing, Logistic Regression uses the Logistic (or Sigmoid) Function.
Sigmoid/Logistic Function
The Sigmoid Function is a mathematical function that maps any real-valued number to a value between 0 and 1.
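In symbols, for any real-valued input z:

σ(z) = 1 / (1 + e^(-z))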
So, we can model the probability by applying the Sigmoid Function, which squashes all values into the range (0, 1) and solves the earlier issue.
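Applied to our single-predictor model, the probability of test_result = 1 becomes:

p(X) = e^(β0 + β1·X) / (1 + e^(β0 + β1·X)) = 1 / (1 + e^-(β0 + β1·X))

As a minimal sketch (assuming NumPy, with made-up coefficients and scores purely for illustration), here is how the sigmoid keeps every predicted probability inside (0, 1) and how the 0.5 cut-off turns probabilities into classes:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for beta_0 + beta_1 * physical_score (illustration only)
beta_0, beta_1 = -20.0, 0.6
physical_score = np.array([10.0, 25.0, 33.0, 40.0, 50.0])

linear_part = beta_0 + beta_1 * physical_score   # unbounded: can be any real number
probabilities = sigmoid(linear_part)             # squashed into (0, 1)
predictions = (probabilities >= 0.5).astype(int) # the 0.5 cut-off from above

print(probabilities)  # tiny for low scores, close to 1 for high scores
print(predictions)    # 0s and 1s
```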
Odds
In terms of probabilities, odds are the probability of an event occurring divided by the probability of that event not occurring. Suppose the probability of your team winning the game is 0.5 (i.e. 50%), then the probability of your team not winning is also 0.5 (1 - 0.5 = 0.5).
Hence, the odds of your team winning the game are 1. Also, it can be expressed as a ratio of 1:1.
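In general, Odds = p / (1 - p). For our model, the odds of test_result = 1 are therefore p(X) / (1 - p(X)), with p(X) given by the sigmoid expression above.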
If we simplify the equation —
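Substituting the sigmoid form of p(X) and cancelling the common denominator, the odds reduce to:

p(X) / (1 - p(X)) = e^(β0 + β1·X)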
And then, to get rid of the e, we take the natural logarithm on both sides, which gives an expression where a linear function of the features/predictors appears on the right-hand side.
Hence, in simple terms —
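ln(Odds) = ln( p(X) / (1 - p(X)) ) = β0 + β1·X

That is, the log of the odds (often called the logit) is a linear function of the predictors.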
Interpretation of the coefficients
In Linear Regression, we could easily say that, keeping all other predictors and coefficients fixed, a one-unit increase in X1 will increase the response by β1. This is not the case with Logistic Regression, because we've modeled the relationship to log(Odds). As we know, the logarithmic function is non-linear, which means we can't directly interpret the change in the response due to a change in the predictor.
What we can interpret is —
- Increasing X1 by one unit will increase the log(Odds) by β1.
- Increasing X1 by one unit will multiply the Odds by e^β1.
Problem
As we've modeled the relationship to log(Odds), we have to transform the Y-axis accordingly, which expands it to run from -∞ to +∞.
Y-axis Transformation — Probability to log(Odds)
As we can see in the scatterplot shown below, all points lie at either 1 or 0, as test_result is a discrete response variable.
So, we’ll transform the Y-axis from the original plot (figure 2) to log(Odds). According to the calculations —
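For a point sitting at ŷ = 1, ln(Odds) = ln(1/0), which tends to +∞, and for a point at ŷ = 0, ln(Odds) = ln(0/1), which tends to -∞. This is exactly why the transformed Y-axis has to stretch from -∞ to +∞.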
Note: ln(Odds) = 0 when ŷ = 0.5.
Now, if we consider a relatively small set of data points for illustration purposes, the plot looks like this —
Note: The green and red points belong to classes 1 and 0 respectively.
Back to the Original Plot (Figure 2) — log(Odds) to Probability
Now, to plot these points back on the original plot, where the Y-axis represented the probability of the response belonging to class 1, we need to take the log(Odds) value of each point from the graph (Figure 4) and plug it into the equation below to get the probability of the respective point —
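p = e^(log(Odds)) / (1 + e^(log(Odds))) = 1 / (1 + e^(-log(Odds)))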
Now, plotting the probability for each point back to the original graph (Figure 2) —
What does the likelihood mean in this context?
In Logistic Regression, the likelihood function is based on the probabilities of the observed responses. For each observation, the likelihood is the probability of observing the actual response that occurred.
The product of these probabilities is then maximized to find the optimal parameter values, or coefficients, and this is called MLE (Maximum Likelihood Estimation).
Hence, likelihood =
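likelihood = [ product of ŷi over the observations where yi = 1 ] × [ product of (1 - ŷi) over the observations where yi = 0 ]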
In simpler form, likelihood =
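likelihood = ∏ ŷi^(yi) · (1 - ŷi)^(1 - yi), taken over all i

Each factor reduces to ŷi when yi = 1 and to (1 - ŷi) when yi = 0, so both forms are identical.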
Likelihood to Cross-entropy — Product to Summation
Now, we’ll take the log of the likelihood function. Taking the logarithm of the likelihood function is commonly done because it simplifies calculations. Logarithms convert products into sums, which can make computations easier, especially when dealing with large datasets. Additionally, logarithms do not change the location of the maximum point of the function, so maximizing the log-likelihood is equivalent to maximizing the likelihood itself.
Hence, this simplifies optimization procedures.
We know, log(AB) = log(A) + log(B)
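Applying this to the likelihood above turns the product into a sum:

log(likelihood) = Σ [ yi · ln(ŷi) + (1 - yi) · ln(1 - ŷi) ]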
But if we are to take log(likelihood), we need to consider two issues with this approach —
- ln(x) is always negative if x ∈ (0, 1). To overcome this, we take the negative log(likelihood). The summation of the negative log(likelihood) terms is exactly the cross-entropy.
- Since we're taking the negative log(likelihood), the ordering flips: -ln(0.1) > -ln(0.9). So, to maximize the likelihood, we have to minimize the cross-entropy function, using optimization methods like gradient descent, to get the best fit.
Cost/Loss Function
Therefore, the final Cost/Loss function for Logistic Regression —
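Cost = -(1/m) · Σ [ yi · ln(ŷi) + (1 - yi) · ln(1 - ŷi) ]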
- ŷ — Predicted probability of the response belonging to a particular class.
- m — Total number of rows.
- yi — Original class of the response.
Here, the mean of the cross-entropy is taken as the cost function.
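To tie everything together, here is a minimal sketch, assuming NumPy, of minimizing this cost function with plain gradient descent. The data is a synthetic stand-in for the hearing-test example (the generating coefficients, learning rate, and iteration count are made-up illustration values, not results from the real dataset):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, y_hat):
    # Mean of the negative log(likelihood), i.e. the Cost/Loss function above
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

# Synthetic stand-in for the hearing-test data (illustration only)
rng = np.random.default_rng(42)
physical_score = rng.uniform(10, 50, size=500)
test_result = rng.binomial(1, sigmoid(-12.0 + 0.4 * physical_score))  # 1 = heard the tone

# Standardising the predictor keeps gradient descent well behaved
x = (physical_score - physical_score.mean()) / physical_score.std()
y = test_result

beta_0, beta_1 = 0.0, 0.0
learning_rate = 0.1
for _ in range(10_000):
    y_hat = sigmoid(beta_0 + beta_1 * x)              # predicted probabilities
    error = y_hat - y
    beta_0 -= learning_rate * error.mean()            # d(cost)/d(beta_0)
    beta_1 -= learning_rate * (error * x).mean()      # d(cost)/d(beta_1)

print("coefficients:", beta_0, beta_1)
print("final cost:", cross_entropy(y, sigmoid(beta_0 + beta_1 * x)))
```

The gradient used in the loop is the derivative of the mean cross-entropy with respect to each coefficient; the final cost should end up far below its value at the start, which is exactly the "minimize cross-entropy to maximize likelihood" idea described above.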
If you’ve made it this far, I sincerely hope you’ve found this article helpful. Feel free to share your valuable feedback in the comments below.
Thank you. Let’s connect on LinkedIn.