Multicollinearity Explained: Dealing with Correlated Variables in Regression Analysis

Sougat Dey
Jul 2, 2024 · 10 min read


Many consider this one of the most frequently asked technical questions in Data Science interviews. The absence of multicollinearity is one of the standard assumptions of multiple linear regression, which makes it a simple yet nuanced concept to understand.


Before we begin, I advise the readers to familiarize themselves with —

  1. Linear Regression using OLS
  2. Linear Algebra basics

Connect with me on LinkedIn & find my projects on Kaggle

Simply put, Multicollinearity is a statistical phenomenon that occurs when two or more independent variables in a multiple regression are highly correlated. In other words, these variables exhibit a strong linear relationship, making it difficult to isolate the individual effects of each variable on the dependent variable.

That leaves us with three questions to answer —

  1. Is multicollinearity explicitly prohibited?
  2. How does multicollinearity affect a multiple regression model?
  3. How do we detect multicollinearity?

Let’s try to answer the first question. So, is multicollinearity explicitly prohibited? The short answer is no, multicollinearity won’t always be an issue. See, when we develop a multiple linear regression model, it primarily serves one of two objectives: inference or prediction. Multicollinearity isn’t a problem if your model’s main goal is to make predictions (we’ll discuss why later in the article). On the contrary, when you’re working on a problem where the model is being used for making inferences, or in other words, when model interpretability is of utmost importance, multicollinearity is a huge issue and can’t be ignored.

Before we dive into the whys and hows, I’d like to discuss the mathematics behind multiple linear regression, so that we understand not only the abstract concept of multicollinearity but also its mathematical underpinnings.

Mathematics behind Multiple Linear Regression: Matrix Notation

In the context of multiple linear regression, the Normal Equation (or the Ordinary Least Squares (OLS) estimator) is used to find the coefficients that minimize the sum of the squared differences between the observed values and the values predicted by the linear model. The final equation for finding the beta coefficients can be derived as —

β̂ = (XᵀX)⁻¹Xᵀy

This equation provides the OLS estimator for the coefficients in multiple linear regression.

Note: If you prefer not to delve into the math behind the equation’s derivation, I suggest skipping this section.

Here’s the equation for multiple linear regression (leaving out the error term for the moment) —

ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ

We can represent the predicted y values for all n observations in a vector format —

ŷ = [ŷ₁, ŷ₂, …, ŷₙ]ᵀ

Now, if we replace each value with the equation above —

ŷᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + … + βₚxᵢₚ,  for i = 1, 2, …, n

We can represent this column vector as the product of a matrix and a column vector, where X is the n × (p + 1) design matrix whose first column is all ones and β = [β₀, β₁, …, βₚ]ᵀ is the column vector of coefficients. From here, we can conclude —

ŷ = Xβ

Earlier, for simplification, we did not consider the error part (residuals) in prediction. Now, we know the residuals are calculated by subtracting ŷ from y. If we represent the error in matrix notation —

e = y − ŷ

We can rewrite it like this —

e = y − Xβ

By performing the matrix subtraction, we get each residual —

eᵢ = yᵢ − (β₀ + β₁xᵢ₁ + … + βₚxᵢₚ)

Now, we know the sum of squared residuals is —

RSS = e₁² + e₂² + … + eₙ²

In matrix notation, the same can be written as —

RSS = eᵀe = (y − Xβ)ᵀ(y − Xβ)

Note: If A and B are two vectors of the same dimension, then AᵀB is a scalar, so AᵀB = BᵀA; this lets us combine the two cross terms when expanding the product above.

Hence, we can write the loss function as —

L(β) = RSS = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ

This is the residual sum of squares (RSS) expressed in terms of y, X, and β.

Now, we take the derivative of the loss function and equate it to zero to find the minimum of the function. This process, known as optimization, allows us to determine the values of the parameters (coefficients in regression) that minimize the error between our model’s predictions and the actual data points.

By setting the derivative to zero, we identify the critical points where the function’s rate of change is zero, which often corresponds to the optimal solution in many machine learning algorithms.

Hence, we can conclude —

∂L/∂β = −2Xᵀy + 2XᵀXβ = 0  ⟹  β̂ = (XᵀX)⁻¹Xᵀy
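As a quick sanity check, here's a minimal NumPy sketch of this closed-form solution on made-up data (all names and numbers are illustrative); it compares the normal-equation estimate with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Made-up predictors and a known "true" relationship with some noise
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 4 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# Normal equation: beta_hat = (X^T X)^(-1) X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print("normal equation:", beta_hat.round(3))

# Same result from NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("np.linalg.lstsq:", beta_lstsq.round(3))
```

In practice, solving the least-squares problem directly (via QR or SVD, as np.linalg.lstsq does) is numerically safer than explicitly inverting XᵀX, which matters once multicollinearity enters the picture.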

How does multicollinearity affect a multiple regression model?

Multicollinearity in a multiple linear regression setting isn't explicitly unacceptable. It poses significant challenges only when the model is used for making inferences; if the model is going to be used just for making predictions, multicollinearity isn't that much of a problem. That said, even in prediction-focused tasks, severe multicollinearity can make the fitted coefficients numerically unstable and hurt predictions on new data whose correlation structure differs from the training data, so it shouldn't be ignored entirely.

Let’s first understand how multicollinearity affects a model when it is used for making inferences, or in other words, when model interpretability is of utmost importance.

Inference

We’ve already discussed the formal definition of multicollinearity which says it occurs when two or more variables are highly correlated. However, there’s a concept of perfect multicollinearity. Let’s understand it first.

Perfect multicollinearity occurs in a regression setting when one predictor can be modeled as an exact linear combination of two or more other predictors. Here’s an example of the same —

Inflation = 2 * GDP_Growth + 3 * Unemployment

Now, if we construct the design matrix X and try to take the inverse of XᵀX, we'd find that it's impossible. Let's understand why. When one predictor is an exact linear combination of the others, the columns of X are linearly dependent, which makes XᵀX a singular matrix: a singular matrix is a square matrix whose determinant is zero, and a matrix with a zero determinant has no inverse.

So when perfect multicollinearity exists, XᵀX is no longer invertible since det(XᵀX) = 0, and the beta coefficients can't be derived from the normal equation. The good news is that perfect multicollinearity is rare in real-world datasets, and even when it does occur, removing one of the offending columns solves the issue permanently.
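Here's a small NumPy sketch of that situation, using made-up numbers for the Inflation = 2 * GDP_Growth + 3 * Unemployment example above (the specific values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# Hypothetical data where Inflation is an exact linear combination of the others
gdp_growth = rng.normal(2.0, 0.5, n)
unemployment = rng.normal(5.0, 1.0, n)
inflation = 2 * gdp_growth + 3 * unemployment   # perfect multicollinearity

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), gdp_growth, unemployment, inflation])
xtx = X.T @ X

print("rank of X:", np.linalg.matrix_rank(X), "of", X.shape[1])  # rank-deficient: 3 of 4
print("det(X^T X):", np.linalg.det(xtx))          # effectively zero relative to its entries
print("condition number:", np.linalg.cond(xtx))   # astronomically large

# np.linalg.inv(xtx) would either raise LinAlgError or return numerically
# meaningless values, so the normal equation has no stable, unique solution.
```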

Now, let’s understand the issues that arise when strong (but not necessarily perfect) multicollinearity exists.

  1. Difficulty in identifying the independent relationship of each predictor with the target/label — We’ve already mentioned the equation for multiple linear regression, which is given by —

y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε

The way to interpret the equation is — when all other predictors are held constant, a one-unit increase in X1 will result in an increase of β1 in the predicted value of the dependent variable (target), and similarly for other predictors.

Now, suppose X1 can be modeled as a linear combination of X2 and X3. Then whenever we increase or decrease X2 or X3, X1 will change. This dependency compromises the independent relationship between the target (label) and the predictors X1, X2, and X3, potentially jeopardizing the reliability of the regression coefficients. And this is how multicollinearity compromises model interpretability. Hence, the model can no longer be used for making inferences.

2. Inflated Standard Error (SE) of the coefficients — Let’s first understand what the standard error of the coefficients is. SE is a measure of the variability of the regression coefficients: it indicates how much the coefficients would vary if different samples were taken from the same population. SE is the square root of the variance of the coefficients, and the variance of the coefficients is given by —

Var(β̂) = σ²(XᵀX)⁻¹

where σ² is the error variance (estimated from the residuals).

Now, the standard error (SE) of the coefficients —

SE(β̂ⱼ) = √( σ² [(XᵀX)⁻¹]ⱼⱼ )

i.e., the square root of the j-th diagonal element of σ²(XᵀX)⁻¹.

Now, in the case of strong multicollinearity, the determinant of XᵀX will be very close to zero. A near-zero determinant means XᵀX is nearly singular, so the elements of its inverse, particularly the diagonal elements, become very large, which inflates the variances (and hence the standard errors) of the estimated coefficients.

3. Unstable Coefficients — When there’s multicollinearity in the system, the estimated coefficients become very sensitive, meaning a slight change in the training data can lead to huge variations in the estimated coefficients. This happens because the standard errors of the coefficients increase, which makes the coefficients more variable, and we’ve already discussed how and why SE increases when there’s multicollinearity in the system. The sketch below illustrates points 2 and 3 on a small simulated example.
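Here's a hedged statsmodels sketch (hypothetical data; the names x1 and x2 are illustrative): with an almost-collinear pair of predictors, the reported standard errors of the coefficients inflate dramatically compared to the uncorrelated case.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                   # roughly uncorrelated with x1
x2_corr = x1 + rng.normal(scale=0.05, size=n)   # almost collinear with x1

def fit_and_report(x2, label):
    # Same true coefficients in both scenarios: y = 3 + 2*x1 - 1*x2 + noise
    y = 3 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=1.0, size=n)
    X = sm.add_constant(np.column_stack([x1, x2]))
    res = sm.OLS(y, X).fit()
    print(label, "-> coefficient std errors:", np.round(res.bse, 3))

fit_and_report(x2_indep, "uncorrelated predictors")       # small, similar SEs
fit_and_report(x2_corr, "highly correlated predictors")   # SEs on x1 and x2 blow up
```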

Prediction

Even in the case of perfect multicollinearity, the model's predictions mostly work just fine (even though the individual coefficients are no longer uniquely determined). Here's why —

Suppose X₁ can be modeled as a linear combination of X₂ and X₃. Then X₁ can be written as —

X₁ = aX₂ + bX₃, for some constants a and b

Hence, the aforementioned equation becomes —

ŷ = β₀ + β₁(aX₂ + bX₃) + β₂X₂ + β₃X₃ + … = β₀ + (aβ₁ + β₂)X₂ + (bβ₁ + β₃)X₃ + …

As we can see, the resulting equation is still a perfectly valid multiple linear regression, just in terms of X₂ and X₃: the information carried by X₁ is absorbed into their coefficients. That's why, when the model is going to be used just for making predictions, multicollinearity isn't that much of a problem.
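Here's a quick scikit-learn sketch of this point on made-up data (scikit-learn's LinearRegression solves the least-squares problem with a pseudo-inverse, so it doesn't fail on the redundant column): the fitted values barely change whether or not the perfectly collinear predictor is included.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300

x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 2 * x2 + 3 * x3                      # perfectly collinear with x2 and x3
y = 5 + 1.5 * x1 + 0.5 * x2 - 1.0 * x3 + rng.normal(scale=0.5, size=n)

X_full = np.column_stack([x1, x2, x3])    # includes the redundant predictor
X_reduced = np.column_stack([x2, x3])     # drops it

pred_full = LinearRegression().fit(X_full, y).predict(X_full)
pred_reduced = LinearRegression().fit(X_reduced, y).predict(X_reduced)

# The fitted values are essentially identical, even though the coefficients
# of the full model are not uniquely identified.
print("max difference in predictions:", float(np.abs(pred_full - pred_reduced).max()))
```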

Now, before we move to the next section where we discuss how multicollinearity can be removed, we should understand the types of multicollinearity present in a dataset —

  1. Structural Multicollinearity: It refers to a type of multicollinearity that is inherent in the model specification or the nature of the variables being used, rather than being a result of the specific data sample. For example, the model may include both a variable and a nonlinear transformation of it (e.g., X and X²), or the dummy variables may be improperly coded (e.g., including all categories alongside the intercept), which is known as the Dummy Variable Trap; see the sketch after this list.
  2. Data-driven Multicollinearity: It occurs when predictor variables in a regression model show a high correlation within a specific dataset, without any inherent or structural relationship between them. This type of multicollinearity is unique to the sample at hand and may not persist in other samples from the same population. It arises from the particular characteristics of the collected data rather than from the model’s structure or variable definitions. Unlike structural multicollinearity, data-driven multicollinearity can potentially be addressed by gathering more or different data.
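Here's a small illustrative sketch of the dummy variable trap from point 1, using a made-up "city" column (the data and names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["A", "B", "C", "A", "B", "C", "A", "B"]})

# Dummy variable trap: one dummy per category, plus an intercept column.
dummies_all = pd.get_dummies(df["city"], dtype=float)
X_trap = np.column_stack([np.ones(len(df)), dummies_all.to_numpy()])
print("rank:", np.linalg.matrix_rank(X_trap), "of", X_trap.shape[1])  # rank-deficient

# Dropping one category restores full rank (no structural multicollinearity).
dummies_ok = pd.get_dummies(df["city"], drop_first=True, dtype=float)
X_ok = np.column_stack([np.ones(len(df)), dummies_ok.to_numpy()])
print("rank:", np.linalg.matrix_rank(X_ok), "of", X_ok.shape[1])      # full rank
```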

How do we detect multicollinearity?

The easiest way to detect multicollinearity is to compute the pairwise correlation of columns and visualize it with a Seaborn Heatmap.

[Correlation heatmap of the dataset we previously worked on to explain perfect multicollinearity]

Off-diagonal elements with high absolute values (e.g., greater than 0.8 or 0.9, depending on the specific application and the level of concern about multicollinearity) indicate that the corresponding predictor variables are highly correlated and may be causing multicollinearity issues in the regression model.
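Here's a short illustrative snippet (reusing hypothetical numbers in the spirit of the Inflation example above) that computes the pairwise correlation matrix and plots it with Seaborn:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 100

# Hypothetical dataset echoing the perfect-multicollinearity example
df = pd.DataFrame({
    "gdp_growth": rng.normal(2.0, 0.5, n),
    "unemployment": rng.normal(5.0, 1.0, n),
})
df["inflation"] = 2 * df["gdp_growth"] + 3 * df["unemployment"]

# Pairwise correlation matrix of the predictors, visualized as a heatmap
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pairwise correlation of predictors")
plt.tight_layout()
plt.show()
```

One caveat: pairwise correlations can miss multicollinearity that involves three or more predictors at once, which is one reason the VIF described next is a more reliable diagnostic.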

Variance Inflation Factor (VIF)

The Variance Inflation Factor (VIF) helps detect multicollinearity by measuring how much the variance of an estimated regression coefficient is increased due to collinearity with other predictors.

It’s calculated for each predictor by running an auxiliary regression of that predictor on all other predictors and using the R-squared from this regression. The VIF for predictor j is given by —

VIFⱼ = 1 / (1 − Rⱼ²)

VIF values close to 1 indicate minimal multicollinearity for a predictor. Higher VIF values (typically above 5 or 10) suggest significant multicollinearity, potentially making the predictor’s coefficient estimate unreliable. The specific threshold for concern may vary by context.
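Here's a minimal sketch using statsmodels' variance_inflation_factor (the columns x1, x2, x3 and their relationship are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 200

# Hypothetical predictors: x3 is almost a linear combination of x1 and x2
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
X["x3"] = 0.8 * X["x1"] - 0.5 * X["x2"] + rng.normal(scale=0.1, size=n)

# VIFs are usually computed on the design matrix that includes the intercept
X_design = sm.add_constant(X)
vifs = pd.Series(
    [variance_inflation_factor(X_design.values, i) for i in range(X_design.shape[1])],
    index=X_design.columns,
)
print(vifs.round(2))  # x1, x2, x3 come out well above the usual 5-10 threshold
# (the VIF reported for the constant term itself is conventionally ignored)
```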

How to remove multicollinearity?

To address multicollinearity in regression models, several approaches can be taken:

1. Variable selection: We can remove one or more of the highly correlated predictors. This can be done based on theoretical importance or by examining which variable contributes more to the model’s explanatory power.

2. Feature combination: We can create a new variable that combines the information from correlated predictors. This could involve averaging them or using domain knowledge to create a meaningful composite variable.

3. Regularization: We can employ regularization techniques like ridge regression (L2) or LASSO (L1) that add a penalty term to the regression objective, effectively shrinking coefficient estimates and reducing their sensitivity to multicollinearity (see the sketch at the end of this section).

4. Principal Component Analysis (PCA): We can transform the original set of predictors into a new set of uncorrelated variables (principal components) and use these as predictors instead.

5. Model re-specification: Especially for structural multicollinearity, reconsidering the model’s formulation might be necessary. This could involve changing the functional form, reconsidering which variables to include, or centering variables for polynomial terms and interactions.

The choice of method depends on the specific context, the nature of the multicollinearity (structural or data-driven), the goals of the analysis, and the interpretability requirements of the model. It’s often beneficial to try multiple approaches and compare their effects on model performance and stability.
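As one concrete illustration of option 3 (regularization), here's a hedged scikit-learn sketch on made-up data: it refits OLS and ridge regression on bootstrap resamples of a dataset with two nearly collinear predictors, and the ridge coefficients move far less between resamples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 150

# Two highly correlated predictors (hypothetical data)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2])
y = 3 + 1.0 * x1 + 1.0 * x2 + rng.normal(scale=1.0, size=n)

def coefs_over_resamples(model, n_runs=5):
    """Refit the model on bootstrap resamples to see how much the coefficients move."""
    out = []
    for _ in range(n_runs):
        idx = rng.integers(0, n, size=n)
        out.append(model.fit(X[idx], y[idx]).coef_.copy())
    return np.array(out)

print("OLS coefficients across resamples:\n",
      coefs_over_resamples(LinearRegression()).round(2))
print("Ridge (alpha=10) coefficients across resamples:\n",
      coefs_over_resamples(Ridge(alpha=10.0)).round(2))
# The OLS coefficients swing wildly between resamples, while ridge keeps them
# small, similar to each other, and much more stable.
```

The illustrative alpha=10 is not a recommendation; in practice the penalty strength would be tuned, e.g., by cross-validation.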

Thank you for reading! If you spot any typos or conceptual errors, please let me know in the comments below. Your honest feedback is genuinely appreciated. Cheers!

Don’t forget to connect with me on LinkedIn.
