Multicollinearity

Prof. Maria Tackett

Feb 27, 2025

Computing set up

# load packages
library(tidyverse)  
library(tidymodels)  
library(knitr)       
library(patchwork)
library(GGally)   # for pairwise plot matrix
library(corrplot) # for correlation matrix

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Multicollinearity

Ideally, the predictors are orthogonal, meaning there is no linear relationship among them
In practice, there is typically some dependence between predictors, but it is often not a major issue for the model
If there is exact linear dependence among (a subset of) the predictors, we cannot find a unique estimate $\hat{\beta}$
If there are near-linear dependencies, we can find $\hat{\beta}$, but there may be other issues with the model
Multicollinearity: near-linear dependence among predictors
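
The difference between exact and near-linear dependence can be sketched with simulated data. This is a minimal illustration, not part of the trail analysis: the variables `x1`, `x2`, `x3_exact`, and `x3_near` are hypothetical, and the noise level is an arbitrary choice.

```r
# Simulated data; all names here are hypothetical and not from the trail data
set.seed(210)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

# Exact linear dependence: x3_exact is a linear combination of x1 and x2
x3_exact  <- 2 * x1 + 3 * x2
fit_exact <- lm(y ~ x1 + x2 + x3_exact)
coef(fit_exact)  # lm() cannot estimate x3_exact and reports NA for it

# Near-linear dependence: the same combination plus a small amount of noise
x3_near  <- 2 * x1 + 3 * x2 + rnorm(n, sd = 0.05)
fit_near <- lm(y ~ x1 + x2 + x3_near)
fit_base <- lm(y ~ x1 + x2)

# Standard error for x1 with and without the near-duplicate predictor
se_base <- summary(fit_base)$coefficients["x1", "Std. Error"]
se_near <- summary(fit_near)$coefficients["x1", "Std. Error"]
c(base = se_base, near = se_near)
```

With exact dependence, `lm()` drops the redundant term. With near-linear dependence, every coefficient is estimated, but the standard error for `x1` is much larger than in the model without `x3_near`.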

Detecting multicollinearity

Common practice uses the threshold $VIF > 10$ as an indication of concerning multicollinearity (some say $VIF > 5$ is worth investigating)
Variables with similar values of VIF are typically the ones correlated with each other
Use the vif() function in the rms R package to calculate VIF

library(rms)  # provides vif()

# fit the model with all three predictors
trail_fit <- lm(volume ~ hightemp + avgtemp + precip, data = rail_trail)

# calculate the VIF for each predictor
vif(trail_fit)

hightemp  avgtemp   precip 
7.161882 7.597154 1.193431
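
To see where these numbers come from, the VIF for predictor $j$ can be computed by hand as $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ is from the regression of predictor $j$ on the other predictors. The sketch below uses simulated data rather than `rail_trail`; the helper `vif_by_hand()` and the variables `x1`, `x2`, and `x3` are hypothetical names introduced only for illustration.

```r
# Simulated data; variable names are hypothetical, not the trail data
set.seed(221)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.4)  # strongly correlated with x1
x3 <- rnorm(n)                 # unrelated to x1 and x2
y  <- 1 + x1 + x2 + x3 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all of the other predictors
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit_j  <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit_j)$r.squared)
  })
}

vif_manual <- vif_by_hand(sim, c("x1", "x2", "x3"))

# Equivalent calculation: diagonal of the inverse correlation matrix
vif_corr <- diag(solve(cor(sim[, c("x1", "x2", "x3")])))

round(vif_manual, 2)
round(vif_corr, 2)
```

In this simulation, `x1` and `x2` should have similar, elevated VIFs because they are correlated with each other, while the VIF for `x3` should be close to 1. The trail output shows the same pattern: `hightemp` and `avgtemp` have similar VIFs near 7, and `precip` is close to 1.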
