Variance Inflation Factors

# load packages
library(tidyverse)  
library(tidymodels)  
library(knitr)       

rail_trail <- read_csv("data/rail_trail.csv")

Here we explain the connection between the Variance Inflation Factor (VIF) and $C = (X^TX)^{-1}$. This explanation is motivated by Chapter 3 of Montgomery, Peck, and Vining (2021).

Unit length scaling

We have talked about standardizing predictors, such that

$$x_{ij}^{std} = \frac{x_{ij} - \bar{x}_j}{s_{x_j}}$$

where $\bar{x}_j$ is the mean and $s_{x_j}$ is the standard deviation of the predictor $x_j$.

The standardized predictors have a mean of 0 and variance of 1.
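As a quick check of this property, here is a minimal sketch in base R. It uses the built-in `mtcars` data as a stand-in (an assumption made so the example runs without the `rail_trail` file); the variable names `x` and `x_std` are illustrative only.

```r
# Stand-in data: built-in mtcars (assumption; the notes use rail_trail)
x <- mtcars$hp

# Standardize by hand: subtract the mean, divide by the standard deviation
x_std <- (x - mean(x)) / sd(x)

mean(x_std)  # approximately 0, up to floating point error
var(x_std)   # 1, up to floating point error

# Base R's scale() performs the same centering and scaling by default
all.equal(x_std, as.numeric(scale(x)))
```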

Another common type of scaling is unit length scaling. We will denote these scaled predictors as $w_j$. We apply this scaling to both the predictors and the response variable in the following way.

$$w_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{s_{jj}}} \quad \text{where} \quad s_{jj} = \sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$$

The scaled response variable, denoted $y^0$, is

$$y_i^0 = \frac{y_i - \bar{y}}{\sqrt{SST}}$$

where $SST$ is the total sum of squares, $\sum_{i=1}^{n}(y_i - \bar{y})^2$.
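The scaling above can be sketched as a small helper function. This is an illustration only: `unit_length()` is a hypothetical name, not a function from the notes or from any package, and `mtcars` is again a stand-in for the `rail_trail` data.

```r
# Hypothetical helper: center a vector, then divide by the square root of
# its centered sum of squares
unit_length <- function(v) {
  centered <- v - mean(v)
  centered / sqrt(sum(centered^2))
}

x <- mtcars$hp           # stand-in predictor (assumption)
w <- unit_length(x)

sum(w)    # approximately 0: the scaled variable is centered
sum(w^2)  # 1, up to floating point error: the scaled variable has unit length

# Unit length scaling differs from standardizing only by a factor of sqrt(n - 1)
n <- length(x)
x_std <- (x - mean(x)) / sd(x)
all.equal(w, x_std / sqrt(n - 1))
```

The last comparison follows from $s_{x_j} = \sqrt{s_{jj}/(n-1)}$, so the two scalings carry the same information about each variable.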

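We will use the scaled predictors and response variable to show the relationship between $C = (X^TX)^{-1}$ and the formula for VIF. More specifically, we will show that

$$C_{jj} = VIF_j = \frac{1}{1 - R_j^2}$$

when we use the scaled predictors and response, where $R_j^2$ is the $R^2$ from regressing predictor $x_j$ on the remaining predictors.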

Variance inflation factor

We will use the rail_trail data from the notes to illustrate this connection. We will focus on the predictors hightemp, avgtemp, and precip.

We begin by creating unit length scaled versions of the variables.

hightemp_norm <- sum((rail_trail$hightemp - mean(rail_trail$hightemp))^2)
avgtemp_norm <- sum((rail_trail$avgtemp - mean(rail_trail$avgtemp))^2)
precip_norm <- sum((rail_trail$precip - mean(rail_trail$precip))^2)
volume_norm <- sum((rail_trail$volume - mean(rail_trail$volume))^2)

rail_trail <- rail_trail |>
  mutate(hightemp_scaled = (hightemp - mean(hightemp)) / hightemp_norm^.5,
         avgtemp_scaled = (avgtemp - mean(avgtemp)) / avgtemp_norm^.5, 
         precip_scaled = (precip - mean(precip)) / precip_norm^.5,
         volume_scaled = (volume - mean(volume)) / volume_norm^.5
         )

The matrix $W^TW$ is equivalent to the correlation matrix for these predictors.

# use -1 to remove the intercept for the correlation matrix
W <- model.matrix(volume_scaled ~ hightemp_scaled + avgtemp_scaled + precip_scaled - 1, data = rail_trail)

t(W) %*% W
                hightemp_scaled avgtemp_scaled precip_scaled
hightemp_scaled       1.0000000      0.9196439     0.1343172
avgtemp_scaled        0.9196439      1.0000000     0.2725832
precip_scaled         0.1343172      0.2725832     1.0000000
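The equivalence between $W^TW$ and the correlation matrix can be checked directly. The sketch below is a sanity check on stand-in data (the built-in `mtcars`, an assumption), using the hypothetical helper `unit_length()` defined inside the block.

```r
# Hypothetical helper for unit length scaling
unit_length <- function(v) {
  centered <- v - mean(v)
  centered / sqrt(sum(centered^2))
}

# Stand-in predictors (assumption; the notes use hightemp, avgtemp, precip)
X <- as.matrix(mtcars[, c("disp", "hp", "wt")])

# Scale each column to unit length, then form W^T W
W <- apply(X, 2, unit_length)
WtW <- crossprod(W)   # same as t(W) %*% W

# Compare with the usual correlation matrix
all.equal(WtW, cor(X), check.attributes = FALSE)
```

Each entry of $W^TW$ is the centered cross-product of two predictors divided by the product of their centered lengths, which is the definition of the sample correlation.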

When we fit a model using the unit length scaling for the response and predictor variables, we would expect $Var(\hat{\beta}_j)/\hat{\sigma}_\epsilon^2 \approx 1$ if the predictors were uncorrelated. As we see below, however, these values are greater than 1.

trail_model_scaled <- lm(volume_scaled ~ hightemp_scaled + avgtemp_scaled + precip_scaled , data = rail_trail)

beta_se <- tidy(trail_model_scaled)$std.error
sigma <- glance(trail_model_scaled)$sigma

beta_se^2 / sigma^2
[1] 0.01111111 7.16188175 7.59715405 1.19343051

(You can ignore the first element, which corresponds to the intercept.)

These values show how much the variances of the coefficients are inflated due to the correlation between the predictors (the off-diagonal elements of $W^TW$). These inflation amounts are called the variance inflation factors (VIF).

Under this model using the unit-length-scaled predictors and response, we see these variance inflation factors are equal to the diagonal elements of $C = (W^TW)^{-1}$.
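A brief sketch of why this holds, using standard least-squares algebra under the usual assumption of uncorrelated errors with constant variance $\sigma_\epsilon^2$:

```latex
\begin{aligned}
\hat{\beta} &= (W^TW)^{-1}W^Ty^0 \\
Var(\hat{\beta}) &= \sigma_\epsilon^2\,(W^TW)^{-1} = \sigma_\epsilon^2\,C \\
\frac{Var(\hat{\beta}_j)}{\sigma_\epsilon^2} &= C_{jj}
\end{aligned}
```

The fitted model also includes an intercept, but because every column of $W$ is centered, the intercept column is orthogonal to the predictors and does not change the variances of the slope estimates.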

C <- solve(t(W) %*% W)
diag(C)
hightemp_scaled  avgtemp_scaled   precip_scaled 
       7.161882        7.597154        1.193431 
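The same comparison can be reproduced end to end in base R. The sketch below is an illustration on stand-in data (the built-in `mtcars`, an assumption), and `unit_length()` is a hypothetical helper. It uses `vcov()` and `sigma()` from the stats package in place of `tidy()` and `glance()`.

```r
# Hypothetical helper for unit length scaling
unit_length <- function(v) {
  centered <- v - mean(v)
  centered / sqrt(sum(centered^2))
}

# Scale the stand-in response (mpg) and predictors (disp, hp, wt)
scaled <- as.data.frame(lapply(mtcars[, c("mpg", "disp", "hp", "wt")], unit_length))

fit <- lm(mpg ~ disp + hp + wt, data = scaled)

# Estimated coefficient variances divided by the estimated error variance
ratio <- diag(vcov(fit)) / sigma(fit)^2
ratio[-1]   # drop the intercept

# Diagonal of C = (W^T W)^{-1}
W <- as.matrix(scaled[, c("disp", "hp", "wt")])
C <- solve(crossprod(W))
all.equal(ratio[-1], diag(C), check.attributes = FALSE)
```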

Thus,

$$C_{jj} = VIF_j = \frac{1}{1 - R_j^2}$$
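The right-hand side of this identity can also be computed directly from auxiliary regressions. The following is a minimal sketch on stand-in data (the built-in `mtcars`, an assumption). It regresses each predictor on the others to get $R_j^2$ and compares $1/(1 - R_j^2)$ with the diagonal of the inverse correlation matrix, which equals $(W^TW)^{-1}$ under unit length scaling.

```r
# Stand-in predictors (assumption; the notes use hightemp, avgtemp, precip)
X <- mtcars[, c("disp", "hp", "wt")]

# VIF from the definition: regress predictor j on the remaining predictors
vif_from_r2 <- sapply(names(X), function(j) {
  others <- setdiff(names(X), j)
  aux_fit <- lm(reformulate(others, response = j), data = X)
  1 / (1 - summary(aux_fit)$r.squared)
})

# VIF from the diagonal of the inverse correlation matrix
vif_from_C <- diag(solve(cor(X)))

all.equal(vif_from_r2, vif_from_C, check.attributes = FALSE)
```

Because $R^2$ is unchanged by centering and rescaling the variables, the auxiliary regressions give the same $R_j^2$ whether they are run on the original or the scaled predictors.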

References

Montgomery, Douglas C, Elizabeth A Peck, and G Geoffrey Vining. 2021. Introduction to Linear Regression Analysis. John Wiley & Sons.