# load packages
library(tidyverse)
library(tidymodels)
library(knitr)
# load data
rail_trail <- read_csv("data/rail_trail.csv")
Variance Inflation Factors
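Here we explain the connection between the Variance Inflation Factor (VIF) and $C_{jj}$, the $j^{th}$ diagonal element of $\mathbf{C} = (\mathbf{W}^T\mathbf{W})^{-1}$, where $\mathbf{W}$ is the matrix of unit-length-scaled predictors.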
Unit length scaling
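We have talked about standardizing predictors, such that $z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$, where $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of predictor $j$.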
The standardized predictors have a mean of 0 and variance of 1.
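As a quick check of the mean and variance claim, here is a minimal sketch in base R. It uses the built-in `mtcars` data as a stand-in (an assumption for illustration; it is not the data from the notes).

```r
# Stand-in data: base R's mtcars, used only for illustration
x <- mtcars$hp

# Standardize: subtract the mean, divide by the sample standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)  # approximately 0, up to floating-point error
var(z)   # 1, up to floating-point error
```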
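Another common type of scaling is unit length scaling. We will denote these scaled predictors as

$$w_{ij} = \frac{x_{ij} - \bar{x}_j}{s_{jj}^{1/2}}, \quad \text{where } s_{jj} = \sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$$

The scaled response variable, denoted $y_i^{0}$, is

$$y_i^{0} = \frac{y_i - \bar{y}}{SST^{1/2}}$$

where $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$.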
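Unit length scaling centers a variable and divides it by the square root of its sum of squared deviations, so the scaled vector has length 1. The sketch below shows this with a hypothetical helper, `unit_length()`, which is not part of the notes, applied to the built-in `mtcars` data as a stand-in.

```r
# Hypothetical helper (not from the notes): center, then divide by the
# square root of the sum of squared deviations
unit_length <- function(x) {
  centered <- x - mean(x)
  centered / sqrt(sum(centered^2))
}

x <- mtcars$hp
w <- unit_length(x)

sum(w)    # approximately 0: the scaled variable is centered
sum(w^2)  # 1, up to floating-point error: the vector has unit length

# Unit length scaling differs from standardizing only by a factor of sqrt(n - 1)
z <- (x - mean(x)) / sd(x)
all.equal(w, z / sqrt(length(x) - 1))
```

The last comparison shows that the two scalings carry the same information; unit length scaling is convenient here because it makes cross products of the scaled columns equal to correlations.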
We will use the scaled predictors and response variable to show the relationship between the VIF and the diagonal elements of $(\mathbf{W}^T\mathbf{W})^{-1}$ when the model is fit using the scaled predictors and response.
Variance inflation factor
We will use the rail_trail data from the notes to illustrate this connection. We will focus on the predictors hightemp, avgtemp, and precip.
We begin by creating unit length scaled versions of the variables.
# sum of squared deviations from the mean for each variable
hightemp_norm <- sum((rail_trail$hightemp - mean(rail_trail$hightemp))^2)
avgtemp_norm <- sum((rail_trail$avgtemp - mean(rail_trail$avgtemp))^2)
precip_norm <- sum((rail_trail$precip - mean(rail_trail$precip))^2)
volume_norm <- sum((rail_trail$volume - mean(rail_trail$volume))^2)

# center each variable and divide by the square root of its norm
rail_trail <- rail_trail |>
  mutate(
    hightemp_scaled = (hightemp - mean(hightemp)) / sqrt(hightemp_norm),
    avgtemp_scaled = (avgtemp - mean(avgtemp)) / sqrt(avgtemp_norm),
    precip_scaled = (precip - mean(precip)) / sqrt(precip_norm),
    volume_scaled = (volume - mean(volume)) / sqrt(volume_norm)
  )
The matrix $\mathbf{W}^T\mathbf{W}$ is the correlation matrix of the predictors.
# use -1 to remove the intercept for the correlation matrix
W <- model.matrix(volume_scaled ~ hightemp_scaled + avgtemp_scaled + precip_scaled - 1, data = rail_trail)
t(W) %*% W
                hightemp_scaled avgtemp_scaled precip_scaled
hightemp_scaled       1.0000000      0.9196439     0.1343172
avgtemp_scaled        0.9196439      1.0000000     0.2725832
precip_scaled         0.1343172      0.2725832     1.0000000
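The cross product of unit-length-scaled columns reproduces the ordinary correlation matrix. The following sketch checks this on the built-in `mtcars` data (a stand-in, not the rail trail data), using a hypothetical helper `unit_length()` that is not part of the notes.

```r
# Hypothetical helper (not from the notes) for unit length scaling
unit_length <- function(x) {
  centered <- x - mean(x)
  centered / sqrt(sum(centered^2))
}

# Stand-in predictors from mtcars, chosen only for illustration
X <- mtcars[, c("hp", "wt", "disp")]

# Scale each column; sapply() returns a numeric matrix
W <- sapply(X, unit_length)

# crossprod(W) computes t(W) %*% W
WtW <- crossprod(W)

# The result matches the correlation matrix of the unscaled predictors
all.equal(WtW, cor(X))
```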
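When we fit a model using the unit length scaling for the response and predictor variables, we would expect $Var(\hat{\beta}_j) = \sigma^2$ for each predictor if the predictors were uncorrelated, because $\mathbf{W}^T\mathbf{W}$ would then be the identity matrix. We can check this by comparing the squared standard errors to $\hat{\sigma}^2$.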
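trail_model_scaled <- lm(volume_scaled ~ hightemp_scaled + avgtemp_scaled + precip_scaled, data = rail_trail)

# coefficient standard errors and the residual standard error
beta_se <- tidy(trail_model_scaled)$std.error
sigma <- glance(trail_model_scaled)$sigma

# ratio of each coefficient's estimated variance to sigma^2
beta_se^2 / sigma^2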
[1] 0.01111111 7.16188175 7.59715405 1.19343051
(You can ignore the first element, which represents the intercept).
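The intercept entry is not a variance inflation factor: because the scaled predictors are centered, the intercept column is orthogonal to them, and its ratio is simply $1/n$ (the value 0.01111111 above is consistent with $n = 90$). The sketch below reproduces the ratio calculation with base R's `summary()` in place of `tidy()` and `glance()`, on the built-in `mtcars` data as a stand-in; `unit_length()` is a hypothetical helper, not part of the notes.

```r
# Hypothetical helper (not from the notes) for unit length scaling
unit_length <- function(x) {
  centered <- x - mean(x)
  centered / sqrt(sum(centered^2))
}

# Stand-in data: scale the response (mpg) and three predictors from mtcars
d <- as.data.frame(lapply(mtcars[, c("mpg", "hp", "wt", "disp")], unit_length))

fit <- lm(mpg ~ hp + wt + disp, data = d)

# Squared standard errors divided by the squared residual standard error
se <- coef(summary(fit))[, "Std. Error"]
sigma <- summary(fit)$sigma
ratio <- se^2 / sigma^2

# Intercept entry equals 1 / n
all.equal(unname(ratio[1]), 1 / nrow(d))

# Remaining entries equal the diagonal of the inverse correlation matrix
all.equal(ratio[-1], diag(solve(cor(mtcars[, c("hp", "wt", "disp")]))))
```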
These values show how much the variances (squared standard errors) of the coefficients are inflated given the correlation between the predictors (the off-diagonal elements of $\mathbf{W}^T\mathbf{W}$).
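Under this model using the unit-length-scaled predictors and response, we see these variance inflation factors are equal to the diagonal elements of $\mathbf{C} = (\mathbf{W}^T\mathbf{W})^{-1}$. This follows because the covariance matrix of the slope estimates is $\sigma^2(\mathbf{W}^T\mathbf{W})^{-1}$, so $Var(\hat{\beta}_j) / \sigma^2 = C_{jj}$.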
C <- solve(t(W) %*% W)
diag(C)
hightemp_scaled avgtemp_scaled precip_scaled
7.161882 7.597154 1.193431
Thus, $VIF_j = C_{jj}$, the $j^{th}$ diagonal element of $(\mathbf{W}^T\mathbf{W})^{-1}$.
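The same quantity can be reached from the usual definition $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors. The sketch below compares the two routes on the built-in `mtcars` data (a stand-in, not the rail trail data); the variable names are chosen for illustration only.

```r
# Stand-in predictors from mtcars, chosen only for illustration
X <- mtcars[, c("hp", "wt", "disp")]

# Route 1: diagonal of the inverse correlation matrix
vif_from_C <- diag(solve(cor(X)))

# Route 2: 1 / (1 - R_j^2), with R_j^2 from regressing predictor j
# on the remaining predictors
vif_from_r2 <- sapply(names(X), function(j) {
  others <- setdiff(names(X), j)
  fit <- lm(reformulate(others, response = j), data = X)
  1 / (1 - summary(fit)$r.squared)
})

# Both routes give the same values
all.equal(vif_from_C, vif_from_r2)
```

Computing the diagonal of the inverse correlation matrix gives every VIF in one step, while the auxiliary regressions make it clear which predictors drive the inflation.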