Mar 04, 2025
Exam corrections (optional) due TODAY at 11:59pm
Team Feedback (email from TEAMMATES) due TODAY at 11:59pm (check email)
Next project milestone: Exploratory data analysis due March 20
DataFest: April 4 - 6 - https://dukestatsci.github.io/datafest/
Multicollinearity
Recap
How to deal with issues of multicollinearity
# A tibble: 5 × 7
  volume hightemp avgtemp season cloudcover precip day_type
   <dbl>    <dbl>   <dbl> <chr>       <dbl>  <dbl> <chr>
1    501       83    66.5 Summer       7.60  0     Weekday
2    419       73    61   Summer       6.30  0.290 Weekday
3    397       74    63   Spring       7.5   0.320 Weekday
4    385       95    78   Summer       2.60  0     Weekend
5    200       44    48   Spring      10     0.140 Weekday
Source: Pioneer Valley Planning Commission via the mosaicData package.
Outcome:
volume: estimated number of trail users that day (number of breaks recorded)
Predictors:
hightemp: daily high temperature (in degrees Fahrenheit)
avgtemp: average of daily low and daily high temperature (in degrees Fahrenheit)
season: one of “Fall”, “Spring”, or “Summer”
precip: measure of precipitation (in inches)
Multicollinearity: near-linear dependence among predictors
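The variance inflation factor (VIF) measures how much the linear dependencies among the predictors inflate the variance of the estimated coefficients:
$$VIF_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the proportion of variation in predictor $x_j$ that is explained by a linear combination of all the other predictors.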
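As a rough illustration of the VIF calculation, the sketch below uses only hightemp and avgtemp from the five preview rows printed above. It relies on two assumptions that are not from the slides: the model contains just these two predictors (so $R_j^2$ reduces to the squared Pearson correlation between them), and the five rows stand in for the full data set, which they do not. The helper names `pearson_r` and `vif_two_predictors` are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF = 1 / (1 - R_j^2); with exactly two predictors, R_j^2 = r^2."""
    r = pearson_r(x, y)
    return 1 / (1 - r ** 2)


# Assumption: only the five preview rows, not the full data set.
hightemp = [83, 73, 74, 95, 44]
avgtemp = [66.5, 61, 63, 78, 48]

vif = vif_two_predictors(hightemp, avgtemp)
print(f"r = {pearson_r(hightemp, avgtemp):.3f}, VIF = {vif:.1f}")
```

On these five rows the VIF comes out at roughly 27, well above the threshold of 10 that is commonly used as a rule of thumb for concerning multicollinearity. With more than two predictors, each $R_j^2$ would instead come from regressing $x_j$ on all the other predictors.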
Consequences of multicollinearity:
Large variance for the estimated coefficients of the collinear predictors
Unreliable statistical inference results
Interpretation of a coefficient is no longer “holding all other variables constant”, since this would be impossible for correlated predictors
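To make the first point concrete, the standard least-squares result (stated here as background, not taken from these slides) writes the variance of a coefficient estimate as a baseline term multiplied by the VIF:

$$\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma_\epsilon^2}{(n-1)\, s_{x_j}^2} \cdot \frac{1}{1 - R_j^2}$$

Here $\sigma_\epsilon^2$ is the error variance, $s_{x_j}^2$ is the sample variance of predictor $x_j$, and the second factor is $VIF_j$. The standard error therefore scales with $\sqrt{VIF_j}$: a predictor with $VIF_j = 4$ has a standard error, and hence a confidence interval, twice as wide as it would be if $x_j$ were uncorrelated with the other predictors.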
How to deal with issues of multicollinearity
Collect more data (often not feasible given practical constraints)
Redefine the correlated predictors so that the information they carry is kept but the collinearity is eliminated
For categorical predictors, avoid using levels with very few observations as the baseline
Remove one of the correlated variables
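One possible reading of "redefine the correlated predictors" is sketched below. It replaces hightemp with a hypothetical derived variable, `temp_spread = hightemp - avgtemp`, which is not defined in the slides. Because hightemp = avgtemp + temp_spread, a model with avgtemp and temp_spread carries the same information as one with avgtemp and hightemp. The sketch again assumes a two-predictor model and uses only the five preview rows, so the numbers are illustrative only.

```python
def r_squared(x, y):
    """Squared sample Pearson correlation between two sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Assumption: only the five preview rows, not the full data set.
hightemp = [83, 73, 74, 95, 44]
avgtemp = [66.5, 61, 63, 78, 48]

# Hypothetical redefined predictor: how far the daily high sits above the average.
temp_spread = [h - a for h, a in zip(hightemp, avgtemp)]

# With two predictors, VIF = 1 / (1 - r^2).
vif_before = 1 / (1 - r_squared(hightemp, avgtemp))
vif_after = 1 / (1 - r_squared(temp_spread, avgtemp))
print(f"VIF with hightemp: {vif_before:.1f}, with temp_spread: {vif_after:.1f}")
```

On these rows the VIF falls noticeably but does not reach 1, which suggests that redefining predictors can reduce collinearity without necessarily removing it. Whether it helps on the full data would need to be checked there.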