Feb 25, 2025
Exam corrections (optional) due Tuesday, March 4 at 11:59pm
Project proposal due TODAY at 11:59pm
Model conditions
Influential points
Model diagnostics
Leverage
Studentized residuals
Cook’s Distance
Today’s data contains a subset of the original Duke Lemur data set available in the TidyTuesday GitHub repo. This data includes information on “young adult” lemurs from the Coquerel’s sifaka species (PCOQ), the largest species at the Duke Lemur Center. The analysis will focus on the following variables:
age_at_wt_mo
: Age of the animal when the weight was taken, in months, computed as ((Weight_Date - DOB) / 365) * 12
weight_g
: Animal weight, in grams. Weights under 500 g are generally recorded to the nearest 0.1-1 g; weights over 500 g to the nearest 1-20 g.
The goal of the analysis is to use the age of the lemurs to understand variability in the weight.
How do we know if these assumptions hold in our data?
Constant variance is critical for reliable inference
Address violations by applying a transformation to the response variable
We can often check the independence condition based on the context of the data and how the observations were collected.
If the data were collected in a particular order, examine a scatterplot of the residuals versus order in which the data were collected.
If the data have a spatial element, plot the residuals on a map to examine potential spatial correlation.
# A tibble: 10 × 8
weight_g age_at_wt_mo .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3400 32.0 3557. -157. 0.0302 494. 0.00164 -0.324
2 3620 33.0 4063. -443. 0.0399 491. 0.0176 -0.922
3 3720 32.4 3800. -80.0 0.0163 495. 0.000224 -0.164
4 4440 32.6 3850. 590. 0.0177 489. 0.0132 1.21
5 3770 31.8 3457. 313. 0.0458 493. 0.0102 0.652
6 3920 31.9 3522. 398. 0.0350 492. 0.0124 0.826
7 4520 32.8 3979. 541. 0.0279 490. 0.0180 1.12
8 3700 33.2 4177. -477. 0.0626 491. 0.0337 -1.01
9 3690 31.9 3537. 153. 0.0329 494. 0.00172 0.318
10 3790 32.8 3949. -159. 0.0247 494. 0.00136 -0.328
Use the augment()
function in the broom package to output the model diagnostics (along with the predicted values and residuals)
.fitted
: predicted values
.se.fit
: standard errors of predicted values
.resid
: residuals
.hat
: leverage
.sigma
: estimate of the residual standard deviation when the corresponding observation is dropped from the model
.cooksd
: Cook's distance
.std.resid
: standardized residuals
An observation is influential if removing it has a noticeable impact on the regression coefficients
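The lecture computes these diagnostics with R's broom::augment(); as a language-neutral illustration, the same columns can be reproduced from first principles. A minimal NumPy sketch (the toy data and variable names are illustrative, not the lemur measurements):

```python
import numpy as np

# Illustrative toy data (not the lemur measurements)
x = np.array([31.8, 31.9, 32.0, 32.4, 32.6, 32.8, 33.0, 33.2])
y = np.array([3770.0, 3920.0, 3400.0, 3720.0, 4440.0, 4520.0, 3620.0, 3700.0])

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
n, k = X.shape                              # k = p + 1 coefficients

beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta                           # .fitted
resid = y - fitted                          # .resid

hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # .hat (leverage)
msr = np.sum(resid**2) / (n - k)                  # mean squared residuals

std_resid = resid / np.sqrt(msr * (1 - hat))          # .std.resid
cooksd = resid**2 * hat / (k * msr * (1 - hat) ** 2)  # .cooksd
```

Note that the `.std.resid` column scales each residual by the factor involving its own leverage, matching the more exact variance calculation discussed later in these notes.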
An observation’s influence on the regression line depends on
How close it lies to the general trend of the data
Its leverage
Cook’s Distance is a statistic that includes both of these components to measure an observation’s overall impact on the model
Cook's distance for the $i^{th}$ observation is

$$D_i = \frac{e_i^2}{(p+1)\,MSR}\left[\frac{h_i}{(1-h_i)^2}\right]$$

where $e_i$ is the residual, $h_i$ is the leverage, $p$ is the number of predictor terms in the model, and $MSR$ is the mean squared residuals.

This measure is a combination of

How well the model fits the $i^{th}$ observation (the magnitude of the residual $e_i$)

How far the $i^{th}$ observation is from the rest of the observations (the leverage $h_i$)

An observation with a large value of $D_i$ is said to have a strong influence on the model fit.

General thresholds: an observation with $D_i > 0.5$ is considered moderately influential, and an observation with $D_i > 1$ is considered highly influential.
Cook’s Distance is in the column .cooksd
in the output from the augment()
function
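A useful sanity check: Cook's distance can also be computed from its definition, the scaled squared change in all fitted values when observation $i$ is deleted, and the two calculations agree exactly. A NumPy sketch with simulated data (illustrative only, not the lemur data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(30, 34, size=12)
y = 300 * x + rng.normal(0, 400, size=12)

X = np.column_stack([np.ones_like(x), x])
n, k = X.shape                                # k = p + 1

def fit(Xm, ym):
    return np.linalg.lstsq(Xm, ym, rcond=None)[0]

beta = fit(X, y)
resid = y - X @ beta
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
msr = np.sum(resid**2) / (n - k)

# Shortcut formula: D_i = e_i^2 h_i / (k * MSR * (1 - h_i)^2)
d_shortcut = resid**2 * hat / (k * msr * (1 - hat) ** 2)

# Definition: scaled squared change in fitted values after deleting obs i
d_def = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = fit(X[keep], y[keep])
    d_def[i] = np.sum((X @ beta - X @ beta_i) ** 2) / (k * msr)

assert np.allclose(d_shortcut, d_def)
```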
With influential point
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -12314.360 | 4252.696 | -2.896 | 0.005 |
age_at_wt_mo | 496.591 | 131.225 | 3.784 | 0.000 |
Without influential point
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -6670.958 | 3495.136 | -1.909 | 0.061 |
age_at_wt_mo | 321.209 | 107.904 | 2.977 | 0.004 |
Let’s better understand the influential point.
Recall the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$

We focus on the diagonal elements, $h_i = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i$, the leverage of the $i^{th}$ observation

Observations with large values of $h_i$ are far from the center of the predictor space and have greater potential to influence the fitted model

The sum of the leverages for all points is $p + 1$, the number of coefficients in the model

The average value of leverage, $\bar{h}$, is $\frac{p+1}{n}$

An observation has large leverage if $h_i > \frac{2(p+1)}{n}$
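The leverage facts above can be verified numerically. A small NumPy sketch (toy predictor values, illustrative only; the last point is deliberately far from the others):

```python
import numpy as np

# One predictor (p = 1) plus intercept, so p + 1 = 2 coefficients
x = np.array([31.8, 31.9, 32.0, 32.4, 32.6, 32.8, 33.2, 33.4, 33.5, 40.0])
X = np.column_stack([np.ones_like(x), x])
n, p_plus_1 = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
hat = np.diag(H)

print(hat.sum())    # sum of leverages = p + 1 = 2
print(hat.mean())   # average leverage = (p + 1) / n
threshold = 2 * p_plus_1 / n
print(np.where(hat > threshold)[0])  # flags index 9, the x = 40.0 observation
```

Only the observation far from the center of the predictor values exceeds the $2(p+1)/n$ threshold.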
# A tibble: 2 × 8
weight_g age_at_wt_mo .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4040 33.5 4336. -296. 0.107 493. 0.0244 -0.639
2 6519 33.4 4272. 2247. 0.0871 389. 1.10 4.79
Why do you think these points have large leverage?
If there is a point with high leverage, ask
❓ Is there a data entry error?
❓ Is this observation within the scope of individuals for which you want to make predictions and draw conclusions?
❓ Is this observation impacting the estimates of the model coefficients? (Need more information!)
Just because a point has high leverage does not necessarily mean it will have a substantial impact on the regression. Therefore we need to check other measures.
What is the best way to identify outlier points that don’t fit the pattern from the regression line?
We can rescale residuals and put them on a common scale to more easily identify “large” residuals
We will consider two types of scaled residuals: standardized residuals and studentized residuals
The variance of the residuals can be estimated by the mean squared residuals (MSR):

$$MSR = \frac{\sum_{i=1}^{n} e_i^2}{n - p - 1}$$

We can use MSR to compute standardized residuals:

$$\text{std.res}_i = \frac{e_i}{\sqrt{MSR}}$$
Standardized residuals are produced by augment()
in the column .std.resid
We can examine the standardized residuals directly in the output of the augment() function
Let’s look at the value of the response variable to better understand potential outliers
MSR is only an approximation of the variance of the residuals.

The variance of the $i^{th}$ residual is actually $\sigma^2(1 - h_i)$

The studentized residual is the residual rescaled by this more exact calculation for the variance:

$$r_i = \frac{e_i}{\sqrt{MSR(1 - h_i)}}$$
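To make the distinction concrete, a NumPy sketch (toy data, illustrative only): the two rescaled residuals differ most where leverage is high, since dividing by $\sqrt{1 - h_i}$ inflates the residual of a high-leverage point.

```python
import numpy as np

x = np.array([31.8, 32.0, 32.4, 32.8, 33.2, 38.0])   # last point: high leverage
y = np.array([3770.0, 3400.0, 3720.0, 4520.0, 3700.0, 5200.0])

X = np.column_stack([np.ones_like(x), x])
n, k = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
msr = np.sum(resid**2) / (n - k)

standardized = resid / np.sqrt(msr)              # approximate scaling
studentized = resid / np.sqrt(msr * (1 - hat))   # exact: Var(e_i) = sigma^2 (1 - h_i)

# Since 0 < 1 - h_i < 1, each studentized residual is at least as large
# in magnitude as the corresponding standardized residual
```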
Standardized residuals, leverage, and Cook’s Distance should all be examined together
Examine plots of the measures to identify observations that are outliers, high leverage, and may potentially impact the model.
First consider if the outlier is a result of a data entry error.
If not, you may consider dropping an observation that is an outlier in the predictor variables if…
It is meaningful to drop the observation given the context of the problem
You intended to build a model on a smaller range of the predictor variables. Mention this in the write up of the results and be careful to avoid extrapolation when making predictions
It is generally not good practice to drop observations that are outliers in the value of the response variable
These are legitimate observations and should be in the model
You can try transformations or increasing the sample size by collecting more data
A general strategy when there are influential points is to fit the model with and without the influential points and compare the outcomes
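This with/without comparison is easy to script. A NumPy sketch with hypothetical simulated data and one deliberately planted influential point (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(31, 34, size=40)
y = 320 * x - 6700 + rng.normal(0, 450, size=40)
# Plant one influential point far above the trend
x = np.append(x, 33.4)
y = np.append(y, 6519.0)

def fit_coefs(xv, yv):
    X = np.column_stack([np.ones_like(xv), xv])
    return np.linalg.lstsq(X, yv, rcond=None)[0]  # (intercept, slope)

with_pt = fit_coefs(x, y)
without_pt = fit_coefs(x[:-1], y[:-1])
print("with point:   ", with_pt)
print("without point:", without_pt)
```

Because the planted point lies well above the trend and to the right of the center of the predictor values, it pulls the estimated slope upward, mirroring the shift seen in the coefficient tables above.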