Feb 25, 2025
Exam corrections (optional) due Tuesday, March 4 at 11:59pm
Project proposal due TODAY at 11:59pm
Model conditions
Influential points
Model diagnostics
Leverage
Studentized residuals
Cook’s Distance
Today’s data contains a subset of the original Duke Lemur data set available in the TidyTuesday GitHub repo. This data includes information on “young adult” lemurs from the Coquerel’s sifaka species (PCOQ), the largest species at the Duke Lemur Center. The analysis will focus on the following variables:
age_at_wt_mo
: Age of the animal when the weight was taken, in months, computed as ((Weight_Date - DOB) / 365) * 12
weight_g
: Animal weight, in grams. Weights under 500 g are generally recorded to the nearest 0.1-1 g; weights over 500 g to the nearest 1-20 g.
The goal of the analysis is to use the age of the lemurs to understand variability in the weight.
How do we know if these assumptions hold in our data?
Constant variance is critical for reliable inference
Address violations by applying a transformation to the response variable
We can often check the independence condition based on the context of the data and how the observations were collected.
If the data were collected in a particular order, examine a scatterplot of the residuals versus order in which the data were collected.
If the data have a spatial element, plot the residuals on a map to examine potential spatial correlation.
# A tibble: 10 × 8
weight_g age_at_wt_mo .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3400 32.0 3557. -157. 0.0302 494. 0.00164 -0.324
2 3620 33.0 4063. -443. 0.0399 491. 0.0176 -0.922
3 3720 32.4 3800. -80.0 0.0163 495. 0.000224 -0.164
4 4440 32.6 3850. 590. 0.0177 489. 0.0132 1.21
5 3770 31.8 3457. 313. 0.0458 493. 0.0102 0.652
6 3920 31.9 3522. 398. 0.0350 492. 0.0124 0.826
7 4520 32.8 3979. 541. 0.0279 490. 0.0180 1.12
8 3700 33.2 4177. -477. 0.0626 491. 0.0337 -1.01
9 3690 31.9 3537. 153. 0.0329 494. 0.00172 0.318
10 3790 32.8 3949. -159. 0.0247 494. 0.00136 -0.328
Use the augment()
function in the broom package to output the model diagnostics (along with the predicted values and residuals)
.fitted
: predicted values
.se.fit
: standard errors of predicted values
.resid
: residuals
.hat
: leverage
.sigma
: estimate of the residual standard deviation when the corresponding observation is dropped from the model
.cooksd
: Cook's distance
.std.resid
: standardized residuals
An observation is influential if removing it has a noticeable impact on the regression coefficients
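The lecture computes these diagnostics with R's broom::augment(); as a language-neutral illustration, the same columns can be reproduced from first principles. A minimal NumPy sketch (the toy data and variable names are illustrative, not the lemur measurements):

```python
import numpy as np

# Illustrative toy data (not the lemur measurements)
x = np.array([31.8, 31.9, 32.0, 32.4, 32.6, 32.8, 33.0, 33.2])
y = np.array([3770.0, 3920.0, 3400.0, 3720.0, 4440.0, 4520.0, 3620.0, 3700.0])

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
n, k = X.shape                              # k = p + 1 coefficients

beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta                           # .fitted
resid = y - fitted                          # .resid

hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # .hat (leverage)
msr = np.sum(resid**2) / (n - k)                  # mean squared residuals

std_resid = resid / np.sqrt(msr * (1 - hat))          # .std.resid
cooksd = resid**2 * hat / (k * msr * (1 - hat) ** 2)  # .cooksd
```

Note that the `.std.resid` column scales each residual by the factor involving its own leverage, matching the more exact variance calculation discussed later in these notes.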
An observation’s influence on the regression line depends on
How close it lies to the general trend of the data
Its leverage
Cook’s Distance is a statistic that includes both of these components to measure an observation’s overall impact on the model
Cook's distance for the $i^{th}$ observation is

$$D_i = \frac{e_i^2}{(p+1)\,MSR}\left[\frac{h_i}{(1-h_i)^2}\right]$$

where $e_i$ is the residual, $h_i$ is the leverage, $p$ is the number of predictor terms in the model, and $MSR$ is the mean squared residuals.

This measure is a combination of

How well the model fits the $i^{th}$ observation (the magnitude of the residual $e_i$)

How far the $i^{th}$ observation is from the rest of the observations (the leverage $h_i$)

An observation with a large value of $D_i$ is said to have a strong influence on the model fit.

General thresholds: an observation with $D_i > 0.5$ is considered moderately influential, and an observation with $D_i > 1$ is considered highly influential.
Cook’s Distance is in the column .cooksd
in the output from the augment()
function
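A useful sanity check: Cook's distance can also be computed from its definition, the scaled squared change in all fitted values when observation $i$ is deleted, and the two calculations agree exactly. A NumPy sketch with simulated data (illustrative only, not the lemur data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(30, 34, size=12)
y = 300 * x + rng.normal(0, 400, size=12)

X = np.column_stack([np.ones_like(x), x])
n, k = X.shape                                # k = p + 1

def fit(Xm, ym):
    return np.linalg.lstsq(Xm, ym, rcond=None)[0]

beta = fit(X, y)
resid = y - X @ beta
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
msr = np.sum(resid**2) / (n - k)

# Shortcut formula: D_i = e_i^2 h_i / (k * MSR * (1 - h_i)^2)
d_shortcut = resid**2 * hat / (k * msr * (1 - hat) ** 2)

# Definition: scaled squared change in fitted values after deleting obs i
d_def = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = fit(X[keep], y[keep])
    d_def[i] = np.sum((X @ beta - X @ beta_i) ** 2) / (k * msr)

assert np.allclose(d_shortcut, d_def)
```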
With influential point
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -12314.360 | 4252.696 | -2.896 | 0.005 |
age_at_wt_mo | 496.591 | 131.225 | 3.784 | 0.000 |
Without influential point
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -6670.958 | 3495.136 | -1.909 | 0.061 |
age_at_wt_mo | 321.209 | 107.904 | 2.977 | 0.004 |
Let’s better understand the influential point.
Recall the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$

We focus on the diagonal elements, $h_i = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i$, the leverage of the $i^{th}$ observation

Observations with large values of $h_i$ are far from the center of the predictor space and have greater potential to influence the fitted model

The sum of the leverages for all points is $p + 1$, the number of coefficients in the model

The average value of leverage, $\bar{h}$, is $\frac{p+1}{n}$

An observation has large leverage if $h_i > \frac{2(p+1)}{n}$
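The leverage facts above can be verified numerically. A small NumPy sketch (toy predictor values, illustrative only; the last point is deliberately far from the others):

```python
import numpy as np

# One predictor (p = 1) plus intercept, so p + 1 = 2 coefficients
x = np.array([31.8, 31.9, 32.0, 32.4, 32.6, 32.8, 33.2, 33.4, 33.5, 40.0])
X = np.column_stack([np.ones_like(x), x])
n, p_plus_1 = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
hat = np.diag(H)

print(hat.sum())    # sum of leverages = p + 1 = 2
print(hat.mean())   # average leverage = (p + 1) / n
threshold = 2 * p_plus_1 / n
print(np.where(hat > threshold)[0])  # flags index 9, the x = 40.0 observation
```

Only the observation far from the center of the predictor values exceeds the $2(p+1)/n$ threshold.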
# A tibble: 2 × 8
weight_g age_at_wt_mo .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4040 33.5 4336. -296. 0.107 493. 0.0244 -0.639
2 6519 33.4 4272. 2247. 0.0871 389. 1.10 4.79
Why do you think these points have large leverage?
If there is a point with high leverage, ask
❓ Is there a data entry error?
❓ Is this observation within the scope of individuals for which you want to make predictions and draw conclusions?
❓ Is this observation impacting the estimates of the model coefficients? (Need more information!)
Just because a point has high leverage does not necessarily mean it will have a substantial impact on the regression. Therefore we need to check other measures.
What is the best way to identify outlier points that don’t fit the pattern from the regression line?
We can rescale residuals and put them on a common scale to more easily identify “large” residuals
We will consider two types of scaled residuals: standardized residuals and studentized residuals
The variance of the residuals can be estimated by the mean squared residuals (MSR):

$$MSR = \frac{\sum_{i=1}^{n} e_i^2}{n - p - 1}$$

We can use MSR to compute standardized residuals:

$$\text{std.res}_i = \frac{e_i}{\sqrt{MSR}}$$
Standardized residuals are produced by augment()
in the column .std.resid
We can examine the standardized residuals directly in the output of the augment() function
Let’s look at the value of the response variable to better understand potential outliers
MSR is only an approximation of the variance of the residuals.

The variance of the $i^{th}$ residual is actually $\sigma^2(1 - h_i)$

The studentized residual is the residual rescaled by this more exact calculation for the variance:

$$r_i = \frac{e_i}{\sqrt{MSR(1 - h_i)}}$$
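To make the distinction concrete, a NumPy sketch (toy data, illustrative only): the two rescaled residuals differ most where leverage is high, since dividing by $\sqrt{1 - h_i}$ inflates the residual of a high-leverage point.

```python
import numpy as np

x = np.array([31.8, 32.0, 32.4, 32.8, 33.2, 38.0])   # last point: high leverage
y = np.array([3770.0, 3400.0, 3720.0, 4520.0, 3700.0, 5200.0])

X = np.column_stack([np.ones_like(x), x])
n, k = X.shape

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
msr = np.sum(resid**2) / (n - k)

standardized = resid / np.sqrt(msr)              # approximate scaling
studentized = resid / np.sqrt(msr * (1 - hat))   # exact: Var(e_i) = sigma^2 (1 - h_i)

# Since 0 < 1 - h_i < 1, each studentized residual is at least as large
# in magnitude as the corresponding standardized residual
```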
Standardized residuals, leverage, and Cook’s Distance should all be examined together
Examine plots of the measures to identify observations that are outliers, high leverage, and may potentially impact the model.
First consider if the outlier is a result of a data entry error.
If not, you may consider dropping an observation that is an outlier in the predictor variables if…
It is meaningful to drop the observation given the context of the problem
You intended to build a model on a smaller range of the predictor variables. Mention this in the write up of the results and be careful to avoid extrapolation when making predictions
It is generally not good practice to drop observations that are outliers in the value of the response variable
These are legitimate observations and should be in the model
You can try transformations or increasing the sample size by collecting more data
A general strategy when there are influential points is to fit the model with and without the influential points and compare the outcomes
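This with/without comparison is easy to script. A NumPy sketch with hypothetical simulated data and one deliberately planted influential point (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(31, 34, size=40)
y = 320 * x - 6700 + rng.normal(0, 450, size=40)
# Plant one influential point far above the trend
x = np.append(x, 33.4)
y = np.append(y, 6519.0)

def fit_coefs(xv, yv):
    X = np.column_stack([np.ones_like(xv), xv])
    return np.linalg.lstsq(X, yv, rcond=None)[0]  # (intercept, slope)

with_pt = fit_coefs(x, y)
without_pt = fit_coefs(x[:-1], y[:-1])
print("with point:   ", with_pt)
print("without point:", without_pt)
```

Because the planted point lies well above the trend and to the right of the center of the predictor values, it pulls the estimated slope upward, mirroring the shift seen in the coefficient tables above.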