Types of predictors cont’d + Model comparison
Jan 30, 2025
HW 01 due TODAY at 11:59pm
Team labs start on Friday
Learn more about the Academic Resource Center
Statistics experience due Tuesday, April 22
Centering quantitative predictors
Standardizing quantitative predictors
Interaction terms
Model comparison
RMSE
Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50
data frame in the openintro R package.
# A tibble: 50 × 4
annual_income_th debt_to_income verified_income interest_rate
<dbl> <dbl> <fct> <dbl>
1 59 0.558 Not Verified 10.9
2 60 1.31 Not Verified 9.92
3 75 1.06 Verified 26.3
4 75 0.574 Not Verified 9.92
5 254 0.238 Not Verified 9.43
6 67 1.08 Source Verified 9.92
7 28.8 0.0997 Source Verified 17.1
8 80 0.351 Not Verified 6.08
9 34 0.698 Not Verified 7.97
10 80 0.167 Source Verified 12.6
# ℹ 40 more rows
Predictors:
annual_income_th: Annual income (in $1000s)
debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)
Response:
interest_rate: Interest rate for the loan
Goal: Use these predictors in a single model to understand variability in interest rate.
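# Fit the main-effects model; tidy() is from broom and kable() is from knitr
int_fit <- lm(interest_rate ~ debt_to_income + verified_income + annual_income_th,
              data = loan50)
# Display the coefficient table rounded to 3 decimal places
tidy(int_fit) |>
  kable(digits = 3)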
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 10.726 | 1.507 | 7.116 | 0.000 |
debt_to_income | 0.671 | 0.676 | 0.993 | 0.326 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 |
annual_income_th | -0.021 | 0.011 | -1.804 | 0.078 |
verified_income: coefficient estimates with 95% confidence intervals
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 10.726 | 1.507 | 7.116 | 0.000 | 7.690 | 13.762 |
debt_to_income | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 |
annual_income_th | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 |
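The conf.low and conf.high columns can be reproduced from the estimate and std.error columns. The following is a minimal sketch in base R. It assumes 95% intervals based on a t distribution with 50 - 5 = 45 residual degrees of freedom (50 loans, 5 coefficients). The inputs are the rounded values from the table, so the results can differ from the displayed bounds in the third decimal place. The names `est`, `se`, `t_star`, and `ci` are illustrative, not from the original slides.

```r
# Estimates and standard errors copied (already rounded) from the table above
est <- c(intercept = 10.726, debt_to_income = 0.671,
         source_verified = 2.211, verified = 6.880,
         annual_income_th = -0.021)
se <- c(1.507, 0.676, 1.399, 1.801, 0.011)

# Assumption: 95% confidence level, residual df = n - number of coefficients
df_resid <- 50 - 5
t_star <- qt(0.975, df = df_resid)

# Interval = estimate +/- critical value * standard error
ci <- cbind(conf.low = est - t_star * se,
            conf.high = est + t_star * se)
round(ci, 3)
```

An interval that excludes 0, such as the one for the Verified indicator, corresponds to a coefficient whose p-value is below 0.05.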
The baseline level of verified_income is Not Verified.
One common type of centering is mean-centering, in which every value of a predictor is shifted by its mean
Only quantitative predictors are centered
Center all quantitative predictors in the model for ease of interpretation
What is one reason one might want to center the quantitative predictors? What are the units of centered variables?
Use the scale() function with center = TRUE and scale = FALSE to mean-center variables
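As a minimal sketch of mean-centering, the example below applies scale() to the first 10 values of each quantitative predictor from the data preview (not the full 50-row data set). The variable names debt_to_inc_cent and annual_inc_cent match the terms in the centered model output. Wrapping the result in as.numeric() is a choice made here, since scale() returns a one-column matrix.

```r
# First 10 values of each predictor, copied from the data preview
debt_to_income <- c(0.558, 1.31, 1.06, 0.574, 0.238,
                    1.08, 0.0997, 0.351, 0.698, 0.167)
annual_income_th <- c(59, 60, 75, 75, 254, 67, 28.8, 80, 34, 80)

# center = TRUE subtracts the mean; scale = FALSE leaves the spread unchanged
debt_to_inc_cent <- as.numeric(scale(debt_to_income, center = TRUE, scale = FALSE))
annual_inc_cent <- as.numeric(scale(annual_income_th, center = TRUE, scale = FALSE))

# Centered values are deviations from the mean, still in the original units
head(annual_inc_cent)
mean(annual_inc_cent)  # zero, up to floating-point error
```

A centered value of 0 corresponds to a borrower at the sample mean, and the units are unchanged (here, thousands of dollars for income).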
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.444 | 0.977 | 9.663 | 0.000 |
debt_to_inc_cent | 0.671 | 0.676 | 0.993 | 0.326 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 |
annual_inc_cent | -0.021 | 0.011 | -1.804 | 0.078 |
Term | Original Model | Centered Model |
---|---|---|
(Intercept) | 10.726 | 9.444 |
debt_to_income | 0.671 | 0.671 |
verified_incomeSource Verified | 2.211 | 2.211 |
verified_incomeVerified | 6.880 | 6.880 |
annual_income_th | -0.021 | -0.021 |
How has the model changed? How has the model remained the same?
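One way to see why only the intercept changes is to rewrite the fitted equation algebraically. The symbols below are notation introduced here for illustration: $x_1$ is debt_to_income, $x_2$ is annual_income_th, $\bar{x}_1$ and $\bar{x}_2$ are their sample means, and $SV$ and $V$ are the Source Verified and Verified indicators.

$$
\begin{aligned}
\hat{y} &= \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 SV + \hat\beta_3 V + \hat\beta_4 x_2 \\
        &= \left(\hat\beta_0 + \hat\beta_1 \bar{x}_1 + \hat\beta_4 \bar{x}_2\right)
           + \hat\beta_1 (x_1 - \bar{x}_1) + \hat\beta_2 SV + \hat\beta_3 V
           + \hat\beta_4 (x_2 - \bar{x}_2)
\end{aligned}
$$

The slopes and indicator coefficients are identical in both forms. The centered intercept, $\hat\beta_0 + \hat\beta_1 \bar{x}_1 + \hat\beta_4 \bar{x}_2$, is the expected interest rate for a borrower with Not Verified income whose debt-to-income ratio and annual income equal the sample means.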
Only quantitative predictors are standardized
Standardize all quantitative predictors in the model for ease of interpretation
What is one reason one might want to standardize the quantitative predictors? What are the units of standardized variables?
Use the scale() function with center = TRUE and scale = TRUE to standardize variables
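A minimal sketch of standardizing, again using only the first 10 values from the data preview rather than the full data set. The names debt_to_inc_std and annual_inc_std match the terms in the standardized model output; `by_hand` is an illustrative name used to show what scale() computes.

```r
# First 10 values of each predictor, copied from the data preview
debt_to_income <- c(0.558, 1.31, 1.06, 0.574, 0.238,
                    1.08, 0.0997, 0.351, 0.698, 0.167)
annual_income_th <- c(59, 60, 75, 75, 254, 67, 28.8, 80, 34, 80)

# center = TRUE subtracts the mean; scale = TRUE then divides by the standard deviation
debt_to_inc_std <- as.numeric(scale(debt_to_income, center = TRUE, scale = TRUE))
annual_inc_std <- as.numeric(scale(annual_income_th, center = TRUE, scale = TRUE))

# The same calculation written out by hand
by_hand <- (annual_income_th - mean(annual_income_th)) / sd(annual_income_th)
all.equal(annual_inc_std, by_hand)

# Standardized variables have mean 0 and standard deviation 1
c(mean = mean(annual_inc_std), sd = sd(annual_inc_std))
```

Standardized values are measured in standard deviations from the mean, so a one-unit change means a one standard deviation change in the original predictor.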
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.444 | 0.977 | 9.663 | 0.000 |
debt_to_inc_std | 0.643 | 0.648 | 0.993 | 0.326 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 |
annual_inc_std | -1.180 | 0.654 | -1.804 | 0.078 |
Term | Original Model | Standardized Model |
---|---|---|
(Intercept) | 10.726 | 9.444 |
debt_to_income | 0.671 | 0.643 |
verified_incomeSource Verified | 2.211 | 2.211 |
verified_incomeVerified | 6.880 | 6.880 |
annual_income_th | -0.021 | -1.180 |
How has the model changed? How has the model remained the same?
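The relationship between the two sets of coefficients can be checked numerically. The sketch below refits both models on the 10 loans shown in the data preview, so the estimates will not match the tables above, which use all 50 loans. The data frame name `loans10` and the object names `fit_orig` and `fit_std` are illustrative assumptions.

```r
# The 10 loans shown in the data preview (not the full 50-row data set)
loans10 <- data.frame(
  annual_income_th = c(59, 60, 75, 75, 254, 67, 28.8, 80, 34, 80),
  debt_to_income = c(0.558, 1.31, 1.06, 0.574, 0.238,
                     1.08, 0.0997, 0.351, 0.698, 0.167),
  verified_income = factor(c("Not Verified", "Not Verified", "Verified",
                             "Not Verified", "Not Verified", "Source Verified",
                             "Source Verified", "Not Verified", "Not Verified",
                             "Source Verified")),
  interest_rate = c(10.9, 9.92, 26.3, 9.92, 9.43, 9.92, 17.1, 6.08, 7.97, 12.6)
)

# Standardize the two quantitative predictors
loans10$debt_to_inc_std <- as.numeric(scale(loans10$debt_to_income,
                                            center = TRUE, scale = TRUE))
loans10$annual_inc_std <- as.numeric(scale(loans10$annual_income_th,
                                           center = TRUE, scale = TRUE))

fit_orig <- lm(interest_rate ~ debt_to_income + verified_income + annual_income_th,
               data = loans10)
fit_std <- lm(interest_rate ~ debt_to_inc_std + verified_income + annual_inc_std,
              data = loans10)

# Standardized slope equals the original slope times the predictor's standard deviation
coef(fit_std)[["annual_inc_std"]]
coef(fit_orig)[["annual_income_th"]] * sd(loans10$annual_income_th)

# Indicator coefficients and fitted values are the same in both models
cbind(original = coef(fit_orig), standardized = coef(fit_std))
```

The same scaling appears in the tables above: each test statistic and p-value is unchanged because the estimate and its standard error are multiplied by the same factor.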
The lines are not parallel, indicating a potential interaction effect: the slope of annual income may differ based on income verification status.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.560 | 2.034 | 4.700 | 0.000 |
debt_to_income | 0.691 | 0.685 | 1.009 | 0.319 |
verified_incomeSource Verified | 3.577 | 2.539 | 1.409 | 0.166 |
verified_incomeVerified | 9.923 | 3.654 | 2.716 | 0.009 |
annual_income_th | -0.007 | 0.020 | -0.341 | 0.735 |
verified_incomeSource Verified:annual_income_th | -0.016 | 0.026 | -0.643 | 0.523 |
verified_incomeVerified:annual_income_th | -0.032 | 0.033 | -0.979 | 0.333 |
Write the regression equation for the people with Not Verified
income.
Write the regression equation for people with Verified
income.
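The equation for each verification level follows from adding the relevant indicator coefficient to the intercept and the relevant interaction coefficient to the annual_income_th slope. Below is a minimal sketch using the rounded estimates from the interaction model table. The vector `b`, its element names, and the data frame `equations` are illustrative names, not part of the original slides.

```r
# Rounded estimates copied from the interaction model table
b <- c(intercept = 9.560, debt_to_income = 0.691,
       source_verified = 3.577, verified = 9.923,
       annual_income_th = -0.007,
       source_verified_x_income = -0.016,
       verified_x_income = -0.032)

# Indicators shift the intercept; interaction terms shift the income slope
equations <- data.frame(
  verified_income = c("Not Verified", "Source Verified", "Verified"),
  intercept = c(b[["intercept"]],
                b[["intercept"]] + b[["source_verified"]],
                b[["intercept"]] + b[["verified"]]),
  debt_to_income_slope = b[["debt_to_income"]],
  annual_income_th_slope = c(b[["annual_income_th"]],
                             b[["annual_income_th"]] + b[["source_verified_x_income"]],
                             b[["annual_income_th"]] + b[["verified_x_income"]])
)
equations
```

For example, the Verified row gives an intercept of 9.560 + 9.923 = 19.483 and an annual_income_th slope of -0.007 + -0.032 = -0.039, while the debt_to_income slope is 0.691 for every group because it is not part of an interaction.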
Slope of annual_income_th for source verified: If the income is source verified, we expect the interest rate to decrease by 0.023 percentage points (-0.007 + -0.016), on average, for each additional thousand dollars in annual income, holding all else constant.
In general, how do
indicators for categorical predictors impact the model equation?
interaction terms impact the model equation?
Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)
R-squared, $R^2$: The proportion of variability in the outcome that is explained by the model
When comparing models, do we prefer the model with the lower or higher RMSE?
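Both measures can be computed directly from a fitted model's residuals. The sketch below compares two nested models fit to the 10 loans shown in the data preview, so the values are for illustration only and say nothing about the full 50-loan data. The helper functions `rmse()` and `r_squared()` are hypothetical names defined here, and dividing by n inside the RMSE is an assumption based on the "average error" description above.

```r
# First 10 loans from the data preview (not the full 50-row data set)
interest_rate <- c(10.9, 9.92, 26.3, 9.92, 9.43, 9.92, 17.1, 6.08, 7.97, 12.6)
debt_to_income <- c(0.558, 1.31, 1.06, 0.574, 0.238,
                    1.08, 0.0997, 0.351, 0.698, 0.167)
annual_income_th <- c(59, 60, 75, 75, 254, 67, 28.8, 80, 34, 80)

fit_small <- lm(interest_rate ~ debt_to_income)
fit_large <- lm(interest_rate ~ debt_to_income + annual_income_th)

# RMSE: square root of the mean squared residual (assumed denominator: n)
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

# R-squared: 1 - (residual sum of squares) / (total sum of squares)
r_squared <- function(fit) {
  y <- fitted(fit) + residuals(fit)  # recover the observed outcome
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

data.frame(
  model = c("small", "large"),
  rmse = c(rmse(fit_small), rmse(fit_large)),
  r_squared = c(r_squared(fit_small), r_squared(fit_large))
)
```

A lower RMSE means the predictions are closer to the observed values on average. With this definition, adding a predictor to a least-squares model never increases RMSE or decreases R-squared on the data used to fit it.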
Though we use
If we only use
glance()
function to get where
Fit and interpreted models with centered and standardized variables
Interpreted interaction terms
Used RMSE and R-squared to compare models
Inference for regression