Multiple linear regression

Types of predictors cont’d + Model comparison

Prof. Maria Tackett

Jan 30, 2025

term	estimate	std.error	statistic	p.value
(Intercept)	10.726	1.507	7.116	0.000
debt_to_income	0.671	0.676	0.993	0.326
verified_incomeSource Verified	2.211	1.399	1.581	0.121
verified_incomeVerified	6.880	1.801	3.820	0.000
annual_income_th	-0.021	0.011	-1.804	0.078

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	10.726	1.507	7.116	0.000	7.690	13.762
debt_to_income	0.671	0.676	0.993	0.326	-0.690	2.033
verified_incomeSource Verified	2.211	1.399	1.581	0.121	-0.606	5.028
verified_incomeVerified	6.880	1.801	3.820	0.000	3.253	10.508
annual_income_th	-0.021	0.011	-1.804	0.078	-0.043	0.002
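
The second table adds conf.low and conf.high. Assuming these are the usual intervals of the form estimate ± t* × std.error (the slides do not state the confidence level, so this is an assumption), the multiplier t* can be backed out of each row. A minimal sketch using only the numbers printed above; the helper name `implied_multiplier` is made up for illustration:

```python
# Rows copied from the table above: (term, estimate, std.error, conf.low, conf.high)
rows = [
    ("(Intercept)", 10.726, 1.507, 7.690, 13.762),
    ("debt_to_income", 0.671, 0.676, -0.690, 2.033),
    ("verified_incomeSource Verified", 2.211, 1.399, -0.606, 5.028),
    ("verified_incomeVerified", 6.880, 1.801, 3.253, 10.508),
    ("annual_income_th", -0.021, 0.011, -0.043, 0.002),
]


def implied_multiplier(std_error, low, high):
    """Half-width of the interval divided by the standard error."""
    return (high - low) / 2 / std_error


multipliers = {term: implied_multiplier(se, lo, hi) for term, _, se, lo, hi in rows}
for term, m in multipliers.items():
    print(f"{term:32s} {m:.2f}")
```

Every row gives a multiplier close to 2, which is consistent with a 95% interval. The values are computed from rounded table entries, so they are approximate.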

Centering

Centering a quantitative predictor means shifting every value by some constant $C$

$X_{cent} = X - C$

One common type of centering is mean-centering, in which every value of a predictor is shifted by its mean
Only quantitative predictors are centered
Center all quantitative predictors in the model for ease of interpretation

What is one reason one might want to center the quantitative predictors? What are the units of centered variables?

term	estimate	std.error	statistic	p.value
(Intercept)	9.444	0.977	9.663	0.000
debt_to_inc_cent	0.671	0.676	0.993	0.326
verified_incomeSource Verified	2.211	1.399	1.581	0.121
verified_incomeVerified	6.880	1.801	3.820	0.000
annual_inc_cent	-0.021	0.011	-1.804	0.078

Term	Original Model	Centered Model
(Intercept)	10.726	9.444
debt_to_income	0.671	0.671
verified_incomeSource Verified	2.211	2.211
verified_incomeVerified	6.880	6.880
annual_income_th	-0.021	-0.021
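
In the comparison above, only the intercept changes after centering. A minimal sketch of why, using a single predictor and made-up data (not the loan data from these slides); `fit_line`, `income`, and `rate` are hypothetical names:

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares (intercept, slope) for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up illustrative data, NOT the data behind the tables above
income = [35.0, 48.0, 60.0, 75.0, 90.0, 120.0]
rate = [14.2, 13.1, 12.5, 11.0, 10.4, 9.1]

# Mean-centering: X_cent = X - mean(X)
income_cent = [v - mean(income) for v in income]

b0, b1 = fit_line(income, rate)
b0_cent, b1_cent = fit_line(income_cent, rate)

print(f"original: intercept={b0:.3f}, slope={b1:.3f}")
print(f"centered: intercept={b0_cent:.3f}, slope={b1_cent:.3f}")
```

The slope is identical in both fits. The centered intercept is the predicted response at the mean of the predictor, which in this one-predictor case equals the mean of the response. In the model above, the centered intercept is instead the predicted response at the mean of each quantitative predictor for the baseline level of verified_income.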

Standardizing a quantitative predictor means shifting every value by the mean and dividing by the standard deviation of that variable

$X_{std} = \frac{X - \bar{X}}{S_{X}}$

What is one reason one might want to standardize the quantitative predictors? What are the units of standardized variables?

term	estimate	std.error	statistic	p.value
(Intercept)	9.444	0.977	9.663	0.000
debt_to_inc_std	0.643	0.648	0.993	0.326
verified_incomeSource Verified	2.211	1.399	1.581	0.121
verified_incomeVerified	6.880	1.801	3.820	0.000
annual_inc_std	-1.180	0.654	-1.804	0.078

Term	Original Model	Standardized Model
(Intercept)	10.726	9.444
debt_to_income	0.671	0.643
verified_incomeSource Verified	2.211	2.211
verified_incomeVerified	6.880	6.880
annual_income_th	-0.021	-1.180
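
In the comparison above, standardizing changes the slopes of the quantitative predictors as well as the intercept. A minimal sketch of the relationship, again with one predictor and made-up data (not the loan data); `fit_line`, `income`, and `rate` are hypothetical names:

```python
from statistics import mean, stdev


def fit_line(x, y):
    """Least-squares (intercept, slope) for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up illustrative data, NOT the data behind the tables above
income = [35.0, 48.0, 60.0, 75.0, 90.0, 120.0]
rate = [14.2, 13.1, 12.5, 11.0, 10.4, 9.1]

# Standardizing: X_std = (X - mean(X)) / S_X
s_x = stdev(income)
income_std = [(v - mean(income)) / s_x for v in income]

b0, b1 = fit_line(income, rate)
b0_std, b1_std = fit_line(income_std, rate)

print(f"original:     slope={b1:.4f} per unit of X")
print(f"standardized: slope={b1_std:.4f} per standard deviation of X")
```

The standardized slope equals the original slope multiplied by $S_{X}$, so it describes the expected change in the response per one standard deviation increase in the predictor. Read this way, the ratio -1.180 / -0.021 in the table suggests a standard deviation of roughly 56 for annual_income_th, up to rounding in the printed estimates.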

term	estimate	std.error	statistic	p.value
(Intercept)	9.560	2.034	4.700	0.000
debt_to_income	0.691	0.685	1.009	0.319
verified_incomeSource Verified	3.577	2.539	1.409	0.166
verified_incomeVerified	9.923	3.654	2.716	0.009
annual_income_th	-0.007	0.020	-0.341	0.735
verified_incomeSource Verified:annual_income_th	-0.016	0.026	-0.643	0.523
verified_incomeVerified:annual_income_th	-0.032	0.033	-0.979	0.333
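
With the interaction terms, the slope of annual_income_th differs by level of verified_income: each interaction estimate is added to the baseline slope. A short sketch of that arithmetic using the estimates printed above; the function names are made up for illustration, and the baseline level is whichever level has no row of its own in the table:

```python
# Estimates copied from the interaction model table above
coef = {
    "(Intercept)": 9.560,
    "debt_to_income": 0.691,
    "verified_incomeSource Verified": 3.577,
    "verified_incomeVerified": 9.923,
    "annual_income_th": -0.007,
    "verified_incomeSource Verified:annual_income_th": -0.016,
    "verified_incomeVerified:annual_income_th": -0.032,
}


def income_slope(level=None):
    """Slope of annual_income_th for a verified_income level (None = baseline)."""
    slope = coef["annual_income_th"]
    if level is not None:
        slope += coef[f"verified_income{level}:annual_income_th"]
    return slope


def group_intercept(level=None):
    """Intercept for a verified_income level, with all quantitative predictors at 0."""
    value = coef["(Intercept)"]
    if level is not None:
        value += coef[f"verified_income{level}"]
    return value


for level in (None, "Source Verified", "Verified"):
    label = level or "baseline"
    print(f"{label:16s} intercept={group_intercept(level):.3f} "
          f"income slope={income_slope(level):.3f}")
```

The resulting income slopes are -0.007 for the baseline level, -0.023 for Source Verified, and -0.039 for Verified, holding debt_to_income constant.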

When comparing models, do we prefer the model with the lower or higher RMSE?
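
As a reminder of what RMSE measures, here is a minimal sketch using the divide-by-$n$ definition (an assumption; these slides do not state the formula) and made-up values. The names `observed`, `pred_a`, and `pred_b` are hypothetical:

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean squared error: square root of the average squared residual."""
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    return sqrt(sum(squared) / len(squared))


# Made-up values for illustration only
observed = [10.0, 12.0, 9.0, 15.0]
pred_a = [11.0, 11.0, 10.0, 14.0]  # every prediction is off by 1
pred_b = [10.5, 12.5, 8.5, 15.5]   # every prediction is off by 0.5

rmse_a = rmse(observed, pred_a)
rmse_b = rmse(observed, pred_b)
print(f"model A RMSE: {rmse_a:.2f}")
print(f"model B RMSE: {rmse_b:.2f}")
```

RMSE is in the units of the response and summarizes the typical size of a prediction error, so the predictions in `pred_b` are closer to the observed values.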
Though we use $R^{2}$ to assess the model fit, it is generally unreliable for comparing models with different numbers of predictors. Why?
- $R^{2}$ will stay the same or increase as we add more variables to the model. Let’s show why this is true.
- If we only use $R^{2}$ to choose a best-fit model, we will be prone to choose the model with the most predictor variables.
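
One way to see this: the smaller model is the larger model with the new coefficient fixed at 0, so least squares on the larger model can never produce a larger sum of squared residuals. The total sum of squares does not depend on the model, so $R^{2} = 1 - SS_{Error} / SS_{Total}$ cannot decrease. The sketch below checks this numerically with simulated data and a pure-noise predictor; all names and the simulated setup are assumptions for illustration, and the fit uses a small hand-written solver rather than any particular library:

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 from a least-squares fit of y on an intercept plus the given predictor columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    y_bar = sum(y) / len(y)
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_error / ss_total


random.seed(1)
n = 50
x1 = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]  # unrelated to y by construction
y = [2 + 3 * a + random.gauss(0, 1) for a in x1]

r2_small = r_squared([x1], y)
r2_big = r_squared([x1, noise], y)
print(f"R^2 with x1 only:      {r2_small:.4f}")
print(f"R^2 with x1 and noise: {r2_big:.4f}")
```

Even though `noise` has no relationship with `y`, adding it does not lower $R^{2}$.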
