Multiple linear regression (MLR)

Author

Prof. Maria Tackett

Published

Jan 23, 2025

Topics

- Introducing multiple linear regression
- Exploratory data analysis for multiple linear regression
- Fitting the least squares line
- Interpreting coefficients for quantitative predictors
- Prediction

Computing setup

# load packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(patchwork)
library(knitr)
library(kableExtra)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

Data: Peer-to-peer lender

Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.

# A tibble: 50 × 4
   annual_income debt_to_income verified_income interest_rate
           <dbl>          <dbl> <fct>                   <dbl>
 1         59000         0.558  Not Verified            10.9 
 2         60000         1.31   Not Verified             9.92
 3         75000         1.06   Verified                26.3 
 4         75000         0.574  Not Verified             9.92
 5        254000         0.238  Not Verified             9.43
 6         67000         1.08   Source Verified          9.92
 7         28800         0.0997 Source Verified         17.1 
 8         80000         0.351  Not Verified             6.08
 9         34000         0.698  Not Verified             7.97
10         80000         0.167  Source Verified         12.6 
# ℹ 40 more rows

Variables

Predictors:

annual_income: Annual income
debt_to_income: Debt-to-income ratio, i.e., a borrower’s total debt as a percentage of their total income
verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)

Outcome:

interest_rate: Interest rate for the loan

Outcome: `interest_rate`

Min	Median	Max	IQR
5.31	9.93	26.3	5.755

Predictors

Data manipulation 1: Rescale income

loan50 <- loan50 |>
  mutate(annual_income_th = annual_income / 1000)

Why did we rescale income?
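
One consequence of the rescaling can be checked directly: dividing a predictor by 1000 multiplies its slope by 1000 and leaves the fitted values unchanged, so the coefficient is easier to read. The sketch below assumes the openintro package is installed. The one-predictor models `fit_dollars` and `fit_thousands` and the copy `loans` are illustrative names, not part of the original analysis.

```r
library(openintro)  # provides the loan50 data frame

# work on a copy and recreate the rescaled income (thousands of dollars)
loans <- loan50
loans$annual_income_th <- loans$annual_income / 1000

# same simple regression, with income in dollars vs. thousands of dollars
fit_dollars   <- lm(interest_rate ~ annual_income, data = loans)
fit_thousands <- lm(interest_rate ~ annual_income_th, data = loans)

# the slope per $1,000 is 1000 times the slope per $1
coef(fit_dollars)["annual_income"] * 1000
coef(fit_thousands)["annual_income_th"]

# the fitted values are the same under either scale
all.equal(fitted(fit_dollars), fitted(fit_thousands))
```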

Outcome vs. predictors

Goal: Use these predictors in a single model to understand variability in interest rate.

Why do we want to use a single model versus 3 separate simple linear regression models?

Multiple linear regression (MLR)

Based on the analysis goals, we will use a multiple linear regression model of the following form

$\begin{aligned} \text{interest\_rate} = β_{0} & + β_{1} \, \text{debt\_to\_income} \\ & + β_{2} \, \text{verified\_income} \\ & + β_{3} \, \text{annual\_income\_th} \\ & + ϵ, \quad ϵ \sim N(0, σ_{ϵ}^{2}) \end{aligned}$

Multiple linear regression

Recall: The simple linear regression model

$Y = β_{0} + β_{1} X + ϵ$

The form of the multiple linear regression model is

$Y = β_{0} + β_{1} X_{1} + \dots + β_{p} X_{p} + ϵ$

Therefore,

$E (Y | X_{1}, \dots, X_{p}) = β_{0} + β_{1} X_{1} + \dots + β_{p} X_{p}$

Fitting the least squares line

Similar to simple linear regression, we want to find estimates for $β_{0}, β_{1}, \dots, β_{p}$ that minimize

$\sum_{i = 1}^{n} ϵ_{i}^{2} = \sum_{i = 1}^{n} [y_{i} - {\hat{y}}_{i}]^{2} = \sum_{i = 1}^{n} [y_{i} - (β_{0} + β_{1} x_{i 1} + \dots + β_{p} x_{i p})]^{2}$

The calculations can be very tedious by hand, especially if $p$ is large.
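
To see what "minimize" means in practice, the sum of squared residuals can be minimized numerically and the result compared with `lm()`. This is a sketch on simulated data: the variables `x1`, `x2`, `y`, the true coefficients, and the seed are all made up for illustration. `lm()` itself does not use a general-purpose optimizer, so the code only shows that both approaches land on the same estimates.

```r
set.seed(221)  # arbitrary seed, for reproducibility

# simulated data: n = 100 observations, p = 2 predictors
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

# sum of squared residuals as a function of (beta0, beta1, beta2)
ssr <- function(beta) {
  y_hat <- beta[1] + beta[2] * x1 + beta[3] * x2
  sum((y - y_hat)^2)
}

# numerical minimization, starting from all zeros
numeric_fit <- optim(par = c(0, 0, 0), fn = ssr, method = "BFGS")

# least squares fit for comparison
ls_fit <- lm(y ~ x1 + x2)

# the two sets of estimates should agree up to numerical tolerance
round(numeric_fit$par, 3)
round(unname(coef(ls_fit)), 3)
```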

Matrix form of multiple linear regression

Suppose we have $n$ observations, a quantitative response variable, and $p > 1$ predictors. Then

$\underbrace{\begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} 1 & x_{11} & \dots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} β_{0} \\ β_{1} \\ \vdots \\ β_{p} \end{bmatrix}}_{β} + \underbrace{\begin{bmatrix} ϵ_{1} \\ \vdots \\ ϵ_{n} \end{bmatrix}}_{ϵ}$

What are the dimensions of $y$, $X$, $β$, and $ϵ$?

Matrix form of multiple linear regression

As with simple linear regression, we have

$y = X β + ϵ$

Generalizing the derivations from SLR to $p > 1$, we have

$\hat{β} = (X^{T} X)^{- 1} X^{T} y$

as before.
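
The matrix formula can be checked numerically. The sketch below uses simulated data (the variables `x1`, `x2`, `y`, the true coefficients, and the seed are assumptions for illustration only). It builds the design matrix by hand, solves the normal equations, and compares the result with `lm()`. `lm()` uses a QR decomposition internally rather than this formula, so the comparison is about the estimates, not the algorithm.

```r
set.seed(221)  # arbitrary seed, for reproducibility

# simulated data: n = 100 observations, p = 2 predictors
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

# design matrix: a column of 1s for the intercept, then the predictors
X <- cbind(1, x1, x2)
dim(X)  # n rows and p + 1 columns

# solve the normal equations (X'X) beta = X'y for beta
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# side-by-side comparison with lm()
cbind(matrix_formula = drop(beta_hat), lm = coef(lm(y ~ x1 + x2)))
```

Calling `solve(A, b)` avoids forming the inverse explicitly, which is the usual numerical practice even though the formula is written with $(X^{T} X)^{-1}$.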

Model fit in R

int_fit <- lm(interest_rate ~ debt_to_income + verified_income + annual_income_th,
              data = loan50)

tidy(int_fit) |>
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	10.726	1.507	7.116	0.000
debt_to_income	0.671	0.676	0.993	0.326
verified_incomeSource Verified	2.211	1.399	1.581	0.121
verified_incomeVerified	6.880	1.801	3.820	0.000
annual_income_th	-0.021	0.011	-1.804	0.078

Model equation

$\begin{aligned} \widehat{\text{interest\_rate}} = 10.726 & + 0.671 \times \text{debt\_to\_income} \\ & + 2.211 \times \text{source\_verified} \\ & + 6.880 \times \text{verified} \\ & - 0.021 \times \text{annual\_income\_th} \end{aligned}$

Note

We will talk about why there are only two terms in the model for verified_income soon!

Interpreting ${\hat{β}}_{j}$

The estimated coefficient ${\hat{β}}_{j}$ is the expected change in the mean of $Y$ when $X_{j}$ increases by one unit, holding the values of all other predictor variables constant.

Example: The estimated coefficient for debt_to_income is 0.671. This means that for each additional point in a borrower’s debt-to-income ratio, the interest rate on the loan is expected to be greater by 0.671 percentage points, on average, holding annual income and income verification constant.
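
The interpretation follows from comparing two fitted values that differ only in the $j$th predictor. As a short worked step in the notation of the model above, where $\hat{y}(x_{j})$ is shorthand (introduced here) for the fitted value when $X_{j} = x_{j}$ and all other predictors are fixed:

$\begin{aligned} \hat{y}(x_{j} + 1) - \hat{y}(x_{j}) & = [{\hat{β}}_{0} + \dots + {\hat{β}}_{j} (x_{j} + 1) + \dots + {\hat{β}}_{p} x_{p}] - [{\hat{β}}_{0} + \dots + {\hat{β}}_{j} x_{j} + \dots + {\hat{β}}_{p} x_{p}] \\ & = {\hat{β}}_{j} \end{aligned}$

Every term not involving $x_{j}$ cancels, which is why the other predictors must take the same values in both fitted values.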

Interpreting ${\hat{β}}_{j}$

The estimated coefficient for annual_income_th is -0.021. Interpret this coefficient in the context of the data.

Why do we need to include a statement about holding all other predictors constant?

Interpreting ${\hat{β}}_{0}$

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	10.726	1.507	7.116	0.000	7.690	13.762
debt_to_income	0.671	0.676	0.993	0.326	-0.690	2.033
verified_incomeSource Verified	2.211	1.399	1.581	0.121	-0.606	5.028
verified_incomeVerified	6.880	1.801	3.820	0.000	3.253	10.508
annual_income_th	-0.021	0.011	-1.804	0.078	-0.043	0.002
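
A table like the one above, with `conf.low` and `conf.high` columns, can be produced by asking `tidy()` for confidence intervals. The sketch below refits the model so that it runs on its own. It assumes the tidymodels, openintro, and knitr packages are installed, and `loans` is an illustrative copy of the data. The 95% level is an assumption, although the intervals shown are consistent with it.

```r
library(tidymodels)  # attaches broom, which provides tidy()
library(openintro)   # provides the loan50 data frame
library(knitr)       # provides kable()

# work on a copy and recreate the rescaled income (thousands of dollars)
loans <- loan50
loans$annual_income_th <- loans$annual_income / 1000

int_fit <- lm(interest_rate ~ debt_to_income + verified_income + annual_income_th,
              data = loans)

# conf.int = TRUE adds the conf.low and conf.high columns
tidy(int_fit, conf.int = TRUE, conf.level = 0.95) |>
  kable(digits = 3)
```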

Describe the subset of borrowers who are expected to get an interest rate of 10.726% based on our model. Is this interpretation meaningful? Why or why not?

Prediction

What is the predicted interest rate for a borrower with a debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000?

10.726 + 0.671 * 0.558 + 2.211 * 0 + 6.880 * 0 - 0.021 * 59

[1] 9.861418

The predicted interest rate for a borrower with a debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000 is 9.86%.

Prediction in R

Just like with simple linear regression, we can use the predict() function in R to calculate predicted values and, when needed, the appropriate intervals for them:

new_borrower <- tibble(
  debt_to_income  = 0.558, 
  verified_income = "Not Verified", 
  annual_income_th = 59
)

predict(int_fit, new_borrower)

       1 
9.890888

Note

The predicted value from predict() differs from the hand calculation (9.861) because the coefficients on the previous slide were rounded to three decimal places.
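
Both points can be checked in code: repeating the hand calculation with the unrounded coefficients should reproduce the `predict()` value, and the `interval` argument of `predict()` returns intervals alongside the point prediction. The sketch below refits the model so that it runs on its own. It assumes the openintro package is installed, `loans` and `x_new` are illustrative names, and the 95% level is an assumption.

```r
library(openintro)  # provides the loan50 data frame

# work on a copy and recreate the rescaled income (thousands of dollars)
loans <- loan50
loans$annual_income_th <- loans$annual_income / 1000

int_fit <- lm(interest_rate ~ debt_to_income + verified_income + annual_income_th,
              data = loans)

new_borrower <- data.frame(
  debt_to_income   = 0.558,
  verified_income  = "Not Verified",
  annual_income_th = 59
)

# hand calculation with the unrounded coefficients; order of x_new:
# intercept, debt_to_income, Source Verified, Verified, annual_income_th
x_new <- c(1, 0.558, 0, 0, 59)
sum(coef(int_fit) * x_new)

# point prediction with a 95% confidence interval for the mean response
predict(int_fit, new_borrower, interval = "confidence", level = 0.95)

# point prediction with a 95% prediction interval for an individual loan
predict(int_fit, new_borrower, interval = "prediction", level = 0.95)
```

The prediction interval is wider than the confidence interval because it also accounts for the variability of an individual loan around the mean.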

Cautions

- Do not extrapolate! Because there are multiple predictor variables, there is the potential to extrapolate in many directions.
- The multiple regression model only shows association, not causality.
  - To show causality, you must have a carefully designed experiment or carefully account for confounding variables in an observational study.
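
A first, partial check for extrapolation is to compare a new observation with the observed range of each quantitative predictor. The sketch below assumes the openintro package is installed. `in_range()` is a hypothetical helper written for this example, not a function from any package.

```r
library(openintro)  # provides the loan50 data frame

# observed range of each quantitative predictor
range(loan50$debt_to_income)
range(loan50$annual_income / 1000)  # income in thousands of dollars

# hypothetical helper: is a value inside the observed range?
in_range <- function(value, observed) {
  value >= min(observed) & value <= max(observed)
}

# the borrower from the prediction example
in_range(0.558, loan50$debt_to_income)
in_range(59, loan50$annual_income / 1000)
```

Passing these checks is necessary but not sufficient. A combination of values can lie far from the observed data even when each value is within its own range, which is one way of extrapolating "in many directions."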

Recap

- Showed exploratory data analysis for multiple linear regression
- Used least squares to fit the regression line
- Interpreted the coefficients for quantitative predictors
- Predicted the response for new observations
