AE 02: Multiple linear regression
Peer-to-peer lending
Go to the course GitHub organization and locate your ae-02 repo to get started.
Render, commit, and push your responses to GitHub by the end of class to submit your AE.
Packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(knitr)
Data
Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.
We will focus on the following variables:
- annual_income_th: Annual income (in $1000s)
- debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
- verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)
- interest_rate: Interest rate for the loan
The goal of this analysis is to use the annual income, debt-to-income ratio, and income verification to understand variability in the interest rate on the loan.
We’ll start with data prep to rescale annual income to $1000s and recode verified_income to fix an issue with the underlying data.
loan50 <- loan50 |>
  mutate(annual_income_th = annual_income / 1000,
         verified_income =
           case_when(verified_income == "Not Verified" ~ "Not Verified",
                     verified_income == "Source Verified" ~ "Source Verified",
                     verified_income == "Verified" ~ "Verified"),
         verified_income = as_factor(verified_income)
  )
glimpse(loan50)
Rows: 50
Columns: 19
$ state <fct> NJ, CA, SC, CA, OH, IN, NY, MO, FL, FL, MD, HI…
$ emp_length <dbl> 3, 10, NA, 0, 4, 6, 2, 10, 6, 3, 8, 10, 10, 2,…
$ term <dbl> 60, 36, 36, 36, 60, 36, 36, 36, 60, 60, 36, 36…
$ homeownership <fct> rent, rent, mortgage, rent, mortgage, mortgage…
$ annual_income <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
$ verified_income <fct> Not Verified, Not Verified, Verified, Not Veri…
$ debt_to_income <dbl> 0.55752542, 1.30568333, 1.05628000, 0.57434667…
$ total_credit_limit <int> 95131, 51929, 301373, 59890, 422619, 349825, 1…
$ total_credit_utilized <int> 32894, 78341, 79221, 43076, 60490, 72162, 2872…
$ num_cc_carrying_balance <int> 8, 2, 14, 10, 2, 4, 1, 3, 10, 4, 3, 4, 3, 2, 3…
$ loan_purpose <fct> debt_consolidation, credit_card, debt_consolid…
$ loan_amount <int> 22000, 6000, 25000, 6000, 25000, 6400, 3000, 1…
$ grade <fct> B, B, E, B, B, B, D, A, A, C, D, A, A, A, A, E…
$ interest_rate <dbl> 10.90, 9.92, 26.30, 9.92, 9.43, 9.92, 17.09, 6…
$ public_record_bankrupt <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ loan_status <fct> Current, Current, Current, Current, Current, C…
$ has_second_income <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ total_income <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
$ annual_income_th <dbl> 59.0, 60.0, 75.0, 75.0, 254.0, 67.0, 28.8, 80.…
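The text does not say what the issue with verified_income is. One way to see what the recode changes is to compare the factor levels before and after. This is a minimal sketch, assuming the openintro package is installed; `raw_levels`, `new_levels`, and `loan50_check` are names made up for this check.

```r
library(tidyverse)
library(openintro)

# Levels of verified_income in the raw data shipped with openintro
raw_levels <- levels(openintro::loan50$verified_income)

# Same recode as in the data prep step, applied to a fresh copy of the data
loan50_check <- openintro::loan50 |>
  mutate(
    verified_income = case_when(
      verified_income == "Not Verified" ~ "Not Verified",
      verified_income == "Source Verified" ~ "Source Verified",
      verified_income == "Verified" ~ "Verified"
    ),
    verified_income = as_factor(verified_income)
  )
new_levels <- levels(loan50_check$verified_income)

# Compare the two sets of levels, then count the loans at each level
raw_levels
new_levels
loan50_check |> count(verified_income)
```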
Part 1
Exercise 1
We’ll start by fitting a model in which we include all levels of verified_income.
- Fit a model using debt_to_income, annual_income_th, and the indicator variables created below to predict interest_rate.
- What do you notice about the model output? Why did this happen?
loan50 <- loan50 |>
  mutate(
    not_verified = factor(if_else(verified_income == "Not Verified", 1, 0)),
    source_verified = factor(if_else(verified_income == "Source Verified", 1, 0)),
    verified = factor(if_else(verified_income == "Verified", 1, 0))
  )
# add code here
Exercise 2
Now let’s take a look at the design matrix for the model with predictors debt_to_income, annual_income_th, and verified_income.
How does R choose the baseline level by default?
## add code here
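If you have not built a design matrix before, the sketch below shows one way to view one using base R's model.matrix(). It uses hypothetical toy data, not the loan data; `toy`, `x`, and `group` are made-up names.

```r
# Hypothetical toy data (not the loan data): one numeric and one categorical predictor
toy <- data.frame(
  x = c(1.5, 2.0, 3.5, 4.0),
  group = factor(c("b", "a", "c", "a"))
)

# Design matrix for a model with predictors x and group
X <- model.matrix(~ x + group, data = toy)
X

# Compare the indicator columns in X with the order of the factor levels
levels(toy$group)
```

Note that the toy factor is built with factor(), while the data prep above uses as_factor(), so check the levels of verified_income directly rather than assuming the same ordering.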
Exercise 3
What is the intercept for individuals with
- Not Verified income?
- Source Verified income?
- Verified income?
Part 2
Exercise 4
Fit the model with the predictors debt_to_income, annual_income_th, verified_income, and the interaction between annual_income_th and verified_income.
Neatly display the model results using 3 digits.
# add code here
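As a reminder of the syntax, here is a sketch of fitting a model with an interaction and displaying it with 3 digits. It uses the built-in mtcars data rather than the loan data, and `cars` and `am_f` are made-up names. It fits with lm(); the course may use a different fitting workflow, so treat this as one possible approach.

```r
library(broom)
library(knitr)

# Hypothetical example on mtcars (not the loan data):
# am_f is a factor version of the transmission column (0 = automatic, 1 = manual)
cars <- mtcars
cars$am_f <- factor(cars$am, labels = c("automatic", "manual"))

# wt * am_f expands to wt + am_f + wt:am_f (main effects plus their interaction)
fit <- lm(mpg ~ wt * am_f, data = cars)

# Tidy the coefficient table and display it rounded to 3 digits
fit_tbl <- tidy(fit)
kable(fit_tbl, digits = 3)
```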
Exercise 5
- Write the estimated regression equation for people with Not Verified income.
- Write the estimated regression equation for people with Verified income.
Exercise 6
In general, how do
- indicators for categorical predictors impact the model equation?
- interaction terms impact the model equation?
LaTeX
Sometimes, you will need to include mathematical notation in your document. There are two ways you can display mathematics in your document:
Inline: Your mathematics will display within the line of text.
- Use $ to start and end your LaTeX syntax. You can also use the menu: Insert -> LaTeX Math -> Inline Math.
- Example: The text
  The simple linear regression model is $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$
  produces
  The simple linear regression model is y = Xβ + ε
Displayed: Your mathematics will display outside the line of text.
- Use $$ to start and end your LaTeX syntax. You can also use the menu: Insert -> LaTeX Math -> Display Math.
- Example: The text
  The estimated regression equation is $$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$$
  produces
  The estimated regression equation is
  ŷ = Xβ̂
Click here for a quick reference of LaTeX code.
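The same display syntax works for an equation with several terms, such as the regression equations in the exercises. Below is a sketch with generic symbols; x_1, x_2, and the estimated coefficients are placeholders, not values from the loan data.

```latex
The estimated regression equation is
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$$
```

Subscripts are written with an underscore (`x_1`) and hats with `\hat{}`. To report a fitted model, replace each coefficient symbol with the estimate from your own output.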
Submission
To submit the AE:
- Render the document to produce the PDF with all of your work from today’s class.
- Push all your work to your AE repo on GitHub. You’re done! 🎉