library(tidyverse)
library(tidymodels)
library(knitr)
library(pROC)
# load other packages as needed
HW 04: Logistic regression
This assignment is due on Thursday, April 10 at 11:59pm.
Introduction
In this assignment you will work with logistic regression models and use them to understand multivariable relationships in a variety of data contexts.
Learning goals
In this assignment, you will…
Use logistic regression to explore the relationship between a binary response variable and multiple predictor variables
Conduct exploratory data analysis for logistic regression
Interpret the coefficients of a logistic regression model
Use statistics to help choose the best-fitting model
Assess the fit of a logistic regression model
Getting started
Go to the sta221-sp25 organization on GitHub. Click on the repo with the prefix hw-04. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo and starting a new project in R.
Packages
This assignment uses the tidyverse, tidymodels, knitr, and pROC packages.
Conceptual exercises
Instructions
The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.
You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.
Exercise 1
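Suppose you have the linear regression model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

such that $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I})$.

- Find $\tilde{\boldsymbol{\beta}}$, the MLE of $\boldsymbol{\beta}$.
- Show that the MLE is unbiased. (Note: You must show this directly and may not use the result from part (c).)
- Show mathematically how $\tilde{\boldsymbol{\beta}}$ relates to the least-squares estimator.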
Exercise 2
Suppose there are
such that
The model is reparameterized (redefined) as
such that
- Show that the MLE of is equal to the MLE of .
- Show that the MLE of is not equal to the MLE of .
You do not need to derive the MLEs for
You do need to show your work / explain your reasoning to get the MLEs for
Exercise 3¹
Berry (2001) examined the effect of a player’s draft position among the pool of potential players in a given year on the probability of eventually being named an all star.
Let
Using this model, show that the odds of being named an all star are . Then, show how to calculate based on this model.
Show that the odds of being named an all star for a first draft pick are .
In the study, Berry reported that for professional basketball and , and for professional baseball and . Explain why this suggests that (1) being a first draft pick is more crucial for being an all star in basketball than in baseball and (2) players picked in high draft positions are relatively less likely to be all stars.
Exercise 4
In the paper “Employing Standardized Risk Assessment in Pretrial Release Decisions: Association With Criminal Justice Outcomes and Racial Equity” Marlowe et al. (2020) analyze the risk predictions produced by a black-box algorithm used to determine whether a defendant is considered “high risk” of being rearrested if they are released while awaiting trial. Such algorithms are used by judges in some states to help determine whether or not defendants are released while awaiting trial.
The authors examine the algorithm’s risk predictions and whether a person was rearrested for over 500 defendants released pretrial in a southern state. For each person, the algorithm produced one of the following predictions: “High Risk” or “Low Risk”. The observed outcome was “Rearrested” (coded as 1) or “Not Rearrested” (coded as 0). Below are some results from the analysis:
- Sensitivity: 86%
- Specificity: 24%
- Positive predictive power (Precision): 57%
- Negative predictive power: 60%
Positive Predictive Power (Precision): P(Y = 1 | Y classified as 1 from the model)
Negative Predictive Power: P(Y = 0 | Y classified as 0 from the model)
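As a minimal sketch of how these four quantities come out of a 2x2 classification table, the R code below uses made-up counts. The counts `tp`, `fn`, `fp`, and `tn` are hypothetical and are not taken from Marlowe et al. (2020).

```r
# Hypothetical confusion-matrix counts (illustrative only, not from the paper)
tp <- 40  # classified "High Risk" and rearrested
fn <- 10  # classified "Low Risk" and rearrested
fp <- 30  # classified "High Risk" and not rearrested
tn <- 20  # classified "Low Risk" and not rearrested

sensitivity <- tp / (tp + fn)  # P(classified as 1 | Y = 1)
specificity <- tn / (tn + fp)  # P(classified as 0 | Y = 0)
ppv <- tp / (tp + fp)          # P(Y = 1 | classified as 1)
npv <- tn / (tn + fn)          # P(Y = 0 | classified as 0)

# The false positive rate is the complement of specificity
false_positive_rate <- fp / (fp + tn)

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, fpr = false_positive_rate), 3)
```

Sensitivity and specificity condition on the observed outcome, while the two predictive powers condition on the model's classification.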
Explain what each of the following mean in the context of the analysis:
Sensitivity
Specificity
Positive predictive power (Precision)
Negative predictive power
What is the false positive rate? What does this value mean in the context of the analysis?
Applied exercises
Instructions
The applied exercises are focused on applying the concepts to analyze data.
All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.
Write all narrative using complete sentences and include informative axis labels / titles on visualizations.
Data: Understanding pro-environmental behavior
Ibanez and Roussel (2022) conducted an experiment to understand the impact of watching a nature documentary on pro-environmental behavior. The researchers randomly assigned the 113 participants to watch a video about architecture in NYC (control) or a video about Yellowstone National Park (treatment). As part of the experiment, participants played a game in which they had an opportunity to donate to an environmental organization.
The data set is available in nature-experiment.csv in the data folder. We will use the following variables:
donation_binary:
- 1 - participant donated to environmental organization
- 0 - participant did not donate
age: Age in years
gender: Participant’s reported gender
- 1 - male
- 0 - non-male
treatment:
- “URBAN (T1)” - the control group
- “NATURE (T2)” - the treatment group
nep_high:
- 1 - score of 4 or higher on the New Ecological Paradigm (NEP)
- 0 - score less than 4
See the Introduction and Methods sections of Ibanez and Roussel (2022) for more detail about the variables.
Click here to access the paper online.
Exercise 5
Create a visualization of the relationship between donating and treatment. Use the visualization to describe the relationship between the two variables.
Create a visualization of the relationship between donating and age. Use the visualization to describe the relationship between the two variables.
We would like to use the mean-centered value of age in the model. Create a new variable age_cent that contains the mean-centered ages.
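Mean-centering can be sketched in a few lines of base R. The data frame `toy` and its ages are hypothetical stand-ins for the real data; the same subtraction could equally be written inside `mutate()`.

```r
# Hypothetical ages standing in for the age column of the real data
toy <- data.frame(age = c(19, 22, 25, 31, 43))

# Subtract the sample mean so that age_cent = 0 corresponds to the average age
toy$age_cent <- toy$age - mean(toy$age)

toy
mean(toy$age_cent)  # zero, up to floating-point error
```

After centering, the model intercept refers to a participant of average age rather than a participant of age 0.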
Exercise 6
Fit a logistic regression model using age_cent, gender, treatment, and nep_high to predict the likelihood of donating. Neatly display the model using 3 digits.
The researchers are most interested in the effect of watching the nature documentary. Describe the effect of treatment in terms of the odds of donating.
What group of participants is described by the intercept? What is the predicted probability a randomly selected individual in this group donates?
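The mechanics of moving between log-odds, odds, and probabilities can be sketched on simulated data. Everything in `sim` below is made up (only two predictors are used, and the true coefficients are assumptions chosen for illustration), and base R's `glm()` is used here as one possible way to fit the model.

```r
set.seed(221)

# Simulated stand-in data; names mirror the assignment but values are made up
n <- 200
sim <- data.frame(
  age_cent  = rnorm(n, mean = 0, sd = 5),
  treatment = factor(sample(c("URBAN (T1)", "NATURE (T2)"), n, replace = TRUE),
                     levels = c("URBAN (T1)", "NATURE (T2)"))
)

# Assumed true log-odds, chosen only for illustration
log_odds <- -0.5 + 0.03 * sim$age_cent + 0.8 * (sim$treatment == "NATURE (T2)")
sim$donation_binary <- rbinom(n, size = 1, prob = plogis(log_odds))

fit <- glm(donation_binary ~ age_cent + treatment, data = sim, family = binomial)

# Coefficients are on the log-odds scale
round(coef(fit), 3)

# Exponentiating gives multiplicative changes in the odds (odds ratios)
odds_ratios <- exp(coef(fit))
round(odds_ratios, 3)

# The intercept is the log-odds for the baseline group (numeric predictors at 0,
# factors at their reference level); plogis() converts log-odds to a probability
baseline_prob <- plogis(unname(coef(fit)["(Intercept)"]))
round(baseline_prob, 3)
```

Setting the factor levels explicitly makes the control group the reference level, so the treatment coefficient compares the treatment group to the control group.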
Exercise 7
Produce the ROC curve for the model from the previous exercise and calculate the area under the curve (AUC). Write 1 - 2 sentences describing how well the model fits the data.
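To show what an ROC curve and its AUC measure, here is a hand-rolled sketch in base R on simulated data. The variables `x` and `y` are hypothetical, and in practice the `roc()` and `auc()` functions in pROC would be the more usual route; this version only makes the underlying calculation explicit.

```r
set.seed(221)

# Simulated predictor and binary outcome (hypothetical values)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.2 + 1.2 * x))

fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)  # predicted probabilities

# For each threshold, classify as 1 when the predicted probability >= threshold
thresholds <- c(Inf, sort(unique(p_hat), decreasing = TRUE))
tpr <- sapply(thresholds, function(thr) mean(p_hat[y == 1] >= thr))  # sensitivity
fpr <- sapply(thresholds, function(thr) mean(p_hat[y == 0] >= thr))  # 1 - specificity
roc_points <- data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)

# AUC as the area under the (fpr, tpr) curve, using the trapezoid rule
auc_trap <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Equivalent rank-based (Mann-Whitney) form of the AUC
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc_rank <- (sum(rank(p_hat)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

round(c(trapezoid = auc_trap, rank = auc_rank), 3)
```

Plotting `roc_points$tpr` against `roc_points$fpr` as a line gives the ROC curve. An AUC near 0.5 indicates the model discriminates no better than chance, and an AUC near 1 indicates near-perfect discrimination.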
Exercise 8
The authors include an interaction effect between nep_high and treatment in one of their models.
- Explain what an interaction between nep_high and treatment means in the context of the data.
- Create a visualization to explore the potential of an interaction effect between these two variables. Based on the visualization, do you think there is an interaction effect? Briefly explain.
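One way to look for an interaction before plotting is to compare the treatment effect within each level of the other variable. The sketch below does this on simulated data; the data frame `sim` and the coefficients used to generate it are assumptions, not values from the study.

```r
set.seed(221)

# Simulated stand-in data with an interaction built in (values are made up)
n <- 400
sim <- data.frame(
  nep_high  = rbinom(n, size = 1, prob = 0.5),
  treatment = sample(c("URBAN (T1)", "NATURE (T2)"), n, replace = TRUE)
)
nature <- as.numeric(sim$treatment == "NATURE (T2)")
log_odds <- -0.5 + 0.2 * sim$nep_high + 0.3 * nature + 0.9 * sim$nep_high * nature
sim$donation_binary <- rbinom(n, size = 1, prob = plogis(log_odds))

# Proportion donating in each nep_high x treatment cell
prop_table <- with(sim, tapply(donation_binary,
                               list(nep_high = nep_high, treatment = treatment),
                               mean))
round(prop_table, 3)

# Treatment effect (difference in proportions) within each nep_high group;
# clearly different effects across the rows would suggest an interaction
effect_by_nep <- prop_table[, "NATURE (T2)"] - prop_table[, "URBAN (T1)"]
round(effect_by_nep, 3)
```

The same cell proportions could be drawn as a grouped bar chart or as lines; roughly parallel lines would suggest little interaction.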
Exercise 9
Conduct a drop-in-deviance test to determine if the interaction between nep_high and treatment should be added to the model fit in Exercise 6. Include the hypotheses in mathematical notation, the output from the test, and the conclusion in the context of the data.
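A drop-in-deviance test compares the residual deviance of two nested models. The sketch below shows the mechanics on simulated data; the data frame `sim`, the generating coefficients, and the two-predictor models are all assumptions made for illustration. The null hypothesis is that the coefficient on the interaction term is zero.

```r
set.seed(221)

# Simulated stand-in data (values are made up; no true interaction is built in)
n <- 300
sim <- data.frame(
  nep_high  = rbinom(n, size = 1, prob = 0.5),
  treatment = factor(sample(c("URBAN (T1)", "NATURE (T2)"), n, replace = TRUE),
                     levels = c("URBAN (T1)", "NATURE (T2)"))
)
log_odds <- -0.4 + 0.5 * sim$nep_high + 0.6 * (sim$treatment == "NATURE (T2)")
sim$donation_binary <- rbinom(n, size = 1, prob = plogis(log_odds))

# Reduced model (main effects only) and full model (adds the interaction)
reduced <- glm(donation_binary ~ nep_high + treatment,
               data = sim, family = binomial)
full <- glm(donation_binary ~ nep_high * treatment,
            data = sim, family = binomial)

# Test statistic: drop in residual deviance from the reduced to the full model
g_stat <- deviance(reduced) - deviance(full)
df_diff <- df.residual(reduced) - df.residual(full)  # number of added terms
p_value <- pchisq(g_stat, df = df_diff, lower.tail = FALSE)
round(c(statistic = g_stat, df = df_diff, p_value = p_value), 3)

# anova() with test = "Chisq" reports the same comparison
anova(reduced, full, test = "Chisq")
```

A small p-value would be evidence that the interaction term improves the fit; a large one gives no evidence that the extra term is needed.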
Submission
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
If you write your responses to conceptual exercises by hand, you will need to combine your written work with the completed PDF for the applied exercises before submitting on Gradescope.
Instructions to combine PDFs:
Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): help.adobe.com/acrobat/using/merging-files-single-pdf.html
- Get free access to Adobe Acrobat as a Duke student: oit.duke.edu/help/articles/kb0030141/
To submit your assignment:
Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.
Grading
| Component | Points |
|---|---|
| Ex 1 | 8 |
| Ex 2 | 6 |
| Ex 3 | 6 |
| Ex 4 | 4 |
| Ex 5 | 5 |
| Ex 6 | 5 |
| Ex 7 | 4 |
| Ex 8 | 5 |
| Ex 9 | 4 |
| Workflow & formatting | 3 |
The “Workflow & formatting” grade is to assess the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date updated in the YAML.
References
Footnotes
1. Exercise adapted from an exercise in Categorical Data Analysis by Agresti.