HW 04: Logistic regression

Due date

This assignment is due on Thursday, April 10 at 11:59pm.

Introduction

In this assignment you will work with logistic regression models and use them to understand multivariable relationships in a variety of data contexts.

Learning goals

In this assignment, you will…

  • Use logistic regression to explore the relationship between a binary response variable and multiple predictor variables

  • Conduct exploratory data analysis for logistic regression

  • Interpret coefficients of logistic regression model

  • Use statistics to help choose the best fit model

  • Assess the fit of a logistic regression model

Getting started

  • Go to the sta221-sp25 organization on GitHub. Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the lab.

  • Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo and starting a new project in R.

Packages

The following packages are used in this assignment:

library(tidyverse)
library(tidymodels)
library(knitr)
library(pROC)

# load other packages as needed

Conceptual exercises

Instructions

The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.

You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.

Exercise 1

Suppose you have the linear regression model

yi=βxi+ϵiϵiN(0,σϵ2)

such that ϵi are i.i.d. and there is no intercept.

  1. Find β~ , the MLE of β.
  2. Show that the MLE is unbiased. (Note: You must show this directly and may not use the result from part(c).)
  3. Show mathematically how β~ relates to the least-squares estimator.

Exercise 2

Suppose there are n observations, such that each yi is generated from xi based on the linear model

yi=β0+β1xi+ϵi,ϵiN(0,σϵ2)

such that ϵi are i.i.d.

The model is reparameterized (redefined) as

yi=β0+β1(xix¯)+ϵi

such that ϵi follows the same distribution as the original model.

  1. Show that the MLE of β1 is equal to the MLE of β1.
  2. Show that the MLE of β0 is not equal to the MLE of β0.
Tip

You do not need to derive the MLEs for β0 and β1. You may use the results from the notes.

You do need to show your work / explain your reasoning to get the MLEs for β0 and β1 .

Exercise 3

Berry () examined the effect of a player’s draft position among the pool of potential players in a given year to the probability on eventually being named an all star.

Let d be the draft position (d=1,2,3,) and π be the probability of eventually being named an all star. The researcher modeled the relationship between d and π using the following model:

log(πi1πi)=β0+β1logdi

  1. Using this model, show that the odds of being named an all star are eβ0dβ1 . Then, show how to calculate πi based on this model.

  2. Show that the odds of being named an all star for a first draft pick are eβ0 .

  3. In the study, Berry reported that for professional basketball β^0=2.3 and β^1=1.1, and for professional baseball β^0=0.7 and β^1=0.6 . Explain why this suggests that (1) being a first draft pick is more crucial for being an all star in basketball than in baseball and (2) players picked in high draft positions are relatively less likely to be all stars.

Exercise 4

In the paper Employing Standardized Risk Assessment in Pretrial Release Decisions: Association With Criminal Justice Outcomes and Racial Equity” Marlowe et al. () analyze the risk predictions produced by a black-box algorithm used to determine whether a defendant is considered “high risk” of being rearrested if they are released while awaiting trial. Such algorithms are used by judges in some states to help determine whether or not defendants are released while awaiting trial.

The authors examine the algorithm’s risk predictions and whether a person was rearrested for over 500 defendants released pretrial in a southern state. For each person, the algorithm produced one of the following predictions: “High Risk” or “Low Risk”. The observed outcome was “Rearrested” (coded as 1) or “Not Rearrested” (coded as 0). Below are some results from the analysis:

  • Sensitivity: 86%
  • Specificity: 24%
  • Positive predictive power (Precision): 57%
  • Negative predictive power: 60%
Tip
  • Positive Predictive Power (Precision): P(Y = 1 | Y classified as 1 from the model)

  • Negative Predictive Power: P(Y = 0 | Y classified as 0 from the model)

  1. Explain what each of the following mean in the context of the analysis:

    • Sensitivity

    • Specificity

    • Positive predictive power (Precision)

    • Negative predictive power

  2. What is the false positive rate? What does this value mean in the context of the analysis?

Applied exercises

Instructions

The applied exercises are focused on applying the concepts to analyze data.

All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.

Write all narrative using complete sentences and include informative axis labels / titles on visualizations.

Data: Understanding pro-environmental behavior

Ibanez and Roussel () conducted an experiment to understand the impact of watching a nature documentary on pro-environmental behavior. The researchers randomly assigned the 113 participants to watch an video about architecture in NYC (control) or a video about Yellowstone National Park (treatment). As part of the experiment, participants played a game in which they had an opportunity to donate to an environmental organization.

The data set is available in nature-experiment.csv in the data folder. We will use the following variables:

  • donation_binary:

    • 1 - participant donated to environmental organization
    • 0 - participant did not donate
  • age: Age in years

  • gender: Participant’s reported gender

    • 1 - male

    • 0 - non-male

  • treatment:

    • “URBAN (T1)” - the control group
    • “NATURE (T2)” - the treatment group
  • nep_high:

    • 1 - score of 4 or higher on the New Ecological Paradigm (NEP)
    • 0 - score less than 4
Tip

See the Introduction and Methods sections of Ibanez and Roussel () for more detail about the variables.

Click here to access the paper online.

Exercise 5

  1. Create a visualization of the relationship between donating and treatment. Use the visualization to describe the relationship between the two variables.

  2. Create a visualization of the relationship between donating and age. Use the visualization to describe the relationship between the two variables.

  3. We would like to use the mean-centered value of age in the model. Create a new variable age_cent that contains the mean-centered ages.

Exercise 6

  1. Fit a logistic regression model using age_cent, gender, treatment, and nep_high to predict the likelihood of donating. Neatly display the model using 3 digits.

  2. The researchers are most interested in the effect of watching the nature documentary. Describe the effect of treatment in terms of the odds of donating.

  3. What group of participants is described by the intercept? What is the predicted probability a randomly selected individual in this group donates?

Exercise 7

Produce the ROC curve for the model from the previous exercise and calculate the area under curve (AUC). Write 1 - 2 sentences describing how well the model fits the data.

Exercise 8

The authors include an interaction effect between nep_high and treatment in one of their models.

  1. Explain what an interaction between nep_high and treatment means in the context of the data.
  2. Create a visualization to explore the potential of an interaction effect between these two variables. Based on the visualization, do you think there is an interaction effect? Briefly explain.

Exercise 9

Conduct a drop-in-deviance test to determine if the interaction between nep_high and treatment should be added to the model fit in Exercise 7. Include the hypotheses in mathematical notation, the output from the test, and the conclusion in the context of the data.

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

If you write your responses to conceptual exercises by hand, you will need to combine your written work to the completed PDF for the applied exercises before submitting on Gradescope.

Instructions to combine PDFs:

To submit your assignment:

  • Access Gradescope through the menu on the STA 221 Canvas site.

  • Click on the assignment, and you’ll be prompted to submit it.

  • Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).

  • Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.

Grading

Component Points
Ex 1 8
Ex 2 6
Ex 3 6
Ex 4 4
Ex 5 5
Ex 6 5
Ex 7 4
Ex 8 5
Ex 9 4
Workflow & formatting 3

The “Workflow & formatting” grade is to assess the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date updated in the YAML.

References

Berry, Scott M. 2001. “A Statistician Reads the Sports Pages: Luck in Sports.” Chance 14 (1): 52–57.
Ibanez, Lisette, and Sébastien Roussel. 2022. “The Impact of Nature Video Exposure on Pro-Environmental Behavior: An Experimental Investigation.” Plos One 17 (11): e0275806.
Marlowe, Douglas B, Timothy Ho, Shannon M Carey, and Carly D Chadick. 2020. “Employing Standardized Risk Assessment in Pretrial Release Decisions: Association with Criminal Justice Outcomes and Racial Equity.” Law and Human Behavior 44 (5): 361.

Footnotes

  1. Exercise adapted froman exercise in Categorical Data Analysis by Agresti.↩︎