HW 03: Conditions and variable transformations

Due date

This assignment is due on Thursday, March 20 at 11:59pm.

Introduction

In this assignment you will use linear regression to explore the relationship between multiple variables. You will also examine model diagnostics and variable transformations.

Learning goals

In this assignment, you will…

use model diagnostics to identify influential points.
examine multicollinearity and consider strategies to handle it.
fit and interpret models with transformed variables.

Getting started

Go to the sta221-sp25 organization on GitHub. Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo and starting a new project in R.

Packages

The following packages are used in this assignment:

library(tidyverse)
library(tidymodels)
library(knitr)
library(rms)

# load other packages as needed

Conceptual exercises¹

Instructions

The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.

You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.

Exercise 1

Suppose we have a model of the form

$\log(y_{i}) = β_{0} + β_{1} \log(x_{i}) + ϵ_{i}, \quad ϵ_{i} \sim N(0, σ_{ϵ}^{2})$

Describe the expected change in $y_{i}$ when $x_{i}$ is multiplied by a constant $C$ . Show the work used to obtain the expected change.

Exercise 2

Suppose we have a model of the form

$y_{i} = β_{0} + β_{1} x_{i} + ϵ_{i}, ϵ_{i} \sim N (0, σ_{ϵ}^{2} x_{i}^{2})$

This model violates which model assumption? Briefly explain why.
Suppose you refit the model with the transformation on $y$ , $y^{'} = y / x$ . Show that this is a variance-stabilizing transformation, i.e., that the variance of the response does not depend on $x$ .
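
A simulation can serve as a numerical sanity check alongside the algebra, though it does not replace the derivation. The sketch below uses simulated data only; the coefficient values, sample size, and variable names are illustrative assumptions. It generates errors whose standard deviation is proportional to $x$ and compares the spread of the errors before and after dividing by $x$.

```r
# Simulated data only; all parameter values here are illustrative assumptions.
set.seed(221)
n <- 5000
x <- runif(n, min = 1, max = 10)
sigma <- 0.5

# Error standard deviation is proportional to x, so Var(eps_i) = sigma^2 * x_i^2
eps <- rnorm(n, mean = 0, sd = sigma * x)
y <- 2 + 3 * x + eps

# Errors on the original scale and after dividing the response by x
error_original <- y - (2 + 3 * x)
error_transformed <- error_original / x

# Compare the spread of the errors for small versus large x
small_x <- x < 5.5
sd_original <- c(small = sd(error_original[small_x]),
                 large = sd(error_original[!small_x]))
sd_transformed <- c(small = sd(error_transformed[small_x]),
                    large = sd(error_transformed[!small_x]))

print(round(sd_original, 3))
print(round(sd_transformed, 3))
```

In this simulation the spread on the original scale grows with $x$, while the spread after the transformation is about the same in both halves of the data.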

Exercise 3

For each of the following regression models, state whether it can be expressed in the form of a linear model by applying a suitable transformation to both sides of the equation. If so, write the equation for the transformed model.

$y_{i} = \log (β_{1} x_{i 1}) + β_{2} x_{i 2} + ϵ_{i}$
$y_{i} = [1 + e^{(β_{0} + β_{1} x_{i 1} + ϵ_{i})}]^{- 1}$

Applied exercises

Instructions

The applied exercises are focused on applying the concepts to analyze data.

All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.

Write all narrative using complete sentences and include informative axis labels / titles on visualizations.

Data: Age of abalones

The data for this analysis contains measurements for abalones, a type of marine snail. These measurements were collected and analyzed by researchers in Warwick et al. (1994).

The 4177 abalones in this study can be reasonably treated as a random sample.

The data are available in the file abalone.csv in the data folder. This analysis will focus on the following variables:

Sex: Male (M), Female (F), Infant (I)
Length: Longest shell measurement (in millimeters)
Diameter: Measured perpendicular to length (in millimeters)
Height: Measured with meat in shell (in millimeters)
Whole_Weight: Total weight of abalone (in grams)
Age: Age (in years)

The goal of the analysis is to use a variety of measurements from abalones to explain variability in the age.

Exercise 4

Fit a model using Sex, Length, Diameter, Height and Whole_Weight to understand variability in Age. Neatly display the model using 3 digits.
Check the four model conditions - Linearity, Constant Variance, Normality, and Independence. For each condition: (1) state whether or not it is satisfied; (2) explain your response showing any visualizations and/or statistics used to make your assessment.
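
As a rough illustration of the workflow, the sketch below fits a model with one quantitative and one categorical predictor, prints the coefficients to 3 digits, and draws two common diagnostic plots. It uses simulated data and base R only; the data frame `sim` and its columns are hypothetical and are not the abalone data. Independence is usually judged from how the data were collected, so it has no plot here.

```r
# Simulated data; variable names and coefficients are illustrative assumptions.
set.seed(221)
n <- 200
sim <- data.frame(
  x1 = runif(n, 0, 10),
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
sim$y <- 1 + 0.8 * sim$x1 + 2 * (sim$group == "B") + rnorm(n, sd = 1)

fit <- lm(y ~ x1 + group, data = sim)

# Coefficient table rounded to 3 digits
coef_table <- round(coef(summary(fit)), 3)
print(coef_table)

# Residuals versus fitted values: used to assess linearity and constant variance
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# Normal quantile-quantile plot of the residuals: used to assess normality
qqnorm(resid(fit), main = "Normal Q-Q plot of residuals")
qqline(resid(fit))
```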

Exercise 5

Now let’s take a look at the model diagnostics.

Are there any influential observations in the data set? Briefly explain, showing any work or output used to make the determination.
Consider the observation with the highest value for Cook’s distance. What is the value of leverage for this observation? Does this observation have large leverage? Briefly explain, showing any work or output used to make the determination.
Again consider the observation with the highest value for Cook’s distance. What is the standardized residual for this observation? Is this observation an outlier? Briefly explain showing any work or output used to make the determination.
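
The three diagnostics named above can all be extracted from a fitted `lm` object in base R. The sketch below uses simulated data with one deliberately unusual point added; the data and the leverage cutoff $2(p + 1)/n$ are assumptions made for illustration, and the thresholds discussed in the course may differ.

```r
# Simulated data; the added point and all values are illustrative assumptions.
set.seed(221)
n <- 50
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1)

# Add one hypothetical point far from the other x values and off the trend
x <- c(x, 25)
y <- c(y, 20)

fit <- lm(y ~ x)
p <- length(coef(fit)) - 1   # number of predictors
n_obs <- length(y)

diagnostics <- data.frame(
  obs = seq_len(n_obs),
  cooks_d = cooks.distance(fit),   # Cook's distance
  leverage = hatvalues(fit),       # diagonal of the hat matrix
  std_resid = rstandard(fit)       # standardized residuals
)

# Observation with the largest Cook's distance
worst <- diagnostics[which.max(diagnostics$cooks_d), ]
leverage_threshold <- 2 * (p + 1) / n_obs   # rule-of-thumb cutoff (an assumption)

print(worst)
print(leverage_threshold)
```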

Exercise 6

Now let’s look at the relationship between predictors.

Compute the Variance Inflation Factors (VIF) for the model from Exercise 4. Display the results.
Use the equation for VIF to “manually” compute the VIF for Whole_Weight.
What predictors appear to be collinear?
Select a strategy to fit a model that does not have an issue with multicollinearity.
- Briefly describe your strategy.
- Select a final model.
- Briefly explain your selection, showing the work and statistics used to choose a final model.
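
For reference, the VIF for predictor $j$ is $1 / (1 - R_{j}^{2})$, where $R_{j}^{2}$ is the $R^{2}$ from regressing predictor $j$ on the remaining predictors. The sketch below applies that definition to simulated data in which two predictors are correlated by construction; the helper `manual_vif` and the variables `x1`, `x2`, `x3` are hypothetical names introduced for illustration.

```r
# Simulated data; x2 is built to be strongly correlated with x1.
set.seed(221)
n <- 300
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)                        # unrelated to x1 and x2
y <- 1 + x1 + x2 + x3 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors
manual_vif <- function(predictor, data, predictors) {
  others <- setdiff(predictors, predictor)
  aux_fit <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(aux_fit)$r.squared)
}

predictors <- c("x1", "x2", "x3")
vifs <- sapply(predictors, manual_vif, data = sim, predictors = predictors)
print(round(vifs, 3))
```

In this simulation the two correlated predictors have large VIFs and the unrelated predictor has a VIF close to 1. The hand-computed values can be compared with the output of a VIF function, such as `vif()` from the rms package loaded above.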

Data: 2000 U.S. Presidential Election²

We will examine data about the 2000 U.S. presidential election between George W. Bush and Al Gore. It was one of the closest elections in history and ultimately came down to the state of Florida. One county in particular, Palm Beach County, was at the center of the controversy due to the design of its ballots - the infamous butterfly ballots. It is believed that many people who intended to vote for Al Gore accidentally voted for Pat Buchanan due to how the spots to mark the candidate were arranged next to the names.

The variables in the data are

County: County name
Bush2000: Number of votes for George W. Bush
Buchanan2000: Number of votes for Pat Buchanan

The data are available in the file florida-votes-2000.csv in the data folder of your repo.

Exercise 7

The goal is to fit a model that uses the number of votes for Bush to predict the number of votes for Buchanan. Using this model, we’ll investigate whether the data support the claim that votes for Gore may have accidentally gone to Buchanan.

Visualize the relationship between the number of votes for Buchanan versus the number of votes for Bush. Describe what you observe in the visualization, including a description of the relationship between the votes for Buchanan and votes for Bush.
Which county has the extreme outlying number of votes for Buchanan? Create a new data frame that doesn’t include the outlying county. You will use this updated data frame for the remainder of this exercise and Exercise 8.

Exercise 8

Now let’s consider potential models with transformations on the response and/or predictor variables. The four candidate models are the following:

Model	Response variable	Predictor variable
1	Buchanan2000	Bush2000
2	log(Buchanan2000)	Bush2000
3	Buchanan2000	log(Bush2000)
4	log(Buchanan2000)	log(Bush2000)

Which model best fits the data? Briefly explain, showing any work and output used to determine the response. (Note: Use the data set without the outlying county to find the candidate models.)
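
One way to organize this kind of comparison is to fit all four forms and look at a residual plot for each. The sketch below does so with simulated, right-skewed data rather than the election data; the variables `x` and `y` and the data-generating values are assumptions. Summary statistics such as $R^{2}$ are not directly comparable between models whose responses are on different scales, so the plots carry most of the information here.

```r
# Simulated positive, right-skewed data; values are illustrative assumptions.
set.seed(221)
n <- 60
x <- exp(rnorm(n, mean = 9, sd = 1.2))
y <- exp(-2 + 0.7 * log(x) + rnorm(n, sd = 0.4))   # generated on the log-log scale
sim <- data.frame(x, y)

candidate_fits <- list(
  "y ~ x"           = lm(y ~ x, data = sim),
  "log(y) ~ x"      = lm(log(y) ~ x, data = sim),
  "y ~ log(x)"      = lm(y ~ log(x), data = sim),
  "log(y) ~ log(x)" = lm(log(y) ~ log(x), data = sim)
)

# One residuals-versus-fitted plot per candidate model, in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
for (name in names(candidate_fits)) {
  fit <- candidate_fits[[name]]
  plot(fitted(fit), resid(fit), main = name,
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
}
par(old_par)
```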

Exercise 9

Now we will use the model to predict the expected number of Buchanan votes for the outlier county.

Suppose the observed value of the predictor for this county (a new observation) is $x_{0}$ . We define $x_{0}^{T} = [1, x_{0}]$ .

Then the predicted response is

${\hat{y}}_{0} = x_{0}^{T} \hat{β}$

where $\hat{β}$ is the vector of estimated model coefficients.

Just as there is uncertainty in our model coefficients, there is uncertainty in our predictions as well. We use a confidence interval to quantify the uncertainty for a model coefficient, and we can use a prediction interval to quantify the uncertainty in the prediction for a new observation.

The $C\%$ prediction interval for the new observation is

${\hat{y}}_{0} \pm t_{n - p - 1}^{*} \sqrt{{\hat{σ}}_{ϵ}^{2} (1 + x_{0}^{T} (X^{T} X)^{- 1} x_{0})}$

where $t_{n - p - 1}^{*}$ is the critical value obtained from the $t$ distribution with $n - p - 1$ degrees of freedom, $X$ is the design matrix for the model, and ${\hat{σ}}_{ϵ}^{2}$ is the estimated variability about the regression line.
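
To make the pieces of the formula concrete, the sketch below evaluates it with matrix operations on simulated data. The data, the new value $x = 12$, and all object names are hypothetical. The call to `predict()` at the end is included only to confirm that the matrix computation reproduces R's built-in interval.

```r
# Simulated data; the new observation x = 12 is an illustrative assumption.
set.seed(221)
n <- 40
x <- runif(n, 0, 10)
y <- 5 + 1.5 * x + rnorm(n, sd = 2)
fit <- lm(y ~ x)

X <- model.matrix(fit)               # design matrix, including the intercept column
p <- ncol(X) - 1                     # number of predictors
x0 <- c(1, 12)                       # new observation, with a leading 1 for the intercept
beta_hat <- coef(fit)
sigma2_hat <- sum(resid(fit)^2) / (n - p - 1)

y0_hat <- sum(x0 * beta_hat)         # predicted response, x0^T beta_hat
t_star <- qt(0.975, df = n - p - 1)  # critical value for a 95% interval
se_pred <- sqrt(sigma2_hat * (1 + drop(t(x0) %*% solve(t(X) %*% X) %*% x0)))

manual_pi <- c(fit = y0_hat,
               lwr = y0_hat - t_star * se_pred,
               upr = y0_hat + t_star * se_pred)

# Cross-check against R's built-in prediction interval
builtin_pi <- predict(fit, newdata = data.frame(x = 12),
                      interval = "prediction", level = 0.95)
print(round(manual_pi, 3))
print(round(builtin_pi, 3))
```

If the response in the fitted model is on the log scale, the same computation gives an interval for $\log(y)$, and applying `exp()` to the prediction and both endpoints returns it to the original units.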

Use the model you chose in the previous exercise to compute the predicted number of votes for Buchanan in the outlying county identified in Exercise 7. If you selected a model with a transformation, be sure to report your answer in terms of votes, not log(votes).
Use the formula above to “manually” compute the 95% prediction interval for this county (do not obtain the interval using the predict function). If you selected a model with a transformation, be sure to report your answer in terms of votes, not log(votes).
It is assumed that some of the votes for Buchanan in that county were actually intended to be for Gore. Based on your results in the previous question, does your model support this claim?
- If no, briefly explain.
- If yes, about how many votes were possibly intended for Gore? Show any calculations and output used to determine your answer. If you selected a model with a transformation, be sure to report your answer in terms of votes, not log(votes).

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

If you write your responses to conceptual exercises by hand, you will need to combine your written work with the completed PDF for the applied exercises before submitting on Gradescope.

Instructions to combine PDFs:

Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): helpx.adobe.com/acrobat/using/merging-files-single-pdf.html
- Get free access to Adobe Acrobat as a Duke student: oit.duke.edu/help/articles/kb0030141/

To submit your assignment:

Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.

Grading

Component	Points
Ex 1	4
Ex 2	4
Ex 3	4
Ex 4	6
Ex 5	6
Ex 6	6
Ex 7	6
Ex 8	5
Ex 9	6
Workflow & formatting	3

The “Workflow & formatting” grade assesses the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code, and your name and the date updated in the YAML.

References

Montgomery, Douglas C, Elizabeth A Peck, and G Geoffrey Vining. 2021. Introduction to Linear Regression Analysis. John Wiley & Sons.

Warwick, JN, TL Sellers, SR Talbot, AJ Cawthorn, and WB Ford. 1994. “The Population Biology of Abalone (Haliotis Species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait.” Sea Fisheries Division, Technical Report, no. 48.

Footnotes

¹ Exercise 2 is adapted from Montgomery, Peck, and Vining (2021).
² This analysis was motivated by exercises in (ledolter2003statistical?).
