HW 03: Conditions and variable transformations
This assignment is due on Thursday, March 20 at 11:59pm.
Introduction
In this assignment you will use linear regression to explore the relationship between multiple variables. You will also examine model diagnostics and variable transformations.
Learning goals
In this assignment, you will…
- use model diagnostics to identify influential points.
- examine multicollinearity and consider strategies to handle it.
- fit and interpret models with transformed variables.
Getting started
Go to the sta221-sp25 organization on GitHub. Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo and starting a new project in R.
Packages
The following packages are used in this assignment:
library(tidyverse)
library(tidymodels)
library(knitr)
library(rms)
# load other packages as needed
Conceptual exercises [1]
Instructions
The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.
You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.
Exercise 1
Suppose we have a model of the form
Describe the expected change in
Exercise 2
Suppose we have a model of the form
- This model violates which model assumption? Briefly explain why.
- Suppose you refit the model with the transformation on
, . Show that this is a variance-stabilizing transformation, i.e., that the variance of the response does not depend on .
Exercise 3
For each of the following regression models, state whether it can be expressed in the form of a linear model by applying a suitable transformation to both sides of the equation. If so, write the equation for the transformed model.
Applied exercises
Instructions
The applied exercises are focused on applying the concepts to analyze data.
All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.
Write all narrative using complete sentences and include informative axis labels / titles on visualizations.
Data: Age of abalones
The data for this analysis contain measurements for abalones, a type of marine snail. These measurements were collected and analyzed by researchers in Warwick et al. (1994).
The 4177 abalones in this study can be reasonably treated as a random sample.
The data are available in the file abalone.csv in the data folder. This analysis will focus on the following variables:
- Sex: Male (M), Female (F), Infant (I)
- Length: Longest shell measurement (in millimeters)
- Diameter: Measured perpendicular to length (in millimeters)
- Height: Measured with meat in shell (in millimeters)
- Whole_Weight: Total weight of abalone (in grams)
- Age: Age (in years)
The goal of the analysis is to use a variety of measurements from abalones to explain variability in the age.
Exercise 4
- Fit a model using Sex, Length, Diameter, Height, and Whole_Weight to understand variability in Age. Neatly display the model using 3 digits.
- Check the four model conditions: Linearity, Constant Variance, Normality, and Independence. For each condition: (1) state whether or not it is satisfied; (2) explain your response showing any visualizations and/or statistics used to make your assessment.
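The fit-and-check workflow can be sketched in base R as follows. This is only an illustration under stated assumptions: `abalone_sim` is a simulated stand-in with the same column names as the real data, and every generating value is hypothetical. With the packages above loaded, passing the output of `tidy()` to `kable(digits = 3)` gives a similar 3-digit display.

```r
# Simulated stand-in for the abalone data; all values are hypothetical.
set.seed(221)
n <- 200
abalone_sim <- data.frame(
  Sex    = factor(sample(c("M", "F", "I"), n, replace = TRUE)),
  Length = runif(n, 20, 150)
)
abalone_sim$Diameter     <- 0.8 * abalone_sim$Length + rnorm(n, sd = 4)
abalone_sim$Height       <- 0.3 * abalone_sim$Length + rnorm(n, sd = 3)
abalone_sim$Whole_Weight <- 1.5 * abalone_sim$Length + rnorm(n, sd = 8)
abalone_sim$Age          <- 3 + 0.05 * abalone_sim$Length + rnorm(n, sd = 2)

# Fit the main-effects model.
age_fit <- lm(Age ~ Sex + Length + Diameter + Height + Whole_Weight,
              data = abalone_sim)

# Coefficient table rounded to 3 digits.
coef_table <- round(summary(age_fit)$coefficients, 3)
print(coef_table)

# Residuals vs. fitted values: used to assess linearity and constant variance.
plot(fitted(age_fit), resid(age_fit),
     xlab = "Fitted age (years)", ylab = "Residual",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals: used to assess normality.
qqnorm(resid(age_fit))
qqline(resid(age_fit))
```

Independence is usually judged from how the data were collected rather than from a plot, so it has no counterpart in the sketch.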
Exercise 5
Now let’s take a look at the model diagnostics.
- Are there any influential observations in the data set? Briefly explain, showing any work or output used to make the determination.
- Consider the observation with the highest value for Cook’s distance. What is the value of leverage for this observation? Does this observation have large leverage? Briefly explain, showing any work or output used to make the determination.
- Again consider the observation with the highest value for Cook’s distance. What is the standardized residual for this observation? Is this observation an outlier? Briefly explain showing any work or output used to make the determination.
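The three diagnostics above can be computed with base R functions, as in this sketch. The data are simulated with one planted unusual point, and the cutoffs shown are common rules of thumb assumed for illustration; use the thresholds from the course notes in your own work.

```r
# Simulated data with one planted unusual point; all values are hypothetical.
set.seed(221)
n <- 100
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 2 + 0.5 * sim$x + rnorm(n)
sim[n, ] <- c(30, 40)  # far from the other x values and well off the trend

fit <- lm(y ~ x, data = sim)
p <- length(coef(fit)) - 1  # number of predictors

diagnostics <- data.frame(
  cooks_d   = cooks.distance(fit),  # influence
  leverage  = hatvalues(fit),       # diagonal of the hat matrix
  std_resid = rstandard(fit)        # standardized residuals
)

# Observation with the largest Cook's distance.
worst <- which.max(diagnostics$cooks_d)
print(diagnostics[worst, ])

# Rule-of-thumb cutoffs (assumptions, not the only conventions in use).
leverage_cutoff <- 2 * (p + 1) / n
print(diagnostics$leverage[worst] > leverage_cutoff)  # large leverage?
print(abs(diagnostics$std_resid[worst]) > 3)          # outlier?
print(diagnostics$cooks_d[worst] > 1)                 # influential?
```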
Exercise 6
Now let’s look at the relationship between predictors.
- Compute the Variance Inflation Factors (VIF) for the model from Exercise 4. Display the results.
- Use the equation for VIF to “manually” compute the VIF for Whole_Weight.
- What predictors appear to be collinear?
- Select a strategy to fit a model that does not have an issue with multicollinearity.
- Briefly describe your strategy.
- Select a final model.
- Briefly explain your selection, showing the work and statistics used to choose a final model.
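A manual VIF calculation can be sketched as follows, assuming the usual definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The data frame `predictors` and the helper `manual_vif()` are hypothetical names, and the data are simulated. In the model from Exercise 4, the auxiliary regression for Whole_Weight would also include Sex.

```r
# Simulated, deliberately collinear predictors; all values are hypothetical.
set.seed(221)
n <- 300
length_mm <- runif(n, 20, 150)
predictors <- data.frame(
  Length   = length_mm,
  Diameter = 0.8 * length_mm + rnorm(n, sd = 4),
  Height   = 0.3 * length_mm + rnorm(n, sd = 6)
)

# VIF for one predictor: regress it on all the others and use the R-squared.
manual_vif <- function(data, predictor) {
  others  <- setdiff(names(data), predictor)
  aux_fit <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(aux_fit)$r.squared)
}

vifs <- sapply(names(predictors), manual_vif, data = predictors)
print(round(vifs, 3))
```

The result can be compared with a packaged function such as `vif()` from rms applied to the fitted model.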
Data: 2000 U.S. Presidential Election [2]
We will examine data about the 2000 U.S. presidential election between George W. Bush and Al Gore. It was one of the closest elections in history and ultimately came down to the state of Florida. One county in particular, Palm Beach County, was at the center of the controversy due to the design of its ballots: the infamous butterfly ballots. It is believed that many people who intended to vote for Al Gore accidentally voted for Pat Buchanan because of how the spots to mark the candidate were arranged next to the names.
The variables in the data are
- County: County name
- Bush2000: Number of votes for George W. Bush
- Buchanan2000: Number of votes for Pat Buchanan
The data are available in the file florida-votes-2000.csv in the data folder of your repo.
Exercise 7
The goal is to fit a model that uses the number of votes for Bush to predict the number of votes for Buchanan. Using this model, we’ll investigate whether the data support the claim that votes for Gore may have accidentally gone to Buchanan.
- Visualize the relationship between the number of votes for Buchanan and the number of votes for Bush. Describe what you observe in the visualization, including a description of the relationship between the votes for Buchanan and votes for Bush.
- Which county has the extreme outlying number of votes for Buchanan? Create a new data frame that doesn’t include the outlying county. You will use this updated data frame for the remainder of this exercise and Exercise 8.
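The plot-then-filter step might look like the sketch below. It uses simulated data with hypothetical county names and a planted outlier, not the real Florida file. Picking the county with the largest Buchanan count works here only because the outlier was planted that way; with real data, identify the county from the plot.

```r
# Simulated stand-in for the vote data; names and values are hypothetical.
set.seed(221)
votes_sim <- data.frame(
  County   = paste("County", LETTERS[1:20]),
  Bush2000 = round(exp(runif(20, log(2000), log(300000))))
)
votes_sim$Buchanan2000 <- round(exp(-2.5 + 0.75 * log(votes_sim$Bush2000) +
                                      rnorm(20, sd = 0.3)))
votes_sim$Buchanan2000[7] <- 5000  # planted outlier

# Scatterplot with informative labels.
plot(votes_sim$Bush2000, votes_sim$Buchanan2000,
     xlab = "Votes for Bush", ylab = "Votes for Buchanan",
     main = "Buchanan vs. Bush votes by county (simulated)")

# Name the outlying county, then build a data frame without it.
outlier_county   <- votes_sim$County[which.max(votes_sim$Buchanan2000)]
votes_no_outlier <- subset(votes_sim, County != outlier_county)
print(outlier_county)
print(nrow(votes_no_outlier))
```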
Exercise 8
Now let’s consider potential models with transformations on the response and/or predictor variables. The four candidate models are the following:
Model | Response variable | Predictor variable |
---|---|---|
1 | Buchanan2000 | Bush2000 |
2 | log(Buchanan2000) | Bush2000 |
3 | Buchanan2000 | log(Bush2000) |
4 | log(Buchanan2000) | log(Bush2000) |
Which model best fits the data? Briefly explain, showing any work and output used to determine the response. (Note: Use the data set without the outlying county to find the candidate models.)
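One way to compare the four candidates is to fit them all and put their residual plots side by side, as sketched here with simulated data (all values hypothetical). Residual plots are used because R-squared is not directly comparable between models whose responses are on different scales, such as Buchanan2000 and log(Buchanan2000).

```r
# Simulated stand-in for the vote data without the outlier; values are hypothetical.
set.seed(221)
n <- 60
votes_sim <- data.frame(
  Bush2000 = round(exp(runif(n, log(2000), log(300000))))
)
votes_sim$Buchanan2000 <- round(exp(-2.5 + 0.75 * log(votes_sim$Bush2000) +
                                      rnorm(n, sd = 0.3)))

# The four candidate models from the table.
candidate_formulas <- list(
  "Model 1" = Buchanan2000 ~ Bush2000,
  "Model 2" = log(Buchanan2000) ~ Bush2000,
  "Model 3" = Buchanan2000 ~ log(Bush2000),
  "Model 4" = log(Buchanan2000) ~ log(Bush2000)
)
candidate_fits <- lapply(candidate_formulas, lm, data = votes_sim)

# Residuals vs. fitted values for each candidate, in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
for (model_name in names(candidate_fits)) {
  fit <- candidate_fits[[model_name]]
  plot(fitted(fit), resid(fit), main = model_name,
       xlab = "Fitted value", ylab = "Residual")
  abline(h = 0, lty = 2)
}
par(op)
```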
Exercise 9
Now we will use the model to predict the expected number of Buchanan votes for the outlier county.
Suppose the observed value of the predictor for this county (a new observation) is
Then the predicted response is
Where
Just as there is uncertainty in our model coefficients, there is uncertainty in our predictions as well. We use a confidence interval to quantify the uncertainty for a model coefficient, and we can use a prediction interval to quantify the uncertainty in the prediction for a new observation.
The
where
Use the model you chose in the previous exercise to compute the predicted number of votes for Buchanan in the outlying county identified in Exercise 7. If you selected a model with a transformation, be sure to report your answer in terms of votes, not log(votes).
Use the formula above to “manually” compute the 95% prediction interval for this county (do not obtain the interval using the predict function). If you selected a model with a transformation, be sure to report your answer in terms of votes, not log(votes).
It is assumed that some of the votes for Buchanan in that county were actually intended to be for Gore. Based on your results in the previous question, does your model support this claim?
If no, briefly explain.
If yes, about how many votes were possibly intended for Gore? Show any calculations and output used to determine your answer. If you selected a model with a transformation, be sure to report your answer in terms of votes, not log(votes).
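A manual prediction interval can be sketched as follows. The sketch assumes the standard form ŷ0 ± t* × sqrt(σ̂² (1 + x0ᵀ(XᵀX)⁻¹x0)), with t* taken from the t distribution on the residual degrees of freedom. The log-log model, the simulated data, and the new value `bush_new` are all hypothetical choices for illustration. The call to `predict()` at the end is only a sanity check on the sketch and is not a substitute for the manual calculation.

```r
# Simulated stand-in for the vote data without the outlier; values are hypothetical.
set.seed(221)
n <- 60
votes_sim <- data.frame(
  Bush2000 = round(exp(runif(n, log(2000), log(300000))))
)
votes_sim$Buchanan2000 <- round(exp(-2.5 + 0.75 * log(votes_sim$Bush2000) +
                                      rnorm(n, sd = 0.3)))

fit <- lm(log(Buchanan2000) ~ log(Bush2000), data = votes_sim)

# New observation: a hypothetical Bush vote count for the held-out county.
bush_new <- 150000
x0 <- c(1, log(bush_new))  # leading 1 is for the intercept

X          <- model.matrix(fit)
sigma2_hat <- sum(resid(fit)^2) / df.residual(fit)  # estimated error variance
y0_hat     <- sum(x0 * coef(fit))                   # prediction on the log scale
se_pred    <- sqrt(sigma2_hat *
                     (1 + drop(t(x0) %*% solve(t(X) %*% X) %*% x0)))
t_star     <- qt(0.975, df = df.residual(fit))

manual_pi_log   <- y0_hat + c(-1, 1) * t_star * se_pred
manual_pi_votes <- exp(manual_pi_log)  # back-transform to votes
print(exp(y0_hat))
print(manual_pi_votes)

# Sanity check of the manual arithmetic against predict().
check <- predict(fit, newdata = data.frame(Bush2000 = bush_new),
                 interval = "prediction", level = 0.95)
print(exp(check))
```

If the observed count for the county lies above the upper bound in votes, the difference between the observed count and the predicted count gives a rough sense of how many votes are unexplained by the model.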
Submission
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
If you write your responses to conceptual exercises by hand, you will need to combine your written work with the completed PDF for the applied exercises before submitting on Gradescope.
Instructions to combine PDFs:
Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): helpx.adobe.com/acrobat/using/merging-files-single-pdf.html
- Get free access to Adobe Acrobat as a Duke student: oit.duke.edu/help/articles/kb0030141/
To submit your assignment:
Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.
Grading
Component | Points |
---|---|
Ex 1 | 4 |
Ex 2 | 4 |
Ex 3 | 4 |
Ex 4 | 6 |
Ex 5 | 6 |
Ex 6 | 6 |
Ex 7 | 6 |
Ex 8 | 5 |
Ex 9 | 6 |
Workflow & formatting | 3 |
The “Workflow & formatting” grade assesses the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code, and your name and the date updated in the YAML.
Footnotes
[1] Exercise 2 is adapted from Montgomery, Peck, and Vining (2021).
[2] This analysis was motivated by exercises in (ledolter2003statistical?).