library(tidyverse)
library(tidymodels)
library(knitr)
# load other packages as needed
HW 01: Simple linear regression
Ice duration and air temperature in Madison, WI
This assignment is due on Thursday, January 30 at 11:59pm. To be considered on time, the following must be done by the due date:
- Final .qmd and .pdf files pushed to your GitHub repo
- Final .pdf file submitted on Gradescope
Introduction
You will use simple linear regression to analyze the relationship between air temperature and ice duration for two lakes in Madison, Wisconsin. You will also explore the mathematical properties of simple linear regression models.
Learning goals
In this assignment, you will…
- use matrix operations to show results about simple linear regression.
- conduct exploratory data analysis.
- fit and interpret simple linear regression models.
- evaluate model fit.
- continue developing a workflow for reproducible data analysis.
Getting started
Go to the sta221-sp25 organization on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo and starting a new project in R.
Packages
The following packages are used in this assignment:
Conceptual exercises [1]
Instructions
The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.
You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.
Exercise 1
a. Show that the hat matrix
b. Show that
Exercise 2
Let
Show that the gradient of
(Property 2 from class)
Exercise 3
In class we used the sum of squared errors,
If the Hessian matrix
Show that
Exercise 4
Prove that the maximum value of
Exercise 5
Show that the sum of squared residuals (SSR) can be written as the following:
Applied exercises
Instructions
The applied exercises are focused on applying the concepts to analyze data.
All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.
Write all narrative using complete sentences and include informative axis labels / titles on visualizations.
Data
The datasets wi-icecover.csv and wi-air-temperature.csv contain information about ice cover and air temperature, respectively, at Lake Monona and Lake Mendota (both in Madison, Wisconsin) for days in 1886 through 2019. The data were obtained from the ntl_icecover and ntl_airtemp data frames in the lterdatasampler R package. They were originally collected by the US Long Term Ecological Research program (LTER) Network.
icecover <- read_csv("data/wi-icecover.csv")
airtemp <- read_csv("data/wi-air-temperature.csv")
The analysis will focus on the following variables:
- year: year of observation
- lakeid: lake name
- ice_duration: number of days between the freeze and ice breakup dates of each lake
- air_temp_avg: yearly average air temperature in Madison, WI (degrees Celsius)
Analysis goal
The goal of this analysis is to use linear regression to explain variability in ice duration for lakes in Madison, WI based on air temperature. Because ice cover is impacted by various environmental factors, researchers are interested in examining the association between these two factors to better understand the changing climate.
Exercise 6
Let’s start by looking at the response variable ice_duration.
- Visualize the distribution of ice duration versus year with separate lines for each lake.
- There are separate yearly measurements for each lake in the icecover data frame. In this analysis, we will combine the data from both lakes and use the average ice duration each year. Comment on the analysis choice to use the average per year rather than the individual lake measurements. Some things to consider in your comments: Does the average accurately reflect the ice duration for these lakes in a given year? Will information be lost? How might that impact (or not) the analysis conclusions?
See the ggplot2 reference for example code and plots.
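As a minimal sketch of a plot with one line per group over time, the code below uses ggplot2 on a small made-up data frame. The toy_ice object and its values are hypothetical; only the column names mirror the variables listed in the Data section, and the labels are one possible choice.

```r
library(ggplot2)

# Toy data (hypothetical values, not read from wi-icecover.csv)
toy_ice <- data.frame(
  year = rep(2001:2005, times = 2),
  lakeid = rep(c("Lake A", "Lake B"), each = 5),
  ice_duration = c(100, 95, 104, 90, 88, 98, 97, 101, 92, 85)
)

# Mapping the lake identifier to color draws a separate line for each lake
p <- ggplot(toy_ice, aes(x = year, y = ice_duration, color = lakeid)) +
  geom_line() +
  labs(
    x = "Year",
    y = "Ice duration (days)",
    color = "Lake",
    title = "Ice duration over time by lake (toy data)"
  )

p
```

Mapping the grouping variable to color (rather than only to group) also produces a legend, which makes the two lines easy to tell apart.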
Exercise 7
Next, let’s combine the ice duration and air temperature data into a single analysis data frame.
- Fill in the code below to create a new data frame, icecover_avg, of the average ice duration by year. Then join icecover_avg and airtemp to create a new data frame. The new data frame should have 134 observations.
icecover_avg <- icecover |>
  group_by(_____) |>
  summarise(_____) |>
  ungroup()
You will use the new data frame with average ice duration and average air temperature for the remainder of the assignment.
- Visualize the relationship between the air temperature and average ice duration. Do you think a linear model is a reasonable choice to model the relationship between the two variables? Briefly explain.
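The summarise-then-join pattern used in this exercise can be sketched on toy data. Everything below (toy_measure, toy_covariate, and the columns period, unit, value, covariate) is hypothetical and chosen only to illustrate the pattern, not to match the assignment data.

```r
library(dplyr)

# Toy inputs (hypothetical names and values)
toy_measure <- tibble(
  period = c(1, 1, 2, 2, 3, 3),
  unit   = c("A", "B", "A", "B", "A", "B"),
  value  = c(10, 14, 8, 12, 9, 5)
)
toy_covariate <- tibble(
  period    = c(1, 2, 3),
  covariate = c(0.5, 1.2, 0.9)
)

# Collapse to one row per period by averaging across units
toy_avg <- toy_measure |>
  group_by(period) |>
  summarise(value_avg = mean(value)) |>
  ungroup()

# Join on the shared key; each period appears once in both tables
toy_joined <- toy_avg |>
  left_join(toy_covariate, by = "period")

# Sanity check: the joined table should have one row per period
nrow(toy_joined)
```

Checking the row count after a join is a quick way to catch duplicated or unmatched keys.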
Now is a good time to render your document again if you haven’t done so recently and commit (with a meaningful commit message) and push all updates.
Exercise 8
We will fit a model using the average air temperature to explain variability in ice duration. The model takes the form
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
a. State the dimensions of $\mathbf{y}$, $\mathbf{X}$, $\boldsymbol{\beta}$, $\boldsymbol{\epsilon}$ for this analysis. Your answer should have exact values given this data set.
b. Estimate the regression coefficients $\hat{\boldsymbol{\beta}}$ in R using the matrix representation. Show the code used to get the answer.
c. Check your results from part (b) by using the lm function to fit the model. Neatly display your results using 3 digits.
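A minimal sketch of least-squares estimation in matrix form, compared against lm, is shown below on simulated data. The data, the seed, and the object names (X, beta_hat, fit) are assumptions made for illustration; the estimate uses the standard normal equations, solving $(\mathbf{X}^\mathsf{T}\mathbf{X})\mathbf{b} = \mathbf{X}^\mathsf{T}\mathbf{y}$.

```r
# Simulated data (hypothetical; stands in for a predictor and a response)
set.seed(221)
n <- 20
x <- runif(n, min = 5, max = 10)
y <- 150 - 6 * x + rnorm(n, sd = 4)

# Design matrix: a column of 1s for the intercept, then the predictor
X <- cbind(intercept = 1, x = x)
dim(X)  # n rows, 2 columns

# Least-squares estimate: solve (X'X) b = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Same model via lm()
fit <- lm(y ~ x)

# The two sets of estimates should agree up to numerical error
round(drop(beta_hat), 3)
round(unname(coef(fit)), 3)
```

Calling solve with two arguments avoids forming the inverse explicitly, which is usually the more numerically stable choice.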
Exercise 9
- Calculate for the model in the previous exercise and interpret it in the context of the data.
- Calculate for the model from the previous exercise and interpret it in the context of the data.
- Comment on the model fit based on and .
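Two common summaries of fit for a linear model are $R^2$ and RMSE. Assuming statistics of that kind are intended here, the sketch below computes both from the residuals of a model fit to simulated data. The data and object names are hypothetical, and the RMSE shown divides by n, which is one of several conventions.

```r
# Simulated data (hypothetical values)
set.seed(221)
x <- runif(30, min = 5, max = 10)
y <- 150 - 6 * x + rnorm(30, sd = 4)
fit <- lm(y ~ x)

res <- resid(fit)
ss_resid <- sum(res^2)             # sum of squared residuals
ss_total <- sum((y - mean(y))^2)   # total sum of squares

# R-squared: share of variability in y explained by the model
r_sq <- 1 - ss_resid / ss_total

# RMSE: typical size of a residual, in the units of y
# (assumes division by n; some definitions divide by n - 2 instead)
rmse <- sqrt(mean(res^2))

c(r_squared = r_sq, rmse = rmse)
```

Because RMSE is in the units of the response, it is often easier to interpret alongside the range of the observed values.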
Exercise 10
a. Interpret the slope in the context of the data.
b. The average air temperature in 2019, the most recent year in the data set, was 7.925 degrees Celsius. What was the predicted ice duration for 2019? What is the residual?
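As a sketch of how a prediction and its residual relate, the code below fits a simple linear regression to a small made-up data set, predicts the response at one predictor value with predict, and subtracts the prediction from the observed value. The toy data frame and all object names are hypothetical.

```r
# Toy data (hypothetical values, not the assignment data)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)
fit <- lm(y ~ x, data = toy)

# Predicted response at a chosen predictor value
new_obs <- data.frame(x = 5)
y_hat <- unname(predict(fit, newdata = new_obs))

# Residual = observed - predicted
y_obs <- 10.1
residual <- y_obs - y_hat

c(predicted = y_hat, residual = residual)
```

A positive residual means the observed value is higher than the model predicts; a negative residual means it is lower.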
Submission
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
If you write your responses to conceptual exercises by hand, you will need to combine your written work with the completed PDF for the applied exercises before submitting on Gradescope.
Instructions to combine PDFs:
Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): helpx.adobe.com/acrobat/using/merging-files-single-pdf.html
- Get free access to Adobe Acrobat as a Duke student: oit.duke.edu/help/articles/kb0030141/
To submit your assignment:
Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.
Grading (50 points)
Component | Points |
---|---|
Ex 1 | 4 |
Ex 2 | 5 |
Ex 3 | 4 |
Ex 4 | 4 |
Ex 5 | 4 |
Ex 6 | 5 |
Ex 7 | 5 |
Ex 8 | 6 |
Ex 9 | 5 |
Ex 10 | 4 |
Workflow & formatting | 4 |
The “Workflow & formatting” grade is to assess the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date updated in the YAML.
Footnotes
[1] Exercise 4 is adapted from Montgomery, Peck, and Vining (2021).