HW 01: Simple linear regression

Ice duration and air temperature in Madison, WI

Due date

This assignment is due on Thursday, January 30 at 11:59pm. To be considered on time, the following must be done by the due date:

Final .qmd and .pdf files pushed to your GitHub repo
Final .pdf file submitted on Gradescope

Introduction

You will use simple linear regression to analyze the relationship between air temperature and ice duration for two lakes in Madison, Wisconsin. You will also explore the mathematical properties of simple linear regression models.

Learning goals

In this assignment, you will…

use matrix operations to show results about simple linear regression.
conduct exploratory data analysis.
fit and interpret simple linear regression models.
evaluate model fit.
continue developing a workflow for reproducible data analysis.

Getting started

Go to the sta221-sp25 organization on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo and starting a new project in R.

Packages

The following packages are used in this assignment:

library(tidyverse)
library(tidymodels)
library(knitr)

# load other packages as needed

Conceptual exercises¹

Instructions

The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.

You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.

Exercise 1

a. Show that the hat matrix $H$ is symmetric $(H^{T} = H)$ and idempotent $(H^{2} = H)$ .

b. Show that $(I - H)$ is symmetric and idempotent.

Exercise 2

Let $x$ be a $k \times 1$ vector and $A$ be a symmetric $k \times k$ matrix, such that $A$ is not a function of $x$ .

Show that the gradient of $x^{T} A x$ with respect to $x$ is

$\nabla_{x} x^{T} A x = 2 A x$

(Property 2 from class)

Exercise 3

In class we used the sum of squared errors, $ϵ^{T} ϵ$ , to estimate the regression coefficients, $\hat{β} = (X^{T} X)^{- 1} X^{T} Y$ . To show this is the least squares estimate, we now need to show that we have, in fact, found the estimate of $β$ that minimizes the sum of

If the Hessian matrix $\nabla_{β}^{2} ϵ^{T} ϵ$ is positive definite, then we know we have found the $\hat{β}$ that minimizes the sum of squared errors, i.e., the least squares estimator.

Show that $\nabla_{β}^{2} ϵ^{T} ϵ \propto X^{T} X$ is positive definite.

Exercise 4

Prove that the maximum value of $R^{2}$ must be less than 1 if the data set contains observations such that there are different observed values of the response for the same value of the predictor (e.g., the data set contains observations $(x_{i}, y_{i})$ and $(x_{j}, y_{j})$ such that $x_{i} = x_{j}$ and $y_{i} \neq y_{j}$ ).

Exercise 5

Show that the sum of squared residuals (SSR) can be written as the following:

$y^{T} y - {\hat{β}}^{T} X^{T} y$

Applied exercises

Instructions

The applied exercises are focused on applying the concepts to analyze data.

All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.

Write all narrative using complete sentences and include informative axis labels / titles on visualizations.

Data

The datasets wi-icecover.csv and wi-air-temperature.csv contain information about ice cover and air temperature, respectively, at Lake Monona and Lake Mendota (both in Madison, Wisconsin) for days in 1886 through 2019. The data were obtained from the ntl_icecover and ntl_airtemp data frames in the lterdatasampler R package. They were originally collected by the US Long Term Ecological Research program (LTER) Network.

icecover <- read_csv("data/wi-icecover.csv")
airtemp <- read_csv("data/wi-air-temperature.csv")

The analysis will focus on the following variables:

year: year of observation
lakeid: lake name
ice_duration: number of days between the freeze and ice breakup dates of each lake
air_temp_avg: yearly average air temperature in Madison, WI (degrees Celsius)

Analysis goal

The goal of this analysis is to use linear regression explain variability in ice duration for lakes in Madison, WI based on air temperature. Because ice cover is impacted by various environmental factors, researchers are interested in examining the association between these two factors to better understand the changing climate.

Exercise 6

Let’s start by looking at the response variable ice_duration.

Visualize the distribution of ice duration versus year with separate lines for each lake.
There are separate yearly measurements for each lake in the icecover data frame. In this analysis, we will combine the data from both lakes and use the average ice duration each year.

Comment on the analysis choice to use the average per year rather than the individual lake measurements. Some things to consider in your comments: Does the average accurately reflects the ice duration for these lakes in a given year year? Will there be information lost? How might that impact (or not) the analysis conclusions? Etc.

Tip

See the ggplot2 reference for example code and plots.

Exercise 7

Next, let’s combine the ice duration and air temperature data into a single analysis data frame.

Fill in the code below to create a new data frame, icecover_avg, of the average ice duration by year.

Then join icecover_avg and airtemp to create a new data frame. The new data frame should have 134 observations.
```
icecover_avg <- icecover |>
  group_by(_____) |>
  summarise(_____) |>
  ungroup()
```

Important

You will use the new data frame with average ice duration and average air temperature for the remainder of the assignment.

Visualize the relationship between the air temperature and average ice duration. Do you think a linear model is a reasonable choice to model the relationship between the two variables? Briefly explain.

Now is a good time to render your document again if you haven’t done so recently and commit (with a meaningful commit message) and push all updates.

Exercise 8

We will fit a model using the average air temperature to explain variability in ice duration. The model takes the form

$y = X β + ϵ$

State the dimensions of $y$ , $X$ , $β$ , $ϵ$ for this analysis. Your answer should have exact values given this data set.
Estimate the regression coefficients $\hat{β}$ in R using the matrix representation. Show the code used to get the answer.
Check your results from part (b) by using the lm function to fit the model. Neatly display your results using 3 digits.

Exercise 9

Calculate $R^{2}$ for the model in the previous exercise and interpret it in the context of the data.
Calculate $R M S E$ for the model from the previous exercise and interpret it in the context of the data.
Comment on the model fit based on $R^{2}$ and $R M S E$ .

Exercise 10

a. Interpret the slope in the context of the data.

b. The average air temperature in 2019, the most recent year in the data set, was 7.925 degrees Celsius. What was the predicted ice duration for 2019? What is the residual?

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

If you write your responses to conceptual exercises by hand, you will need to combine your written work to the completed PDF for the applied exercises before submitting on Gradescope.

Instructions to combine PDFs:

Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): helpx.adobe.com/acrobat/using/merging-files-single-pdf.html
- Get free access to Adobe Acrobat as a Duke student: oit.duke.edu/help/articles/kb0030141/

To submit your assignment:

Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.

Grading (50 points)

Component	Points
Ex 1	4
Ex 2	5
Ex 3	4
Ex 4	4
Ex 5	4
Ex 6	5
Ex 7	5
Ex 8	6
Ex 9	5
Ex 10	4
Workflow & formatting	4

The “Workflow & formatting” grade is to assess the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date updated in the YAML.

Footnotes

Exercise 4 is adapted from Montgomery, Peck, and Vining (2021) .↩︎

Introduction

Learning goals

Getting started

Packages

Conceptual exercises1

Instructions

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Applied exercises

Instructions

Data

Analysis goal

Exercise 6

Exercise 7

Exercise 8

Exercise 9

Exercise 10

Submission

Grading (50 points)

Footnotes

Conceptual exercises¹