Lab 02: Linear regression

Due date

This lab is due on Tuesday, January 28 at 11:59pm. To be considered on time, the following must be done by the due date:

Final .qmd and .pdf files pushed to your GitHub repo
Final .pdf file submitted on Gradescope

Introduction

In today’s lab you will analyze data from over 1,000 different coffees and use linear regression to explore the relationship between a coffee’s aroma, flavor, and overall quality using linear regression.

Learning goals

By the end of the lab, you will…

Continue developing a reproducible workflow using RStudio and GitHub
Produce visualizations and summary statistics to describe distributions
Fit, interpret, and evaluate linear regression models
Use the matrix representation of the linear regression model to estimate coefficients
Explore properties of the linear regression model

Getting Started

Go to the sta221-sp25 organization on GitHub. Click on the repo with the prefix lab-02. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo and starting a new project in R.

Packages

You will need the following packages for the lab:

library(tidyverse)
library(tidymodels) # contains broom, yardstick, and other modeling packages
library(knitr)

# add other packages as needed

Data: Coffee ratings

The data set for this lab comes from the Coffee Quality Database and was obtained from the #TidyTuesday GitHub repo. It includes information about the origin, producer, measures of various characteristics, and the quality measure for over 1,000 coffees.

This lab will focus on the following variables:

total_cup_points: Total number of points, indicating overall quality, 0 (worst quality) - 100 (best quality)
aroma: Aroma grade, 0 (worst aroma) - 10 (best aroma)
flavor: Flavor grade, 0 (worst flavor) - 10 (best flavor)

Click here for the definitions of all variables in the data set. Click here for more details about how these measures are obtained.

coffee <- read_csv("data/coffee-ratings.csv")

Exercises

Goal: The goal of this analysis is to use a coffee’s aroma and flavor to understand variability in its total cup points.

Write all code and narrative in your Quarto file. Write all narrative in complete sentences. Throughout the assignment, you should periodically render your Quarto document to produce the updated PDF, commit the changes in the Git pane, and push the updated files to GitHub.

Important

Make sure we can read all of your code in your PDF document. This means you will need to break up long lines of code. One way to help avoid long lines of code is is start a new line after every pipe (|>) and plus sign (+).

Exercise 1

We begin with univariate exploratory data analysis.

Visualize the distribution of the response variable total_cup_points and calculate summary statistics.
Comment on the features of the distribution of this variable by describing the shape, center, spread, and presence of potential outliers.
Based on this distribution, do you think the data set is representative of all coffee available to consumers? Briefly explain.

Tip

Make sure your data visualizations have clear and informative titles and axis labels.

Exercise 2

Now let’s consider the relationship between how good a coffee smells and its overall quality.

Visualize the relationship between aroma and total_cup_points.
Does there appear to be a relationship between a coffee’s aroma and its overall quality? If so, what is the shape and direction of the relationship?

Exercise 3

We have seen the mathematical formulation for simple linear regression in class. In particular, given a response variable $Y$ and predictor variable $X$ , the simple linear regression model is $Y = β_{0} + β_{1} X + ϵ$

for some unknown regression coefficients for intercept and slope $(β_{0}, β_{1})$ and error terms $ϵ$ that are centered at 0 and have variance $σ_{ϵ}^{2}$ . This means that the expected value of each observation lies on the regression line
$E (Y | X) = β_{0} + β_{1} X$

Answer the following questions about simple linear regression. Your response should be in general terms about regression, not be specific to the coffee data.

What does $E (Y | X) = β_{0} + β_{1} X$ mean in terms of a given value of $X$ ?
What are the interpretations of the coefficients $β_{0}$ and $β_{1}$ in terms of the expected value of $Y$ ?

This is a good place to render, commit, and push changes to your lab-01 repo on GitHub. Write an informative commit message (e.g., “Completed exercises 1 - 3”), and push every file to GitHub by clicking the check box next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.

Exercise 4

Fit the model of the relationship between aroma and total_cup_points. Neatly display the output using 3 digits.
Interpret the slope in the context of the data.
What is the expected total_cup_points for coffees that receive the worst aroma score of 0? Is this a reliable estimate of the total_cup_points for these coffees? Briefly explain why or why not.

Exercise 5

Now let’s add flavor to the model, so we will use both flavor and aroma to understand variability in the overall quality of coffees. Use this model for the remainder of the lab.

In class we have seen how vectors and matrices can be used to represent the linear regression model:

$y = X β + ϵ$

State the dimensions of $y, X, β, ϵ$ for this model. Your answer should have exact values given the coffee data set.
Compute the estimated regression coefficients using the matrix form of the model. Show the code used to get the answer.

Tip

You can use the model.matrix() function to get the design matrix. The code takes the general form:

model.matrix(y ~ x, data = my_data)

See Lab 01 for other matrix operations in R.

Check your results from part (b) by using the lm function to fit the model. Neatly display your results using 3 digits.
Write the estimated regression equation.

This is a good place to render, commit, and push changes to your lab-01 repo on GitHub. Write an informative commit message (e.g., “Completed exercises 4 - 5”), and push every file to GitHub by clicking the check box next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.

Exercise 6

The coefficient for aroma for the model fit in Exercise 5 is different than the coefficient from the model fit in Exercise 4. Briefly explain why these coefficients are different.
Would you willingly drink a coffee represented by the intercept of the model in Exercise 5? Briefly explain why or why not.

Exercise 7

Compute $H$ , the hat matrix corresponding to the model from Exercise 5. Then use $H$ to compute the residuals for this model. Do not print out $H$ or the residuals.
Compute the mean and standard deviation of the residuals.
Recall root mean square error RMSE

$R M S E = \sqrt{\frac{\sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}}{n}}$

Similar to other statistics we’ve seen thus far, we can write the RMSE in matrix form. Compute RMSE for the model from Exercise 5 using matrix form. Show the code used to get the answer.
How do the standard deviation of the residuals and RMSE compare?

You’re done and ready to submit your work! render, commit, and push all remaining changes. You can use the commit message “Done with Lab 02!”, and make sure you have pushed all the files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub. The PDF document you submit to Gradescope should be identical to the one in your GitHub repo.

Submission

You will submit the PDF documents for labs, homework, and exams in to Gradescope as part of your final submission.

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.

Grading

Component	Points
Ex 1	6
Ex 2	5
Ex 3	7
Ex 4	7
Ex 5	8
Ex 6	5
Ex 7	8
Workflow & formatting	4

The “Workflow & formatting” grade is to assess the reproducible workflow and document format. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date updated in the YAML.