library(tidyverse)
library(tidymodels) # contains broom, yardstick, and other modeling packages
library(knitr)
# add other packages as needed
Lab 02: Linear regression
This lab is due on Tuesday, January 28 at 11:59pm. To be considered on time, the following must be done by the due date:
- Final .qmd and .pdf files pushed to your GitHub repo
- Final .pdf file submitted on Gradescope
Introduction
In today’s lab you will analyze data from over 1,000 different coffees and use linear regression to explore the relationship between a coffee’s aroma, flavor, and overall quality.
Learning goals
By the end of the lab, you will…
- Continue developing a reproducible workflow using RStudio and GitHub
- Produce visualizations and summary statistics to describe distributions
- Fit, interpret, and evaluate linear regression models
- Use the matrix representation of the linear regression model to estimate coefficients
- Explore properties of the linear regression model
Getting Started
Go to the sta221-sp25 organization on GitHub. Click on the repo with the prefix lab-02. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo and starting a new project in R.
Packages
You will need the following packages for the lab: tidyverse, tidymodels, and knitr.
Data: Coffee ratings
The data set for this lab comes from the Coffee Quality Database and was obtained from the #TidyTuesday GitHub repo. It includes information about the origin, producer, measures of various characteristics, and the quality measure for over 1,000 coffees.
This lab will focus on the following variables:
- total_cup_points: Total number of points, indicating overall quality, 0 (worst quality) - 100 (best quality)
- aroma: Aroma grade, 0 (worst aroma) - 10 (best aroma)
- flavor: Flavor grade, 0 (worst flavor) - 10 (best flavor)
Click here for the definitions of all variables in the data set. Click here for more details about how these measures are obtained.
coffee <- read_csv("data/coffee-ratings.csv")
Exercises
Goal: The goal of this analysis is to use a coffee’s aroma and flavor to understand variability in its total cup points.
Write all code and narrative in your Quarto file. Write all narrative in complete sentences. Throughout the assignment, you should periodically render your Quarto document to produce the updated PDF, commit the changes in the Git pane, and push the updated files to GitHub.
Make sure we can read all of your code in your PDF document. This means you will need to break up long lines of code. One way to avoid long lines of code is to start a new line after every pipe (|>) and plus sign (+).
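As a minimal sketch of this line-breaking style, the code below uses R's built-in mtcars data as a stand-in for the lab data (an assumption made only for illustration). The object names mpg_summary and mpg_plot are hypothetical.

```r
library(tidyverse)

# Built-in mtcars data stands in for the lab data (illustration only)
# New line after every pipe
mpg_summary <- mtcars |>
  group_by(cyl) |>
  summarise(
    mean_mpg = mean(mpg),
    sd_mpg = sd(mpg)
  )

# New line after every plus sign
mpg_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(
    title = "Distribution of fuel efficiency",
    x = "Miles per gallon",
    y = "Count"
  )
```

Each step sits on its own line, so the code stays within the page width when the document is rendered to PDF.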
Exercise 1
We begin with univariate exploratory data analysis.
- Visualize the distribution of the response variable total_cup_points and calculate summary statistics.
- Comment on the features of the distribution of this variable by describing the shape, center, spread, and presence of potential outliers.
- Based on this distribution, do you think the data set is representative of all coffee available to consumers? Briefly explain.
Make sure your data visualizations have clear and informative titles and axis labels.
Exercise 2
Now let’s consider the relationship between how good a coffee smells and its overall quality.
- Visualize the relationship between aroma and total_cup_points.
- Does there appear to be a relationship between a coffee’s aroma and its overall quality? If so, what is the shape and direction of the relationship?
Exercise 3
We have seen the mathematical formulation for simple linear regression in class. In particular, given a response variable $Y$ and a predictor variable $X$, the model is
$$E(Y \mid X) = \beta_0 + \beta_1 X$$
for some unknown regression coefficients for intercept and slope, $\beta_0$ and $\beta_1$.
Answer the following questions about simple linear regression. Your response should be in general terms about regression, not specific to the coffee data.
- What does $E(Y \mid X)$ mean in terms of a given value of $X$?
- What are the interpretations of the coefficients $\beta_0$ and $\beta_1$ in terms of the expected value of $Y$?
This is a good place to render, commit, and push changes to your lab-02 repo on GitHub. Write an informative commit message (e.g., “Completed exercises 1 - 3”), and push every file to GitHub by clicking the check box next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.
Exercise 4
- Fit the model of the relationship between aroma and total_cup_points. Neatly display the output using 3 digits.
- Interpret the slope in the context of the data.
- What is the expected total_cup_points for coffees that receive the worst aroma score of 0? Is this a reliable estimate of the total_cup_points for these coffees? Briefly explain why or why not.
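One possible way to fit a model and display its output with 3 digits is sketched below. It uses the built-in mtcars data rather than the coffee data, and the names toy_fit and toy_table are hypothetical. It assumes tidy() (from broom, loaded with tidymodels) and kable() (from knitr) are the display tools you want; other approaches work too.

```r
library(tidymodels) # provides tidy() via broom
library(knitr)

# Built-in mtcars data stands in for the lab data (illustration only)
toy_fit <- lm(mpg ~ wt, data = mtcars)

# tidy() returns one row per coefficient; kable() rounds to 3 digits
toy_table <- tidy(toy_fit) |>
  kable(digits = 3)

toy_table
```

The resulting table has one row for the intercept and one for the slope, with columns for the estimate, standard error, test statistic, and p-value.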
Exercise 5
Now let’s add flavor to the model, so we will use both flavor and aroma to understand variability in the overall quality of coffees. Use this model for the remainder of the lab.
In class we have seen how vectors and matrices can be used to represent the linear regression model:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
- State the dimensions of $\mathbf{y}$, $\mathbf{X}$, $\boldsymbol{\beta}$, and $\boldsymbol{\epsilon}$ for this model. Your answer should have exact values given the coffee data set.
- Compute the estimated regression coefficients using the matrix form of the model. Show the code used to get the answer.
You can use the model.matrix() function to get the design matrix. The code takes the general form:
model.matrix(y ~ x, data = my_data)
See Lab 01 for other matrix operations in R.
- Check your results from part (b) by using the lm function to fit the model. Neatly display your results using 3 digits.
- Write the estimated regression equation.
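The matrix workflow described in this exercise can be sketched on a toy example. The code below uses the built-in mtcars data with two predictors (an assumption for illustration, not the coffee data) and the standard least-squares estimate $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$. The names X, y, beta_hat, and lm_fit are hypothetical.

```r
# Built-in mtcars data stands in for the lab data (illustration only)
X <- model.matrix(mpg ~ wt + hp, data = mtcars) # design matrix with intercept column
y <- mtcars$mpg                                 # response vector

# Dimensions of the design matrix: rows = observations, columns = coefficients
dim(X)

# Least-squares estimate: solve (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Compare with the coefficients reported by lm()
lm_fit <- lm(mpg ~ wt + hp, data = mtcars)
all.equal(as.numeric(beta_hat), unname(coef(lm_fit)))
```

Calling solve() with two arguments solves the linear system directly instead of forming the inverse explicitly; solve(t(X) %*% X) %*% t(X) %*% y gives the same estimate and mirrors the formula more literally.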
This is a good place to render, commit, and push changes to your lab-02 repo on GitHub. Write an informative commit message (e.g., “Completed exercises 4 - 5”), and push every file to GitHub by clicking the check box next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.
Exercise 6
- The coefficient for aroma for the model fit in Exercise 5 is different than the coefficient from the model fit in Exercise 4. Briefly explain why these coefficients are different.
- Would you willingly drink a coffee represented by the intercept of the model in Exercise 5? Briefly explain why or why not.
Exercise 7
Compute $\mathbf{H}$, the hat matrix corresponding to the model from Exercise 5. Then use $\mathbf{H}$ to compute the residuals for this model. Do not print out $\mathbf{H}$ or the residuals.
Compute the mean and standard deviation of the residuals.
Recall the root mean square error, $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$.
Similar to other statistics we’ve seen thus far, we can write the RMSE in matrix form. Compute RMSE for the model from Exercise 5 using matrix form. Show the code used to get the answer.
How do the standard deviation of the residuals and RMSE compare?
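A minimal sketch of these matrix computations on a toy example follows. It uses the built-in mtcars data instead of the coffee data and assumes the RMSE definition that divides the sum of squared residuals by $n$; if your class notes use a different denominator, adjust accordingly. The names H, e, rmse, and lm_fit are hypothetical.

```r
# Built-in mtcars data stands in for the lab data (illustration only)
X <- model.matrix(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg
n <- nrow(X)

# Hat matrix: H = X (X'X)^{-1} X'
H <- X %*% solve(t(X) %*% X) %*% t(X)

# Residuals: e = (I - H) y
e <- (diag(n) - H) %*% y

# RMSE in matrix form, assuming the divide-by-n definition: sqrt(e'e / n)
rmse <- sqrt(as.numeric(t(e) %*% e) / n)

# Check the residuals against those from lm()
lm_fit <- lm(mpg ~ wt + hp, data = mtcars)
all.equal(as.numeric(e), unname(resid(lm_fit)))
```

Storing H and e in objects without printing them keeps the rendered PDF short, since H has one row and one column per observation.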
You’re done and ready to submit your work! Render, commit, and push all remaining changes. You can use the commit message “Done with Lab 02!”. Make sure you have pushed all the files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub. The PDF document you submit to Gradescope should be identical to the one in your GitHub repo.
Submission
You will submit the PDF documents for labs, homework, and exams to Gradescope as part of your final submission.
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
To submit your assignment:
- Access Gradescope through the menu on the STA 221 Canvas site.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
- Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.
Grading
Component | Points |
---|---|
Ex 1 | 6 |
Ex 2 | 5 |
Ex 3 | 7 |
Ex 4 | 7 |
Ex 5 | 8 |
Ex 6 | 5 |
Ex 7 | 8 |
Workflow & formatting | 4 |
The “Workflow & formatting” grade is to assess the reproducible workflow and document format. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date updated in the YAML.