HW 02: Multiple linear regression

Due date

This assignment is due on Thursday, February 13 at 11:59pm.

Introduction

The goal of this assignment is to use multiple linear regression to analyze the relationships between various features and the Amazon.com price for LEGO® sets and to draw conclusions using statistical inference.

Learning goals

In this assignment, you will…

explore mathematical properties of linear regression models.
use multiple linear regression to model the relationship between three or more variables.
draw conclusions from the model using statistical inference.
fit and interpret models with interaction terms.

Getting started

Go to the sta221-sp25 organization on GitHub. Click on the repo with the prefix hw-02. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo and starting a new project in R.

Packages

The following packages are used in this assignment:

Conceptual exercises¹

Instructions

The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.

You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.

Exercise 1

In lecture, we defined the hat matrix $H$ as a projection matrix that projects $y$ onto $\text{Col}(X)$ and discussed the properties of a projection matrix. You have previously shown that $H$ is symmetric and $H$ is idempotent. Now we will focus on two other properties.

Show that for any vector $v$ in $\text{Col}(X)$, $Hv = v$.
Show that for any vector $v$ orthogonal to $\text{Col}(X)$, $Hv = 0$.
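
A numerical check is not a proof, but it can help confirm what the two properties above claim before you write the argument. The sketch below assumes a small simulated design matrix; the names `X`, `v_in`, and `v_perp` are hypothetical and are not part of the assignment data.

```r
# Numerical sanity check of the two projection properties (not a proof).
set.seed(221)                       # arbitrary seed, for reproducibility
n <- 20
X <- cbind(1, rnorm(n), rnorm(n))   # simulated design matrix with an intercept column

# Hat matrix H = X (X^T X)^{-1} X^T
H <- X %*% solve(crossprod(X)) %*% t(X)

# A vector in Col(X): any linear combination of the columns of X
v_in <- X %*% c(2, -1, 0.5)

# A vector orthogonal to Col(X): a column of the complete Q factor beyond rank(X)
Q <- qr.Q(qr(X), complete = TRUE)
v_perp <- Q[, ncol(X) + 1]

all.equal(H %*% v_in, v_in)   # compares H v with v, up to floating-point error
max(abs(H %*% v_perp))        # largest entry of H v for the orthogonal vector
```

The orthogonal vector is taken from the QR decomposition rather than built from $H$ itself, so the second check does not rely on the property being tested.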

Exercise 2

Show that the following is true for the residuals from a linear regression model: $e = (I - H) ϵ$.
Find $E(e)$, the expected value of the residuals.
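
The residual identity above can be checked numerically on simulated data, where the errors are known because we generate them ourselves. This is only a sanity check under assumed values (`beta`, `epsilon`, and the design matrix are hypothetical), not a substitute for the derivation.

```r
# Numerical check (not a proof) that e = (I - H) epsilon on simulated data.
set.seed(221)
n <- 50
X <- cbind(1, runif(n), runif(n))    # simulated design matrix
beta <- c(1, 2, -3)                  # hypothetical true coefficients
epsilon <- rnorm(n, sd = 2)          # errors, known only because we simulate them
y <- X %*% beta + epsilon

H <- X %*% solve(crossprod(X)) %*% t(X)

e_fit <- y - H %*% y                 # residuals from the fitted values H y
e_eps <- (diag(n) - H) %*% epsilon   # right-hand side of the identity

all.equal(e_fit, e_eps)              # compares the two vectors up to rounding error
```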

Exercise 3

Let $\hat{β}$ be the least-squares estimator for a linear regression model. Show that $\text{Var}(\hat{β}) = σ_{ϵ}^{2} (X^{T} X)^{-1}$.
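
A small simulation can illustrate the variance result above: hold $X$ fixed, generate many response vectors, and compare the sample covariance of the estimates with $σ_{ϵ}^{2} (X^{T} X)^{-1}$. The values of `n`, `sigma`, `beta`, and the number of replications are arbitrary assumptions chosen for illustration.

```r
# Simulation sketch: empirical covariance of beta-hat vs. sigma^2 (X^T X)^{-1}.
set.seed(221)
n <- 40
sigma <- 1.5                       # assumed error standard deviation
X <- cbind(1, rnorm(n))            # fixed simulated design matrix
beta <- c(2, 0.5)                  # hypothetical true coefficients
XtX_inv <- solve(crossprod(X))

# Refit on many simulated responses, holding X fixed
beta_hats <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  drop(XtX_inv %*% crossprod(X, y))   # least-squares estimate for this draw
})

empirical   <- cov(t(beta_hats))   # sample covariance of the 5000 estimates
theoretical <- sigma^2 * XtX_inv

round(empirical, 4)
round(theoretical, 4)
```

The two matrices should agree closely but not exactly, since the empirical version carries simulation error.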

Exercise 4

Suppose we fit the model $y = X_{1} β_{1} + ϵ$ when the true model is actually given by $y = X_{1} β_{1} + X_{2} β_{2} + ϵ$. Assume $E(ϵ) = 0$ for both models.

Find $E({\hat{β}}_{1})$, the expected value of the least-squares estimate ${\hat{β}}_{1}$.
Under what condition does $E({\hat{β}}_{1}) = β_{1}$? What is the relationship between $X_{1}$ and $X_{2}$ under this condition?

Exercise 5

We conduct least-squares regression analysis with certain assumptions underlying the regression model. Consider the linear regression model:

$$y = X β + ϵ, \quad ϵ \sim N(0, σ_{ϵ}^{2} I) \tag{1}$$

This model relies on four assumptions:

Linearity: There is a linear relationship between the response and predictor variables.
Constant Variance: The variability about the least squares line is constant for every value of the predictor.
Normality: The distribution of the residuals is approximately normal.
Independence: The residuals are independent from one another.

For each assumption, state the components of Equation 1 that are used to represent it.

Applied exercises

Instructions

The applied exercises are focused on applying the concepts to analyze data.

All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.

Write all narrative using complete sentences and include informative axis labels / titles on visualizations.

Data: LEGO® sets

The data set includes information about LEGO® sets from themes produced between January 1, 2018 and September 11, 2020. The data were originally scraped from Brickset.com, an online LEGO set guide, and were obtained for this assignment from Peterson and Ziegler (2021).

You will work with data on 391 randomly selected LEGO® sets produced during this time period. The primary variables of interest in this analysis are

Pieces: Number of pieces in the set, from brickset.com.
Minifigures: Number of minifigures (LEGO® people) in the set, scraped from brickset.com.
Amazon_Price: Price of the set on Amazon.com (in U.S. dollars).
Size: General size of the interlocking bricks (Large = LEGO Duplo® sets, which include large brick pieces safe for children ages 1 to 5; Small = LEGO® sets, which include the traditional smaller brick pieces created for age groups 5 and older, e.g., City, Friends).

The data are contained in lego-sample.csv.

legos <- read_csv("data/lego-sample.csv")

Analysis goal

We want to fit a multiple linear regression model to predict the price of LEGO® sets on Amazon.com based on Pieces, Size, and Minifigures.

Exercise 6

Instead of using the number of minifigures in the model, you decide to create an indicator variable for whether or not there are any minifigures in the set.

Create an indicator variable that takes the value “No” if there are zero minifigures in the LEGO® set, and “Yes” if there is at least one minifigure.
Fit the regression model using the number of pieces, size of the blocks, and the indicator for minifigures to predict the price on Amazon. Neatly display the results, including the 95% confidence interval for the coefficients, using three digits.
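
The two steps above can be sketched in base R as follows. This is a minimal illustration on simulated stand-in data, not the assignment's data: the data frame `legos_sim`, the indicator name `has_minifigs`, and all values are assumptions; only the column names `Pieces`, `Minifigures`, `Size`, and `Amazon_Price` come from the data description.

```r
# Stand-in data with the assignment's column names (all values are simulated).
set.seed(221)
n <- 100
legos_sim <- data.frame(
  Pieces      = round(runif(n, 20, 1500)),
  Minifigures = rep(0:4, length.out = n),
  Size        = rep(c("Small", "Large"), times = c(80, 20))
)
legos_sim$Amazon_Price <- 10 + 0.09 * legos_sim$Pieces + rnorm(n, sd = 15)

# Indicator: "No" when there are zero minifigures, "Yes" otherwise
legos_sim$has_minifigs <- ifelse(legos_sim$Minifigures > 0, "Yes", "No")

fit <- lm(Amazon_Price ~ Pieces + Size + has_minifigs, data = legos_sim)

# Estimates alongside 95% confidence intervals, rounded to three digits
round(cbind(estimate = coef(fit), confint(fit, level = 0.95)), 3)
```

In a Quarto document the same table could be passed to a table-formatting function for neater display; which one to use depends on the packages loaded for the assignment.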

Exercise 7

We want to understand the relationship between Pieces and Amazon_Price in the model from the previous exercise.

You are convinced from the model output that there is evidence of a linear relationship between the two variables. Now you want to be more specific and test whether the slope is actually different from 0.1 ($10 increase in the price for every 100 additional pieces).

Write the null and alternative hypotheses for this test using words and mathematical notation.
Calculate the test statistic for this test. Show the code used to calculate the test statistic; you may use any relevant output from the model in the previous exercise.
What is the distribution of the test statistic under the null hypothesis for this problem?
Calculate the p-value and state your conclusion in the context of the data using a threshold of $α = 0.05$.
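
The mechanics of testing a slope against a nonzero value can be sketched as below. The model output in `summary()` tests against 0, so the statistic must be recomputed by hand. The data are simulated, and the two-sided $t$ reference distribution with residual degrees of freedom is the usual choice under the model assumptions; treat it here as an assumption you still need to justify in your answer.

```r
# Sketch: test H0: slope = 0.1 using estimate and standard error from lm output.
set.seed(221)
n <- 100
sim <- data.frame(Pieces = round(runif(n, 20, 1500)))   # simulated stand-in data
sim$Amazon_Price <- 10 + 0.09 * sim$Pieces + rnorm(n, sd = 15)

fit <- lm(Amazon_Price ~ Pieces, data = sim)
coefs <- coef(summary(fit))          # matrix of estimates, SEs, t values, p-values

null_value <- 0.1                    # hypothesized slope from the exercise
est <- coefs["Pieces", "Estimate"]
se  <- coefs["Pieces", "Std. Error"]

t_stat  <- (est - null_value) / se
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

c(t_stat = t_stat, df = df.residual(fit), p_value = p_value)
```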

Exercise 8

Interpret the 95% confidence interval for the effect of pieces in the context of the data.
Is the confidence interval consistent with your conclusion from the hypothesis test in the previous exercise? Briefly explain.

Exercise 9

You hypothesize that the relationship between the price and number of pieces may differ based on whether or not there are minifigures in the set.

Make a plot to visualize this potential effect. Does the relationship between price and number of pieces seem to differ based on the inclusion of minifigures? Briefly explain.
Fit a model using the number of pieces, size of the blocks, and presence of minifigures to predict the price on Amazon.com. Fit the model such that the intercept has a meaningful interpretation and that the effect of pieces may differ based on the presence of minifigures.
Interpret the intercept in the context of the data.
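
One common way to give the intercept a meaningful interpretation is to mean-center the quantitative predictor before fitting; the sketch below combines that with an interaction term. It is an illustration on simulated data under that assumption, not the required approach. The names `sim`, `Pieces_cent`, and `has_minifigs` are hypothetical.

```r
# Simulated stand-in data using the assignment's column names (values are made up).
set.seed(221)
n <- 100
sim <- data.frame(
  Pieces       = round(runif(n, 20, 1500)),
  Size         = rep(c("Small", "Large"), times = c(80, 20)),
  has_minifigs = rep(c("No", "Yes"), length.out = n)
)
sim$Amazon_Price <- 10 + 0.09 * sim$Pieces +
  0.03 * sim$Pieces * (sim$has_minifigs == "Yes") + rnorm(n, sd = 15)

# Center Pieces so the intercept refers to a set with the average number of pieces
sim$Pieces_cent <- sim$Pieces - mean(sim$Pieces)

# `*` expands to both main effects plus the Pieces-by-minifigure interaction
fit_int <- lm(Amazon_Price ~ Pieces_cent * has_minifigs + Size, data = sim)
round(coef(fit_int), 3)
```

With character predictors, `lm()` uses the alphabetically first level as the baseline, so here the intercept corresponds to the "Large" and "No" categories at the mean number of pieces.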

Exercise 10

Which model is a better fit for the data: the model in Exercise 6 or the model in Exercise 9? Briefly explain your choice using $R^{2}$ and/or $\text{Adj. } R^{2}$.
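
Both statistics can be read from `summary()` of a fitted `lm` object. The sketch below compares a main-effects model with one that adds an interaction, using generic simulated variables (`x`, `g`, `y` are hypothetical). Because $R^{2}$ cannot decrease when a term is added to a nested model, the adjusted version is usually the more informative comparison.

```r
# Sketch: compare R^2 and adjusted R^2 for two nested models on simulated data.
set.seed(221)
n <- 100
x <- runif(n, 0, 10)
g <- rep(c("No", "Yes"), length.out = n)
y <- 5 + 2 * x + 1.5 * x * (g == "Yes") + rnorm(n, sd = 3)

fit_main <- lm(y ~ x + g)   # main effects only
fit_int  <- lm(y ~ x * g)   # adds the x-by-g interaction

data.frame(
  model    = c("main effects", "with interaction"),
  r_sq     = c(summary(fit_main)$r.squared,     summary(fit_int)$r.squared),
  adj_r_sq = c(summary(fit_main)$adj.r.squared, summary(fit_int)$adj.r.squared)
)
```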

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

If you write your responses to conceptual exercises by hand, you will need to combine your written work with the completed PDF for the applied exercises before submitting on Gradescope.

Instructions to combine PDFs:

Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): helpx.adobe.com/acrobat/using/merging-files-single-pdf.html
- Get free access to Adobe Acrobat as a Duke student: oit.duke.edu/help/articles/kb0030141/

To submit your assignment:

Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.

Grading

Component	Points
Ex 1	4
Ex 2	4
Ex 3	4
Ex 4	4
Ex 5	4
Ex 6	4
Ex 7	8
Ex 8	4
Ex 9	8
Ex 10	3
Workflow & formatting	3

The “Workflow & formatting” grade is to assess the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date updated in the YAML.

References

Montgomery, Douglas C, Elizabeth A Peck, and G Geoffrey Vining. 2021. Introduction to Linear Regression Analysis. John Wiley & Sons.

Peterson, Anna D., and Laura Ziegler. 2021. “Building a Multiple Linear Regression Model With LEGO Brick Data.” Journal of Statistics and Data Science Education 29 (3): 297–303. https://doi.org/10.1080/26939169.2021.1946450.

Footnotes

¹ Exercise 4 is adapted from Montgomery, Peck, and Vining (2021).
