library(tidyverse)
library(tidymodels)
library(knitr)
Lab 06: Maximum likelihood estimation
This lab is due on Tuesday, March 25 at 11:59pm. To be considered on time, the following must be done by the due date:
- Final .qmd and .pdf files pushed to your team’s GitHub repo
- Final .pdf file submitted on Gradescope
Introduction
In this lab you will compute maximum likelihood estimates for regression models that describe the relationship between features of penguins living in the Palmer Archipelago in Antarctica. You will also explore properties of maximum likelihood estimators and how they are related to least-squares estimators.
Learning goals
By the end of the lab you will be able to…
- compute estimates for $\boldsymbol{\beta}$ and $\sigma^2_\epsilon$ using maximum likelihood estimation.
- understand how assumptions of linear regression connect to maximum likelihood estimation.
- describe the similarities and differences between maximum likelihood estimators and least-squares estimators.
- evaluate which estimation procedure may be preferable in a given analysis scenario.
Getting started
A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the sta221-sp25 organization on GitHub. Click on the repo with the prefix lab-06. It contains the starter documents you need to complete the lab.
Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.
Workflow: Using Git and GitHub as a team
There are no Team Member markers in this lab; however, you should use a similar workflow as in Lab 03. Only one person should type in the group’s .qmd file at a time to avoid merge conflicts. Once that person has finished typing the group’s responses, they should render, commit, and push the changes to GitHub. All other teammates can pull to see the updates in RStudio.
Every teammate must have at least one commit in the lab. Everyone is expected to contribute to discussion even when they are not typing.
Packages
You will use the following packages in today’s lab: tidyverse, tidymodels, and knitr. Add other packages as needed.
Data
Today’s dataset includes information about characteristics of three species of penguins living in the Palmer Archipelago in Antarctica. The data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network (Gorman, Williams, and Fraser 2014).
The data are in the file palmer-penguins.csv in the data folder. This dataset is originally from the penguins data frame in the palmerpenguins R package with observations that have missing values removed. This analysis will focus on the following variables:
- bill_depth_mm: bill depth in millimeters
- bill_length_mm: bill length in millimeters
- species: penguin species (Adélie, Chinstrap and Gentoo)
Click here to see the full data dictionary.
Exercises
Goal: The goal of this analysis is to use bill length to understand variability in bill depth, after accounting for species.
Exercise 1
We’ll start with exploratory data analysis focused on the relationship between the response and predictor variables.
- Visualize the relationship between the response variable bill_depth_mm and predictor bill_length_mm.
- Now, visualize the relationship between bill_depth_mm and bill_length_mm by species. Use geom_smooth(method = "lm", se = FALSE) to add lines and more clearly visualize the relationship for each species.
- Based on these visualizations, why is it important to include species in the model of the relationship between bill depth and length? Briefly explain.
- Based on these visualizations, would you include an interaction term between the two predictors? Briefly explain.
Exercise 2
We will fit the main effects model using bill length and species to understand variability in the bill depth.
- Write the form of the statistical (population-level) model in matrix form.
- Write the dimensions for $\mathbf{y}$, $\mathbf{X}$, $\boldsymbol{\beta}$, and $\boldsymbol{\epsilon}$ specific for this problem.
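When working out dimensions, it can help to see how R expands a categorical predictor into indicator columns. The sketch below uses a small made-up data frame (the names `toy`, `x`, and `group` are hypothetical and are not the lab data) and base R's `model.matrix()` to build the design matrix for a main effects model.

```r
# Toy data (hypothetical, not the penguins data):
# one numeric predictor and a three-level factor
toy <- data.frame(
  x = c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3),
  group = factor(c("a", "b", "c", "a", "b", "c"))
)

# Design matrix for a main effects model: an intercept column, x,
# and one indicator column for each non-baseline level of group
X <- model.matrix(~ x + group, data = toy)

dim(X)       # number of rows (n) and number of columns (p)
colnames(X)  # which column corresponds to which term
```

The number of columns counts the intercept, the quantitative predictor, and one indicator for each level of the factor other than the baseline.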
Exercise 3
Consider the regression model described in Exercise 2.
- Write the likelihood function $L(\boldsymbol{\beta}, \sigma^2_\epsilon \mid \mathbf{X}, \mathbf{y})$ in matrix form.
- Describe how each of the four model assumptions is necessary for the form of the likelihood function.
Exercise 4
Briefly explain how the process of finding the maximum likelihood estimators for the likelihood function in Exercise 3 is related to the process of finding the least-squares estimators for the model in Exercise 2.
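As a point of reference, here is a general sketch of the log-likelihood for a linear model, assuming independent normal errors with mean zero and constant variance. The notation ($\mathbf{y}$, $\mathbf{X}$, $\boldsymbol{\beta}$, $\sigma^2_\epsilon$, $n$) is an assumption and may not match the notation used in lecture.

```latex
\log L(\boldsymbol{\beta}, \sigma^2_\epsilon \mid \mathbf{X}, \mathbf{y})
  = -\frac{n}{2}\log\left(2\pi\sigma^2_\epsilon\right)
    - \frac{1}{2\sigma^2_\epsilon}
      (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\mathsf{T}
      (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})
```

Only the second term depends on $\boldsymbol{\beta}$, and it is a negative multiple of the sum of squared residuals, which is a useful starting point for thinking about how the two procedures are connected.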
Exercise 5
For the next few exercises, we will compare the results of the maximum likelihood and least-squares procedures.
Fit the least-squares regression model described in Exercise 2. Neatly display the results using three digits.
Describe the estimated effect of bill length on bill depth in the context of the data.
Describe the estimated effect of species on bill depth in the context of the data. Include discussion about whether there is statistical evidence of a difference between species.
Exercise 6
- Use matrix/vector operations to compute the maximum likelihood estimators of $\boldsymbol{\beta}$ for the model in Exercise 2.
- How do these estimators compare to the least-squares estimators in the previous exercise?
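The matrix/vector operations involved can be sketched on simulated data. The example below is a minimal illustration, not a solution for the penguins data: the data frame `toy`, its columns, and the true coefficient values are all made up. It computes the closed-form estimate $(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y}$ and checks it against `lm()`.

```r
set.seed(221)

# Simulated toy data (hypothetical): numeric predictor plus a two-level factor
n <- 50
toy <- data.frame(
  x = runif(n, 30, 60),
  group = factor(rep(c("a", "b"), each = 25))
)
toy$y <- 10 + 0.2 * toy$x + 2 * (toy$group == "b") + rnorm(n, sd = 1)

# Design matrix and response vector
X <- model.matrix(~ x + group, data = toy)
y <- toy$y

# Closed-form coefficient estimates: (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Compare with the coefficients from lm()
fit <- lm(y ~ x + group, data = toy)
all.equal(as.numeric(beta_hat), unname(coef(fit)))
```

`solve(t(X) %*% X)` is used here because it mirrors the formula directly; for numerical work, `solve(crossprod(X), crossprod(X, y))` avoids forming the inverse explicitly.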
Exercise 7
The maximum likelihood estimation procedure also produces an estimator for the variance about the regression line, $\sigma^2_\epsilon$. We know that the maximum likelihood estimator and least-squares estimator for $\sigma^2_\epsilon$ are not the same.
Using the data and regression estimates for this analysis, compute both the maximum likelihood and least-squares estimators for $\sigma^2_\epsilon$.
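The two variance estimators differ only in the divisor applied to the sum of squared residuals: $n$ for maximum likelihood and $n - p$ for least squares, where $p$ is the number of estimated coefficients. The sketch below illustrates this on simulated data; the variable names and the simulated model are hypothetical and are not part of the lab.

```r
set.seed(221)

# Simulated toy data (hypothetical): simple linear regression
n <- 40
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)
ssr <- sum(resid(fit)^2)   # sum of squared residuals
p <- length(coef(fit))     # number of estimated coefficients

sigma2_mle <- ssr / n        # maximum likelihood estimator
sigma2_ls  <- ssr / (n - p)  # least-squares (unbiased) estimator

c(mle = sigma2_mle, ls = sigma2_ls)

# sigma() returns the residual standard error from lm(),
# whose square is the least-squares variance estimate
all.equal(sigma2_ls, sigma(fit)^2)
```

Because both estimators share the same numerator, the maximum likelihood estimate is always the smaller of the two, and the gap shrinks as $n$ grows relative to $p$.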
Exercise 8
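Now we will look into the last property of the maximum likelihood estimator for $\boldsymbol{\beta}$: asymptotic normality. In words, this property says that, when the number of samples is large, the distribution of the estimator is approximately normal. We will use this property to construct a confidence interval for $\beta_j$, the coefficient of speciesChinstrap.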
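- Use the maximum likelihood estimator to compute the approximate confidence interval for $\beta_j$, the coefficient of speciesChinstrap. The approximate 95% confidence interval may be computed as the estimate plus or minus $z^*$ standard errors, where $z^*$ is the critical value from the standard normal distribution.
- Then interpret this interval in the context of the data.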
Exercise 9
- Compute the exact (based on the $t$-distribution) confidence interval for $\beta_j$, the coefficient of speciesChinstrap.
- Compare the center and width of this exact interval with the one you computed in Exercise 8. Do they differ? By how much? Which one is wider, indicating more uncertainty?
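The comparison between the two kinds of interval can be sketched on simulated data. This example assumes the approximate interval uses a standard normal critical value with the maximum likelihood variance estimate, and the exact interval uses a $t$ critical value with the least-squares variance estimate; the formula given in the lab may differ in detail. All names and simulated values are hypothetical.

```r
set.seed(221)

# Simulated toy data (hypothetical): simple linear regression
n <- 40
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 2)

X <- cbind(1, x)             # design matrix: intercept and x
fit <- lm(y ~ x)
b <- coef(fit)[["x"]]        # slope estimate (same under both procedures)
ssr <- sum(resid(fit)^2)
p <- ncol(X)
C <- solve(t(X) %*% X)       # (X'X)^{-1}

# Standard errors of the slope under each variance estimator
se_mle <- sqrt((ssr / n) * C[2, 2])
se_ls  <- sqrt((ssr / (n - p)) * C[2, 2])

# Approximate (normal-based) and exact (t-based) 95% intervals
ci_approx <- b + c(-1, 1) * qnorm(0.975) * se_mle
ci_exact  <- b + c(-1, 1) * qt(0.975, df = n - p) * se_ls

rbind(approx = ci_approx, exact = ci_exact)

# The exact interval matches confint() applied to the lm() fit
all.equal(ci_exact, unname(confint(fit)["x", ]))
```

Under these assumptions both intervals share the same center, and the exact interval is wider because it uses both a larger critical value and a larger variance estimate.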
Exercise 10
To wrap up, we have seen that both the OLS and the maximum likelihood procedures for linear regression produce the same coefficient estimates, but lead to different estimators for the variance $\sigma^2_\epsilon$.
Based on the work in this lab, do you think performing inference based on either method would have changed your conclusion about the relationship between bill depth and Chinstrap species?
Submission
You will submit the PDF documents for labs, homework, and exams to Gradescope as part of your final submission.
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
To submit your assignment:
Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Select all team members’ names, so they receive credit on the assignment. Click here for a video on adding team members to an assignment on Gradescope.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.
Grading
Component | Points |
---|---|
Ex 1 | 8 |
Ex 2 | 4 |
Ex 3 | 6 |
Ex 4 | 3 |
Ex 5 | 5 |
Ex 6 | 5 |
Ex 7 | 4 |
Ex 8 | 5 |
Ex 9 | 4 |
Ex 10 | 2 |
Workflow & formatting | 4 |
The “Workflow & formatting” grade is to assess the reproducible workflow and collaboration. This includes having at least one meaningful commit from each team member, a neatly organized document with readable code, and updating the team name and date in the YAML.