Jan 16, 2025
Data wrangling with dplyr - Thu, Jan 16 at 12pm
Data visualization with ggplot2 - Thu, Jan 23 at 12pm
Suppose that a movie has a critics score of 70. According to this model, what is the movie’s predicted audience score?
Caution
Using the model to predict for values outside the range of the original data is extrapolation. Why do we want to avoid extrapolation?
Use the lm()
function to fit a linear regression model
Use the tidy()
function from the broom R package to “tidy” the model output
movie_fit <- lm(audience ~ critics, data = movie_scores) tidy(movie_fit)
movie_fit <- lm(audience ~ critics, data = movie_scores) tidy(movie_fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 32.3 2.34 13.8 4.03e-28
2 critics 0.519 0.0345 15.0 2.70e-31
Use the kable()
function from the knitr package to neatly format the results
Use the predict()
function to calculate predictions for new observations
Single observation
Use the predict()
function to calculate predictions for new observations
Multiple observations
The data set comes from Zarulli et al. (2021) who analyze the effects of a country’s healthcare expenditures and other factors on the country’s life expectancy. The data are originally from the Human Development Database and World Health Organization.
There are 140 countries (observations) in the data set.
Goal: Use the income inequality in a country to understand variability in the life expectancy.
life_exp
: The average number of years that a newborn could expect to live, if he or she were to pass through life exposed to the sex- and age-specific death rates prevailing at the time of his or her birth, for a specific year, in a given country, territory, or geographic income_inequality. ( from the World Health Organization)
income_inequality
: Measure of the deviation of the distribution of income among individuals or households within a country from a perfectly equal distribution. A value of 0 represents absolute equality, a value of 100 absolute inequality (based on Gini coefficient). (from Zarulli et al. (2021))
📋 sta221-sp25.netlify.app/ae/ae-01-model-assessment.html
Complete Part 1.
Go to the course organization. Click on the repo with the prefix ae-01
. It contains the starter documents you need to complete the AE.
ae-01
repo, use this link to create one: https://classroom.github.com/a/6jpkfA8nClick on the green CODE button, select Use SSH (this might already be selected by default, and if it is, you’ll see the text Clone with SSH). Click on the clipboard icon to copy the repo URL.
In RStudio, go to File → New Project → Version Control → Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
Click ae-01.qmd
to open the template Quarto file. This is where you will write up your code and narrative for the AE.
We fit a model but is it any good?
Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)
R-squared,
What indicates a good model fit? Higher or lower RMSE? Higher or lower
Ranges between 0 (perfect predictor) and infinity (terrible predictor)
Same units as the response variable
The value of RMSE is more useful for comparing across models than evaluating a single model
Analysis of Variance (ANOVA): Technique to partition variability in
Min | Median | Max | Mean | Std.Dev |
---|---|---|---|---|
51.6 | 72.85 | 84.1 | 71.614 | 8.075 |
life_exp
The coefficient of determination
What is the range of
Submit your response to the following question on Ed Discussion.
The
Use the augment()
function from the broom package to add columns for predicted values, residuals, and other observation-level model statistics
# A tibble: 140 × 8
life_exp income_inequality .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 63.8 28.2 65.8 -1.96 0.0125 4.45 0.00125 -0.444
2 78.2 12.2 76.9 1.29 0.0116 4.45 0.000498 0.292
3 59.9 32.4 62.8 -2.93 0.0193 4.44 0.00437 -0.667
4 76.2 14 75.7 0.542 0.00972 4.45 0.0000739 0.123
5 74.6 8.6 79.4 -4.82 0.0168 4.43 0.0102 -1.10
6 83 8.3 79.6 3.37 0.0173 4.44 0.00515 0.766
7 81.3 7.4 80.3 1.04 0.0189 4.45 0.000540 0.237
8 72.5 10.1 78.4 -5.88 0.0144 4.42 0.0130 -1.33
9 71.8 27.6 66.2 5.62 0.0118 4.43 0.00972 1.28
10 74 6.5 80.9 -6.89 0.0207 4.41 0.0260 -1.57
# ℹ 130 more rows
Use the rmse()
function from the yardstick package (part of tidymodels)
Use the rsq()
function from the yardstick package (part of tidymodels)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.700
📋 sta221-sp25.netlify.app/ae/ae-01-model-assessment.html
Complete Part 2.
Used R to conduct exploratory data analysis and fit a model
Evaluated models using RMSE and
Used analysis of variance to partition variability in the response variable