Simple linear regression

Prof. Maria Tackett

Jan 14, 2025

Announcements

  • Complete Lab 00

  • Office hours start this week

    • Alan’s office hours start January 27
  • Introduction to R workshops at Duke library

    • Data wrangling with dplyr - Thu, Jan 16 at 12pm

    • Data visualization with ggplot2 - Thu, Jan 23 at 12pm

Questions from last class?

Topics

  • How regression is used to understand the relationship between multiple variables
  • Least squares estimation for the slope and intercept
  • Interpret the slope and intercept
  • Predict the response given a value of the predictor

Computing set up

# load packages
library(tidyverse)        # for data wrangling
library(broom)            # for formatting regression output
library(fivethirtyeight)  # for the fandango dataset
library(knitr)            # for formatting tables
library(patchwork)        # for arranging graphs

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))

# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "80%"
)

Source: R for Data Science with additions from The Art of Statistics: How to Learn from Data.

Source:R for Data Science

Data

Movie scores

  • Data behind the FiveThirtyEight story Be Suspicious Of Online Movie Ratings, Especially Fandango’s
  • In the fivethirtyeight package: fandango
  • Contains every film released in 2014 and 2015 that has at least 30 fan reviews on Fandango, an IMDb score, Rotten Tomatoes critic and user ratings, and Metacritic critic and user scores

Fandango logo

IMDB logo

Rotten Tomatoes logo

Metacritic logo

Data prep

  • Rename Rotten Tomatoes columns as critics and audience
  • Rename the data set as movie_scores
movie_scores <- fandango |>
  rename(critics = rottentomatoes, 
         audience = rottentomatoes_user)

Data overview

glimpse(movie_scores)
Rows: 146
Columns: 23
$ film                       <chr> "Avengers: Age of Ultron", "Cinderella", "A…
$ year                       <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
$ critics                    <int> 74, 85, 80, 18, 14, 63, 42, 86, 99, 89, 84,…
$ audience                   <int> 86, 80, 90, 84, 28, 62, 53, 64, 82, 87, 77,…
$ metacritic                 <int> 66, 67, 64, 22, 29, 50, 53, 81, 81, 80, 71,…
$ metacritic_user            <dbl> 7.1, 7.5, 8.1, 4.7, 3.4, 6.8, 7.6, 6.8, 8.8…
$ imdb                       <dbl> 7.8, 7.1, 7.8, 5.4, 5.1, 7.2, 6.9, 6.5, 7.4…
$ fandango_stars             <dbl> 5.0, 5.0, 5.0, 5.0, 3.5, 4.5, 4.0, 4.0, 4.5…
$ fandango_ratingvalue       <dbl> 4.5, 4.5, 4.5, 4.5, 3.0, 4.0, 3.5, 3.5, 4.0…
$ rt_norm                    <dbl> 3.70, 4.25, 4.00, 0.90, 0.70, 3.15, 2.10, 4…
$ rt_user_norm               <dbl> 4.30, 4.00, 4.50, 4.20, 1.40, 3.10, 2.65, 3…
$ metacritic_norm            <dbl> 3.30, 3.35, 3.20, 1.10, 1.45, 2.50, 2.65, 4…
$ metacritic_user_nom        <dbl> 3.55, 3.75, 4.05, 2.35, 1.70, 3.40, 3.80, 3…
$ imdb_norm                  <dbl> 3.90, 3.55, 3.90, 2.70, 2.55, 3.60, 3.45, 3…
$ rt_norm_round              <dbl> 3.5, 4.5, 4.0, 1.0, 0.5, 3.0, 2.0, 4.5, 5.0…
$ rt_user_norm_round         <dbl> 4.5, 4.0, 4.5, 4.0, 1.5, 3.0, 2.5, 3.0, 4.0…
$ metacritic_norm_round      <dbl> 3.5, 3.5, 3.0, 1.0, 1.5, 2.5, 2.5, 4.0, 4.0…
$ metacritic_user_norm_round <dbl> 3.5, 4.0, 4.0, 2.5, 1.5, 3.5, 4.0, 3.5, 4.5…
$ imdb_norm_round            <dbl> 4.0, 3.5, 4.0, 2.5, 2.5, 3.5, 3.5, 3.5, 3.5…
$ metacritic_user_vote_count <int> 1330, 249, 627, 31, 88, 34, 17, 124, 62, 54…
$ imdb_user_vote_count       <int> 271107, 65709, 103660, 3136, 19560, 39373, …
$ fandango_votes             <int> 14846, 12640, 12055, 1793, 1021, 397, 252, …
$ fandango_difference        <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5…

Univariate exploratory data analysis (EDA)

The data set contains the “Tomatometer” score (critics) and audience score (audience) for 146 movies rated on rottentomatoes.com.

Bivariate EDA

Bivariate EDA

Goal: Fit a line to describe the relationship between the critics score and audience score.

Why fit a line?

We fit a line to accomplish one or both of the following:


Prediction

What is an example of a prediction question for this data set?


Inference

What is an example of an inference question for this data set?

Terminology

  • Response, \(Y\): variable describing the outcome of interest

  • Predictor, \(X\): variable we use to help understand the variability in the response

Regression model

A regression model is a function that describes the relationship between the response, \(Y\), and the predictor, \(X\).

\[\begin{aligned} Y &= \color{black}{\textbf{Model}} + \text{Error} \\[8pt] &= \color{black}{f(X)} + \epsilon \\[8pt] & = \color{black}{E(Y|X)} + \epsilon \\[8pt] &= \color{black}{\mu_{Y|X}} + \epsilon \end{aligned}\]

Regression model

\[\begin{aligned} Y &= \color{purple}{\textbf{Model}} + \color{black}\text{Error} \\[8pt] &= \color{purple}{f(X)} + \color{black}\epsilon \\[8pt] &= \color{purple}{E(Y|X)} + \color{black}\epsilon \\[8pt] &= \color{purple}{\mu_{Y|X}} + \color{black}\epsilon \end{aligned}\]

\(E(Y|X) = \mu_{Y|X}\), the mean value of \(Y\) given a particular value of \(X\).

Regression model

\[ \begin{aligned} Y &= \color{purple}{\textbf{Model}} + \color{blue}{\textbf{Error}} \\[8pt] &= \color{purple}{f(X)} + \color{blue}{\epsilon}\\[8pt] &= \color{purple}{E(Y|X)} + \color{blue}{\epsilon}\\[8pt] &= \color{purple}{\mu_{Y|X}} + \color{blue}{\epsilon} \\ \end{aligned} \]

Determine \(f(X)\)

  • Goal: Determine \(f(X)\)

  • How do we determine \(f(X)\)

    • Make an assumption about the functional form \(f(X)\) (parametric model)

    • Use the data to fit a model based on that form

Simple linear regression (SLR)

SLR: Statistical model (population)

When we have a quantitative response, \(Y\), and a single quantitative predictor, \(X\), we can use a (simple) linear regression model to describe the relationship between \(Y\) and \(X\). \[Y = \beta_0 + \beta_1X + \epsilon\]

  • \(\beta_1\): Population (true) slope of the relationship between \(X\) and \(Y\)
  • \(\beta_0\): Population (true) intercept of the relationship between \(X\) and \(Y\)
  • \(\epsilon\): Error terms with mean 0 and variance \(\sigma^2_{\epsilon}\)

SLR: Regression equation (sample)

\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\]

  • \(\hat{\beta}_1\): Estimated (sample) slope of the relationship between \(X\) and \(Y\)
  • \(\hat{\beta}_0\): Estimated (sample) intercept of the relationship between \(X\) and \(Y\)
  • No error term!

Why is there no error term in the estimated regression equation?

Estimating \(\hat{\beta}_1\) and \(\hat{\beta}_0\)

Residuals

\[\text{residual} = \text{observed} - \text{predicted} = y_i - \hat{y}_i\]

Least squares line

  • The residual for the \(i^{th}\) observation is

\[e_i = \text{observed} - \text{predicted} = y_i - \hat{y}_i\]

  • The sum of squared residuals is

\[e^2_1 + e^2_2 + \dots + e^2_n\]

  • The Ordinary Least Squares (OLS) line is the one that minimizes the sum of squared residuals

Least-squares estimate of \(\hat{\beta}_0\)

Click here for full details on estimating \(\hat{\beta}_0\) and \(\hat{\beta}_1\) .

Slope and intercept

Properties of least squares regression

  • The regression line goes through the center of mass point, the coordinates corresponding to mean \(X\) and mean \(Y\): \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\)

  • The slope has the same sign as the correlation coefficient: \(\hat{\beta}_1 = r \frac{s_Y}{s_X}\)

  • The sum of the residuals is approximately zero: \(\sum_{i = 1}^n e_i \approx 0\)

  • The residuals and \(X\) values are uncorrelated

Estimating the slope

\[\large{\hat{\beta}_1 = r \frac{s_Y}{s_X}}\]


\[ \begin{aligned} s_X = 30.1688 \hspace{15mm} &s_Y = 20.0244 \hspace{15mm} r = 0.7814 \\[10pt]\hat{\beta}_1 &= 0.7814 \times \frac{20.0244}{30.1688} \\&= \mathbf{0.5187}\end{aligned} \]

Estimating the intercept

\[\large{\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}}\]


\[ \begin{aligned}\bar{x} = 60.8493 & \hspace{15mm} \bar{y} = 63.8767 \hspace{15mm} \hat{\beta}_1 = 0.5187 \\[10pt] \hat{\beta}_0 &= 63.8767 - 0.5187 \times 60.8493 \\ &= \mathbf{32.3142}\end{aligned} \]

Interpretation

  • Question
  • Submit

Submit your answers to the following questions on Ed Discussion:

  • The slope of the model for predicting audience score from critics score is 0.5187 . Which of the following is the best interpretation of this value?

  • 32.3142 is the predicted mean audience score for what type of movies?

🔗 https://edstem.org/us/courses/70811/discussion/5967675

Does it make sense to interpret the intercept?

✅ The intercept is meaningful in the context of the data if

  • the predictor can feasibly take values equal to or near zero, or

  • there are values near zero in the observed data.

🛑 Otherwise, the intercept may not be meaningful!

Prediction

Making a prediction

Suppose that a movie has a critics score of 70. According to this model, what is the movie’s predicted audience score?

\[\begin{aligned} \widehat{\text{audience}} &= 32.3142 + 0.5187 \times \text{critics} \\ &= 32.3142 + 0.5187 \times 70 \\ &= \mathbf{68.6232} \end{aligned}\]

Linear regression in R

Fit the model

Use the lm() function to fit a linear regression model


movie_fit <- lm(audience ~ critics, data = movie_scores)
movie_fit

Call:
lm(formula = audience ~ critics, data = movie_scores)

Coefficients:
(Intercept)      critics  
    32.3155       0.5187  

Tidy results

Use the tidy() function from the broom R package to “tidy” the model output


movie_fit <- lm(audience ~ critics, data = movie_scores)
tidy(movie_fit)movie_fit <- lm(audience ~ critics, data = movie_scores)
tidy(movie_fit)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   32.3      2.34        13.8 4.03e-28
2 critics        0.519    0.0345      15.0 2.70e-31

Format results

Use the kable() function from the knitr package to neatly format the results


movie_fit <- lm(audience ~ critics, data = movie_scores)
tidy(movie_fit) |>
  kable(digits = 3)movie_fit <- lm(audience ~ critics, data = movie_scores)
tidy(movie_fit) |>
  kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) 32.316 2.343 13.795 0
critics 0.519 0.035 15.028 0

Prediction

Use the predict() function to calculate predictions for new observations


Single observation

new_movie <- tibble(critics = 70)
predict(movie_fit, new_movie)
       1 
68.62297 

Prediction

Use the predict() function to calculate predictions for new observations


Multiple observations

more_new_movies <- tibble(critics = c(24,70, 85))
predict(movie_fit, more_new_movies)
       1        2        3 
44.76379 68.62297 76.40313 

Recap

  • Described how regression is used to understand the relationship between multiple variables
  • Used least squares to estimate the slope and intercept
  • Interpreted the slope and intercept for simple linear regression
  • Predicted the response given a value of the predictor

Next time

  • Model assessment for simple linear regression

    • Complete Lec 03: Model Assessment prepare
  • Bring fully-charged laptop or device with keyboard for in-class application exercise (AE)

🔗 STA 221 - Spring 2025

1 / 43
Simple linear regression Prof. Maria Tackett Jan 14, 2025

  1. Slides

  2. Tools

  3. Close
  • Simple linear regression
  • Announcements
  • Questions from last class?
  • Topics
  • Computing set up
  • Source: R for Data...
  • Source:R for Data...
  • Data
  • Movie scores
  • Data prep
  • Data overview
  • Univariate exploratory data analysis (EDA)
  • Bivariate EDA
  • Bivariate EDA
  • Why fit a line?
  • Terminology
  • Regression model
  • Regression model
  • Regression model
  • Determine \(f(X)\)
  • Simple linear regression (SLR)
  • SLR: Statistical model (population)
  • SLR: Regression equation (sample)
  • Estimating \(\hat{\beta}_1\) and \(\hat{\beta}_0\)
  • Residuals
  • Least squares line
  • Least-squares estimate of \(\hat{\beta}_0\)
  • Slope and intercept
  • Properties of least squares regression
  • Estimating the slope
  • Estimating the intercept
  • Interpretation
  • Does it make sense to interpret the intercept?
  • Prediction
  • Making a prediction
  • Linear regression in R
  • Fit the model
  • Tidy results
  • Format results
  • Prediction
  • Prediction
  • Recap
  • Next time
  • f Fullscreen
  • s Speaker View
  • o Slide Overview
  • e PDF Export Mode
  • r Scroll View Mode
  • b Toggle Chalkboard
  • c Toggle Notes Canvas
  • d Download Drawings
  • ? Keyboard Help