Inference for regression

Prof. Maria Tackett

Feb 04, 2025

Announcements

  • Lab 03 due TODAY at 11:59pm

  • Click here to learn more about the Academic Resource Center

  • Statistics experience due Tuesday, April 22

Poll: Office hours availability

🔗 https://forms.office.com/r/DL8rBQ988y

Topics

  • Understand statistical inference in the context of regression

  • Describe the assumptions for regression

  • Understand connection between distribution of residuals and inferential procedures

  • Conduct inference on a single coefficient

Computing setup

# load packages
library(tidyverse)  
library(tidymodels)  
library(knitr)       
library(kableExtra)  
library(patchwork)   

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Data: NCAA Football expenditures

Today’s data come from Equity in Athletics Data Analysis and include information about sports expenditures and revenues for colleges and universities in the United States. This data set was featured in a March 2022 Tidy Tuesday.

We will focus on the 2019 - 2020 season expenditures on football for institutions in the NCAA Division I FBS. The variables are:

  • total_exp_m: Total expenditures on football in the 2019 - 2020 academic year (in millions USD)

  • enrollment_th: Total student enrollment in the 2019 - 2020 academic year (in thousands)

  • type: institution type (Public or Private)

football <- read_csv("data/ncaa-football-exp.csv")

Univariate EDA

Bivariate EDA

Regression model

exp_fit <- lm(total_exp_m ~ enrollment_th + type, data = football)
tidy(exp_fit) |>
  kable(digits = 3)
term            estimate  std.error  statistic  p.value
(Intercept)       19.332      2.984      6.478        0
enrollment_th      0.780      0.110      7.074        0
typePublic       -13.226      3.153     -4.195        0


For every additional 1,000 students, we expect an institution’s total expenditures on football to increase by $780,000, on average, holding institution type constant.
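As a quick arithmetic check of this interpretation, the sketch below plugs the rounded estimates from the output above into the fitted equation. The enrollment values (20 and 21 thousand students) and the helper `predict_exp` are hypothetical, chosen only for illustration.

```r
# Rounded coefficient estimates copied from the regression output above
b0 <- 19.332         # intercept
b_enroll <- 0.780    # slope for enrollment_th (millions USD per 1,000 students)
b_public <- -13.226  # shift for public institutions (typePublic)

# Hypothetical helper: expected football expenditures in millions USD
predict_exp <- function(enrollment_th, public) {
  b0 + b_enroll * enrollment_th + b_public * as.numeric(public)
}

# Two hypothetical public institutions that differ by 1,000 students
exp_20 <- predict_exp(20, public = TRUE)
exp_21 <- predict_exp(21, public = TRUE)

# Difference in expected expenditures, converted from millions to dollars
diff_usd <- (exp_21 - exp_20) * 1e6
print(c(exp_20 = exp_20, exp_21 = exp_21, diff_usd = diff_usd))
```

Because institution type is the same for both, the intercept and the `typePublic` term cancel in the difference, leaving only the enrollment slope.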

From sample to population

For every additional 1,000 students, we expect an institution’s total expenditures on football to increase by $780,000, on average, holding institution type constant.

  • This estimate is valid for the single sample of 127 higher education institutions in the 2019 - 2020 academic year.
  • But what if we’re not interested in quantifying the relationship between student enrollment, institution type, and football expenditures for this single sample?
  • What if we want to say something about the relationship between these variables for all colleges and universities with football programs and across different years?

Inference for regression

Statistical inference

  • Statistical inference provides methods and tools so we can use the single observed sample to make valid statements (inferences) about the population it comes from

  • For our inferences to be valid, the sample should be representative of the population we’re interested in (ideally, a random sample)

Image source: Eugene Morgan © Penn State

Inference for linear regression

  • Inference based on ANOVA

    • Hypothesis test for the statistical significance of the overall regression model

    • Hypothesis test for a subset of coefficients

  • Inference for a single coefficient βj (today’s focus)

    • Hypothesis test for a coefficient βj

    • Confidence interval for a coefficient βj
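As a preview of what these two procedures look like in R, here is a minimal sketch on simulated data (not the football data). The sample size, true coefficients, and error standard deviation are assumptions made up for the example; the test and interval use the standard t-based formulas that `summary()` and `confint()` report for `lm` fits.

```r
set.seed(221)

# Simulated stand-in data (not the football data): 100 observations
n <- 100
x <- runif(n, min = 1, max = 50)
y <- 5 + 0.8 * x + rnorm(n, mean = 0, sd = 4)

fit <- lm(y ~ x)
coef_table <- summary(fit)$coefficients  # estimate, std. error, t value, p-value

# Test statistic for H0: beta_1 = 0 is estimate / standard error
est <- coef_table["x", "Estimate"]
se <- coef_table["x", "Std. Error"]
t_stat <- est / se

# Two-sided p-value from a t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# 95% confidence interval: estimate +/- critical value * standard error
ci <- est + c(-1, 1) * qt(0.975, df = n - 2) * se

print(c(t_stat = t_stat, p_value = p_value))
print(ci)
print(confint(fit, "x", level = 0.95))  # built-in interval for comparison
```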

Linear regression model

$$y = \text{Model} + \text{Error} = f(X) + \epsilon = E(y|X) + \epsilon = X\beta + \epsilon$$

  • We have discussed multiple ways to find the least squares estimates of $\beta = \begin{bmatrix}\beta_0 \\ \beta_1\end{bmatrix}$

    • None of these approaches depend on the distribution of ϵ
  • Now we will use statistical inference to draw conclusions about β that depend on particular assumptions about the distribution of ϵ
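To illustrate that the least squares estimates themselves need no distributional assumption, the sketch below fits a line to simulated data whose errors are deliberately skewed (not normal). The data-generating values are hypothetical. The estimates come from the normal equations and match `lm()`.

```r
set.seed(1)

# Simulated data with deliberately non-normal (skewed) errors
n <- 50
x <- runif(n, 0, 10)
eps <- rexp(n, rate = 1) - 1   # mean-zero but right-skewed errors
y <- 2 + 3 * x + eps

# Design matrix with a column of 1s for the intercept
X <- cbind(1, x)

# Least squares estimate from the normal equations: (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() gives the same estimates; no error distribution was specified anywhere
beta_lm <- coef(lm(y ~ x))

print(cbind(normal_equations = as.numeric(beta_hat), lm = as.numeric(beta_lm)))
```

The distributional assumptions only start to matter once we attach standard errors, tests, and intervals to these estimates.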

Linear regression model

$$y = X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2_\epsilon I)$$

such that the errors are independent and normally distributed.

  • Independent: Knowing the error term for one observation doesn’t tell us about the error term for another observation
  • Normally distributed: The distribution follows a particular mathematical model that is unimodal and symmetric

Describing random phenomena

  • There is some uncertainty in the error terms (and thus the response variable), so we use mathematical models to describe that uncertainty.

  • Some terminology:

    • Sample space: Set of all possible outcomes

    • Random variable: Function (mapping) from the sample space to the real numbers

    • Event: Subset of the sample space, i.e., a set of possible outcomes (possible values the random variable can take)

    • Probability density function: Mathematical function describing the relative likelihood of the values of a continuous random variable; the probability of an event is the area under this function over the event
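A small sketch of how a density turns into a probability, using a standard normal random variable and the (arbitrarily chosen) event that it falls between -1 and 1. The probability is the area under the density over that event.

```r
# Event: a standard normal random variable lands between -1 and 1
lower <- -1
upper <- 1

# Probability as the area under the density curve over the event
area <- integrate(dnorm, lower = lower, upper = upper)$value

# Same probability from the cumulative distribution function
prob <- pnorm(upper) - pnorm(lower)

print(c(area = area, prob = prob))
```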

Distribution of error terms

The error terms follow a (multivariate) normal distribution with mean 0 and variance $\sigma^2_\epsilon I$

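$$f(\epsilon) = \frac{1}{(2\pi)^{n/2}|\sigma^2_\epsilon I|^{1/2}}\exp\left\{-\frac{1}{2}(\epsilon - 0)^T(\sigma^2_\epsilon I)^{-1}(\epsilon - 0)\right\}$$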
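The density above can be evaluated directly. The sketch below does so for a small simulated error vector and compares the result with the product of univariate normal densities, which should agree because the covariance matrix is diagonal (independent errors). The values of `n` and `sigma` are hypothetical.

```r
set.seed(2)

n <- 5
sigma <- 2                      # hypothetical error standard deviation
eps <- rnorm(n, mean = 0, sd = sigma)

Sigma <- sigma^2 * diag(n)      # covariance matrix sigma^2 * I

# Multivariate normal density written out term by term
quad_form <- as.numeric(t(eps) %*% solve(Sigma) %*% eps)
f_joint <- exp(-0.5 * quad_form) / ((2 * pi)^(n / 2) * sqrt(det(Sigma)))

# With independent errors, the joint density is a product of univariate densities
f_product <- prod(dnorm(eps, mean = 0, sd = sigma))

print(c(f_joint = f_joint, f_product = f_product))
```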

Visualizing distribution of y|X

$$y|X \sim N(X\beta, \sigma^2_\epsilon I)$$

Image source: Introduction to the Practice of Statistics (5th ed)

Expected value

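Let $z = \begin{bmatrix}z_1 \\ \vdots \\ z_p\end{bmatrix}$ be a $p \times 1$ vector of random variables.

Then

$$E(z) = E\begin{bmatrix}z_1 \\ \vdots \\ z_p\end{bmatrix} = \begin{bmatrix}E(z_1) \\ \vdots \\ E(z_p)\end{bmatrix}$$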

Expected value

Let $A$ be an $n \times p$ matrix of constants, $C$ an $n \times 1$ vector of constants, and $z$ a $p \times 1$ vector of random variables. Then

$$E(Az) = AE(z)$$

$$E(Az + C) = E(Az) + E(C) = AE(z) + C$$
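The same linearity holds for sample averages, which gives a simple numerical check. In the sketch below, the matrix `A`, the vector `C`, and the distribution of the draws are all hypothetical. Averaging the transformed draws gives the same answer as transforming the average.

```r
set.seed(3)

n_draws <- 1000

# Each row of Z is one draw of a 3 x 1 random vector z (hypothetical means 1, 2, 3)
Z <- cbind(rnorm(n_draws, 1), rnorm(n_draws, 2), rnorm(n_draws, 3))

A <- matrix(c(1, 0,  2,
              0, 1, -1), nrow = 2, byrow = TRUE)  # 2 x 3 matrix of constants
C <- c(10, -5)                                    # 2 x 1 vector of constants

# Transform every draw: row i becomes (A z_i + C)^T
W <- sweep(Z %*% t(A), 2, C, "+")

# Average of the transformed draws vs. transforming the average
mean_of_transformed <- colMeans(W)
transformed_mean <- as.numeric(A %*% colMeans(Z) + C)

print(rbind(mean_of_transformed, transformed_mean))
```

Both rows should also be close to the theoretical value of A E(z) + C for these made-up means, which is (17, -6).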

Expected value of the response

Show $E(y|X) = X\beta$
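One possible way to show this, sketched using only the model $y = X\beta + \epsilon$, the assumption that the errors have mean 0, and the rule $E(Az + C) = AE(z) + C$ with $A = I$, $z = \epsilon$, and $C = X\beta$ (which is constant once we condition on $X$):

```latex
\begin{aligned}
E(y|X) &= E(X\beta + \epsilon \,|\, X) \\
       &= X\beta + E(\epsilon \,|\, X) && \text{($X\beta$ is constant given $X$)} \\
       &= X\beta + 0 && \text{(errors have mean 0)} \\
       &= X\beta
\end{aligned}
```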

Variance


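Let $z = \begin{bmatrix}z_1 \\ \vdots \\ z_p\end{bmatrix}$ be a $p \times 1$ vector of random variables.

Then

$$Var(z) = \begin{bmatrix}Var(z_1) & Cov(z_1, z_2) & \dots & Cov(z_1, z_p) \\ Cov(z_2, z_1) & Var(z_2) & \dots & Cov(z_2, z_p) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(z_p, z_1) & Cov(z_p, z_2) & \dots & Var(z_p)\end{bmatrix}$$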

Variance

Let $A$ be an $n \times p$ matrix of constants and $z$ a $p \times 1$ vector of random variables. Then

$$Var(z) = E[(z - E(z))(z - E(z))^T]$$

$$Var(Az) = E[(Az - E(Az))(Az - E(Az))^T] = A\,Var(z)\,A^T$$
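The identity for Var(Az) also holds exactly for sample covariance matrices, so it can be checked numerically. The matrix `A` and the simulated draws below are hypothetical.

```r
set.seed(4)

n_draws <- 500

# Each row of Z is one draw of a 3 x 1 random vector with correlated entries
z1 <- rnorm(n_draws)
z2 <- 0.5 * z1 + rnorm(n_draws)
z3 <- rnorm(n_draws, sd = 2)
Z <- cbind(z1, z2, z3)

A <- matrix(c(1, 2, 0,
              0, 1, 3), nrow = 2, byrow = TRUE)  # 2 x 3 matrix of constants

# Sample covariance of the transformed draws A z
var_Az <- cov(Z %*% t(A))

# A Var(z) A^T using the sample covariance of z
A_var_A <- A %*% cov(Z) %*% t(A)

print(var_Az)
print(A_var_A)
```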

Variance of the response

Show

$$Var(y|X) = \sigma^2_\epsilon I$$
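A sketch of one route to this result, starting from the definition of variance and using $E(y|X) = X\beta$, $y = X\beta + \epsilon$, and the assumption that the errors have mean 0 and variance $\sigma^2_\epsilon I$:

```latex
\begin{aligned}
Var(y|X) &= E\left[(y - E(y|X))(y - E(y|X))^T \,|\, X\right] \\
         &= E\left[(X\beta + \epsilon - X\beta)(X\beta + \epsilon - X\beta)^T \,|\, X\right] \\
         &= E\left[\epsilon\epsilon^T \,|\, X\right] \\
         &= Var(\epsilon) && \text{(since $E(\epsilon) = 0$)} \\
         &= \sigma^2_\epsilon I
\end{aligned}
```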

Linear transformation of normal random variable

Suppose $z$ is a (multivariate) normal random variable such that $z \sim N(\mu, \Sigma)$

A linear transformation of $z$ is also multivariate normal, such that

$$Az + B \sim N(A\mu + B, A\Sigma A^T)$$

Explain why y|X is normally distributed.
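One way to frame the explanation, as a sketch: given $X$, the response is a linear transformation of the normally distributed error vector, with $A = I$, $B = X\beta$, $\mu = 0$, and $\Sigma = \sigma^2_\epsilon I$ in the result above.

```latex
\begin{aligned}
y &= I\epsilon + X\beta, \qquad \epsilon \sim N(0, \sigma^2_\epsilon I) \\
y|X &\sim N\left(I \cdot 0 + X\beta,\; I(\sigma^2_\epsilon I)I^T\right) = N(X\beta, \sigma^2_\epsilon I)
\end{aligned}
```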

Recap

  • Introduced statistical inference in the context of regression

  • Described the assumptions for regression

  • Connected the distribution of residuals and inferential procedures

Next class

  • Confidence intervals for βj

  • Hypothesis testing based on ANOVA

  • See Prepare for Lecture 09

🔗 STA 221 - Spring 2025
