Inference for regression

Cont’d

Prof. Maria Tackett

Feb 11, 2025

Announcements

  • Research topics due TODAY at 11:59pm on GitHub

  • HW 02 due Thursday at 11:59pm

  • Statistics experience due Tuesday, April 22

Exam 01

  • 50 points total

    • in-class: 35-40 points
    • take-home: 10-15 points
  • In-class (35-40 pts): 75 minutes during February 18 lecture

    • You will be randomly assigned to an exam room
  • Take-home (10-15 pts): released after class on Tuesday

  • If you miss any part of the exam for an excused absence (with academic dean’s note or other official documentation), your Exam 02 score will be counted twice

Resources

  • Exam 01 practice

  • Lecture recordings

  • Prepare readings (see course schedule)

  • Lecture notes (use search bar to find specific topics)

  • AEs

  • Assignments

Topics

  • Conduct inference on a single coefficient

Computing setup

# load packages
library(tidyverse)  
library(tidymodels)  
library(knitr)       
library(kableExtra)  
library(patchwork)   

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Data: NCAA Football expenditures

Today’s data come from Equity in Athletics Data Analysis and include information about sports expenditures and revenues for colleges and universities in the United States. This data set was featured in a March 2022 Tidy Tuesday.

We will focus on the 2019 - 2020 season expenditures on football for institutions in the NCAA - Division 1 FBS. The variables are:

  • total_exp_m: Total expenditures on football in the 2019 - 2020 academic year (in millions USD)

  • enrollment_th: Total student enrollment in the 2019 - 2020 academic year (in thousands)

  • type: institution type (Public or Private)

football <- read_csv("data/ncaa-football-exp.csv")

Regression model

exp_fit <- lm(total_exp_m ~ enrollment_th + type, data = football)
tidy(exp_fit) |>
  kable(digits = 3)
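| term          | estimate | std.error | statistic | p.value |
|:--------------|---------:|----------:|----------:|--------:|
| (Intercept)   |   19.332 |     2.984 |     6.478 |       0 |
| enrollment_th |    0.780 |     0.110 |     7.074 |       0 |
| typePublic    |  -13.226 |     3.153 |    -4.195 |       0 |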
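
To see how the typePublic coefficient enters the fitted equation, the sketch below plugs the rounded estimates from the output above into the model for two institutions with the same enrollment. The enrollment of 20 (thousand) and the helper name `predict_exp` are assumptions made for illustration; they are not values from the data.

```r
# Rounded coefficient estimates copied from the tidy() output
b0       <- 19.332   # (Intercept)
b_enroll <- 0.780    # enrollment_th
b_public <- -13.226  # typePublic

# Hypothetical helper: fitted expenditure (millions USD) from the estimated equation
predict_exp <- function(enrollment_th, is_public) {
  b0 + b_enroll * enrollment_th + b_public * is_public
}

# Two hypothetical institutions, both with 20 thousand students
private_exp <- predict_exp(20, is_public = 0)
public_exp  <- predict_exp(20, is_public = 1)

# The gap between them equals the typePublic coefficient
public_exp - private_exp
```

Because enrollment is the same for both institutions, the difference in fitted expenditures is exactly the typePublic estimate, which is what "after adjusting for enrollment" refers to.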

Inference for a single coefficient

Inference for $\beta_j$

We often want to conduct inference on individual model coefficients

  • Hypothesis test: Is there a linear relationship between the response and $x_j$?

  • Confidence interval: What is a plausible range of values $\beta_j$ can take?

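Sampling distribution of $\hat{\boldsymbol{\beta}}$

  • A sampling distribution is the probability distribution of a statistic for a large number of random samples of size $n$ from a population

  • The sampling distribution of $\hat{\boldsymbol{\beta}}$ is the probability distribution of the estimated coefficients if we repeatedly took samples of size $n$ and fit the regression model

$$\hat{\boldsymbol{\beta}} \sim N\left(\boldsymbol{\beta}, \sigma^2_\epsilon (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\right)$$

The estimated coefficients $\hat{\boldsymbol{\beta}}$ are normally distributed with

$$E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta} \qquad Var(\hat{\boldsymbol{\beta}}) = \sigma^2_\epsilon (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}$$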
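
A small simulation can make the idea of repeated sampling concrete. The sketch below uses an assumed population model (the true coefficients, error standard deviation, and sample size are made up for illustration and are unrelated to the football data), refits the model on many simulated samples, and compares the empirical mean and covariance of the estimates to the theoretical values.

```r
set.seed(221)

# Assumed "population" model for the simulation (not the football data)
beta_true <- c(2, 0.5)   # intercept and slope
sigma_eps <- 1           # error standard deviation
n         <- 50          # sample size
n_sims    <- 2000        # number of repeated samples

# Keep the predictor values fixed so X is the same in every sample
x <- runif(n, 0, 10)
X <- cbind(1, x)

# Draw a new response each time and refit the model
beta_hats <- replicate(n_sims, {
  y <- beta_true[1] + beta_true[2] * x + rnorm(n, sd = sigma_eps)
  coef(lm(y ~ x))
})

# Empirical mean and covariance of the estimates across samples
emp_mean <- rowMeans(beta_hats)
emp_cov  <- cov(t(beta_hats))

# Theoretical covariance: sigma^2 (X'X)^{-1}
theo_cov <- sigma_eps^2 * solve(t(X) %*% X)

emp_mean
emp_cov
theo_cov
```

With enough simulated samples, the empirical mean should sit close to the assumed coefficients and the empirical covariance close to the theoretical matrix.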

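Sampling distribution of $\hat{\beta}_j$

$$\hat{\boldsymbol{\beta}} \sim N\left(\boldsymbol{\beta}, \sigma^2_\epsilon (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\right)$$

Let $\mathbf{C} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}$. Then, for each coefficient $\hat{\beta}_j$,

  • $E(\hat{\beta}_j) = \beta_j$, the $j^{th}$ element of $\boldsymbol{\beta}$

  • $Var(\hat{\beta}_j) = \sigma^2_\epsilon C_{jj}$

  • $Cov(\hat{\beta}_i, \hat{\beta}_j) = \sigma^2_\epsilon C_{ij}$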
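
The formulas above can be checked by hand in R. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption, so the example runs without the football file) and replaces the unknown $\sigma^2_\epsilon$ with its usual estimate, the residual sum of squares divided by $n - p - 1$.

```r
# Stand-in data: built-in mtcars, not the football data
fit <- lm(mpg ~ wt + am, data = mtcars)

X <- model.matrix(fit)   # design matrix, including the intercept column
n <- nrow(X)
p <- ncol(X) - 1         # number of predictors

# Estimate of sigma_epsilon^2: residual sum of squares / (n - p - 1)
sigma2_hat <- sum(residuals(fit)^2) / (n - p - 1)

C <- solve(t(X) %*% X)           # C = (X'X)^{-1}
var_beta_hat <- sigma2_hat * C   # estimated covariance matrix of the coefficients

# Standard errors are the square roots of the diagonal elements sigma^2 * C_jj
se_by_hand <- sqrt(diag(var_beta_hat))

se_by_hand
sqrt(diag(vcov(fit)))    # same values from R's built-in vcov()
```

The square roots of the diagonal elements are the values that appear in the std.error column of a coefficient table.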

Hypothesis test for $\beta_j$

Steps for a hypothesis test

  1. State the null and alternative hypotheses.
  2. Calculate a test statistic.
  3. Calculate the p-value.
  4. State the conclusion.


Let’s walk through the steps to test $\beta_j$, the coefficient for typePublic.

Hypothesis test for $\beta_j$: Hypotheses

  • Null: There is no linear relationship between institution type and football expenditure, after adjusting for enrollment, $H_0: \beta_j = 0$

  • Alternative: There is a linear relationship between institution type and football expenditure, after adjusting for enrollment, $H_a: \beta_j \neq 0$

Hypothesis test for $\beta_j$: Test statistic

| term          | estimate | std.error | statistic | p.value |
|:--------------|---------:|----------:|----------:|--------:|
| (Intercept)   |   19.332 |     2.984 |     6.478 |       0 |
| enrollment_th |    0.780 |     0.110 |     7.074 |       0 |
| typePublic    |  -13.226 |     3.153 |    -4.195 |       0 |

Test statistic: Number of standard errors the estimate is away from the null

$$\text{Test statistic} = \frac{\text{Estimate} - \text{Null}}{\text{Standard error}} = \frac{-13.226 - 0}{3.153} = -4.195$$

This means the estimated coefficient of -13.226 is 4.195 standard errors below the hypothesized value of 0.
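
The same arithmetic can be reproduced in R from the rounded estimate and standard error for typePublic. The variable names are chosen here for illustration.

```r
estimate   <- -13.226  # typePublic estimate from the table
null_value <- 0        # value of beta_j under H0
std_error  <- 3.153    # typePublic standard error from the table

# Number of standard errors the estimate is from the null value
t_stat <- (estimate - null_value) / std_error
round(t_stat, 3)
```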

Hypothesis test for $\beta_j$: p-value

  • Under $H_0$, the test statistic follows a $t$ distribution with $n - p - 1 = 124$ degrees of freedom.

$$\text{p-value} = P(|T| > |-4.195|)$$

2 * pt(4.195, df = nrow(football) - 2 - 1, lower.tail = FALSE)
[1] 0.00005153923


Given $\beta_j = 0$ ($H_0$ is true), the probability of observing a coefficient estimate at least as extreme as -13.226 is about 0.00005, which is $\approx 0$.
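
The two-sided p-value is the combined area in both tails of the $t$ distribution beyond the observed test statistic. A minimal sketch, using the test statistic and the 124 degrees of freedom stated above, adds the two tails explicitly:

```r
t_stat <- -4.195
df     <- 124   # degrees of freedom stated above

# Area in each tail of the t distribution beyond |t_stat|
lower_tail <- pt(-abs(t_stat), df = df)
upper_tail <- pt(abs(t_stat), df = df, lower.tail = FALSE)

p_value <- lower_tail + upper_tail
p_value
```

Because the $t$ distribution is symmetric around 0, the two tail areas are equal, which is why doubling one tail gives the same result.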

Hypothesis test for $\beta_j$: Conclusion

  • The p-value is $\approx 0$, so we reject $H_0$.

  • The data provide sufficient evidence that $\beta_j \neq 0$; that is, there is evidence of a linear relationship between institution type and football expenditure, after adjusting for enrollment.

Confidence interval for $\beta_j$

Confidence interval for $\beta_j$

  • A plausible range of values for a population parameter is called a confidence interval

  • Using only a single point estimate is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net

    • We can throw a spear where we saw a fish, but we will probably miss; if we toss a net in that area, we have a good chance of catching the fish

    • Similarly, if we report a point estimate, we probably will not hit the exact population parameter, but if we report a range of plausible values we have a good shot at capturing the parameter

What “confidence” means

  • We will construct C% confidence intervals.

    • The confidence level impacts the width of the interval
  • “Confident” means if we were to take repeated samples of the same size as our data, fit regression lines using the same predictors, and calculate C% CIs for the coefficient of $x_j$, then C% of those intervals will contain the true value of the coefficient $\beta_j$

  • Balance precision and accuracy when selecting a confidence level
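
The repeated-sampling meaning of "confident" can be illustrated with a short simulation. The sketch below assumes a made-up population model (true coefficients, sample size, and number of simulations are all illustrative choices, not values from the football data), computes a 95% interval for the slope in each simulated sample, and records how often the interval contains the true slope.

```r
set.seed(2025)

# Assumed population model for the simulation (not the football data)
beta_true <- c(2, 0.5)   # intercept and slope
n         <- 50          # sample size
n_sims    <- 1000        # number of repeated samples
level     <- 0.95        # confidence level

x <- runif(n, 0, 10)

# For each simulated sample, check whether the CI for the slope covers the true slope
covered <- replicate(n_sims, {
  y  <- beta_true[1] + beta_true[2] * x + rnorm(n)
  ci <- confint(lm(y ~ x), parm = "x", level = level)
  ci[1, 1] <= beta_true[2] && beta_true[2] <= ci[1, 2]
})

# Proportion of intervals that contain the true slope
mean(covered)
```

The proportion should land near the chosen confidence level, though it will vary somewhat from one run of the simulation to the next.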

Confidence interval for $\beta_j$

$$\text{Estimate} \pm (\text{critical value}) \times SE$$


$$\hat{\beta}_j \pm t^* \times SE(\hat{\beta}_j)$$

where $t^*$ is calculated from a $t$ distribution with $n - p - 1$ degrees of freedom
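
As a sketch of how the formula translates to code, the hypothetical helper `coef_ci()` below builds the interval from an estimate, its standard error, and the degrees of freedom. It is checked against base R's `confint()` on the built-in `mtcars` data, which is used here only as a stand-in so the example runs on its own.

```r
# Hypothetical helper: CI for one coefficient from its estimate, SE, and df
coef_ci <- function(estimate, std_error, df, level = 0.95) {
  alpha  <- 1 - level
  t_star <- qt(1 - alpha / 2, df = df)   # critical value t*
  c(lower = estimate - t_star * std_error,
    upper = estimate + t_star * std_error)
}

# Stand-in data: built-in mtcars, not the football data
fit <- lm(mpg ~ wt + am, data = mtcars)
est <- coef(summary(fit))["wt", "Estimate"]
se  <- coef(summary(fit))["wt", "Std. Error"]

coef_ci(est, se, df = df.residual(fit))   # by hand
confint(fit, parm = "wt", level = 0.95)   # built-in, for comparison
```

`df.residual(fit)` returns $n - p - 1$ for a model with an intercept, so the helper and `confint()` use the same critical value.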

Confidence interval: Critical value

# confidence level: 95%
qt(0.975, df = nrow(football) - 2 - 1)
[1] 1.97928


# confidence level: 90%
qt(0.95, df = nrow(football) - 2 - 1)
[1] 1.657235


# confidence level: 99%
qt(0.995, df = nrow(football) - 2 - 1)
[1] 2.61606

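95% CI for $\beta_j$: Calculation

| term          | estimate | std.error | statistic | p.value |
|:--------------|---------:|----------:|----------:|--------:|
| (Intercept)   |   19.332 |     2.984 |     6.478 |       0 |
| enrollment_th |    0.780 |     0.110 |     7.074 |       0 |
| typePublic    |  -13.226 |     3.153 |    -4.195 |       0 |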
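
A worked version of the calculation for typePublic, using the rounded estimate and standard error from the output and the 124 degrees of freedom used for the critical values, might look like the sketch below. Because the inputs are rounded to three decimals, the endpoints can differ in the last digit from those computed on the unrounded model.

```r
estimate  <- -13.226  # typePublic estimate
std_error <- 3.153    # typePublic standard error
df        <- 124      # n - p - 1

t_star <- qt(0.975, df = df)   # critical value for a 95% interval
margin <- t_star * std_error   # margin of error

ci <- c(lower = estimate - margin, upper = estimate + margin)
round(ci, 2)
```

The interval lies entirely below 0, which agrees with rejecting $H_0: \beta_j = 0$ in the two-sided test at the 5% level.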

95% CI for $\beta_j$ in R

tidy(exp_fit, conf.int = TRUE, conf.level = 0.95) |> 
  kable(digits = 3)
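| term          | estimate | std.error | statistic | p.value | conf.low | conf.high |
|:--------------|---------:|----------:|----------:|--------:|---------:|----------:|
| (Intercept)   |   19.332 |     2.984 |     6.478 |       0 |   13.426 |    25.239 |
| enrollment_th |    0.780 |     0.110 |     7.074 |       0 |    0.562 |     0.999 |
| typePublic    |  -13.226 |     3.153 |    -4.195 |       0 |  -19.466 |    -6.986 |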


Interpretation: We are 95% confident that for each additional 1,000 students enrolled, the institution’s expenditures on football will be greater by $562,000 to $999,000, on average, holding institution type constant.

Application exercise

📋 sta221-sp25.netlify.app/ae/ae-03-inference.html

Recap

  • Conducted hypothesis tests for a single coefficient $\beta_j$

  • Computed and interpreted confidence intervals for a single coefficient $\beta_j$

Next class

  • Exam 01 review

🔗 STA 221 - Spring 2025
