Inference for regression

Cont’d

Author

Prof. Maria Tackett

Published

Feb 06, 2025

Announcements

HW 02 due Thursday, February 13 at 11:59pm
- Released after class
Lecture recordings available until start of exam, February 18 at 10:05am
- See link under “Exam 01” on menu of course website
Statistics experience due Tuesday, April 22

Topics

Understand statistical inference in the context of regression
Describe the assumptions for regression
Understand connection between distribution of residuals and inferential procedures
Conduct inference on a single coefficient

Computing setup

# load packages
library(tidyverse)  
library(tidymodels)  
library(knitr)       
library(kableExtra)  
library(patchwork)   

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Data: NCAA Football expenditures

Today’s data come from Equity in Athletics Data Analysis and include information about sports expenditures and revenues for colleges and universities in the United States. This data set was featured in a March 2022 Tidy Tuesday.

We will focus on the 2019 - 2020 season expenditures on football for institutions in the NCAA Division I FBS. The variables are:

total_exp_m: Total expenditures on football in the 2019 - 2020 academic year (in millions USD)
enrollment_th: Total student enrollment in the 2019 - 2020 academic year (in thousands)
type: institution type (Public or Private)

football <- read_csv("data/ncaa-football-exp.csv")

Univariate EDA

Bivariate EDA

Regression model

exp_fit <- lm(total_exp_m ~ enrollment_th + type, data = football)
tidy(exp_fit) |>
  kable(digits = 3)

term	estimate	std.error	statistic
(Intercept)	19.332	2.984	6.478
enrollment_th	0.780	0.110	7.074
typePublic	-13.226	3.153	-4.195

For every additional 1,000 students, we expect an institution’s total expenditures on football to increase by $780,000, on average, holding institution type constant.

From sample to population

For every additional 1,000 students, we expect an institution’s total expenditures on football to increase by $780,000, on average, holding institution type constant.

This estimate is valid for the single sample of 127 higher education institutions in the 2019 - 2020 academic year.
But what if we’re not interested in quantifying the relationship between student enrollment, institution type, and football expenditures for this single sample?
What if we want to say something about the relationship between these variables for all colleges and universities with football programs and across different years?

Inference for regression

Statistical inference

Statistical inference provides methods and tools so we can use the single observed sample to make valid statements (inferences) about the population it comes from
For our inferences to be valid, the sample should be representative (ideally random) of the population we’re interested in

Image source: Eugene Morgan © Penn State

Linear regression model

$y = X β + ϵ, \quad ϵ \sim N (0, σ_{ϵ}^{2} I)$

such that the errors are independent and normally distributed.

Independent: Knowing the error term for one observation doesn’t tell us about the error term for another observation
Normally distributed: The distribution follows a particular mathematical model that is unimodal and symmetric
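
To make the model concrete, here is a minimal sketch that simulates data from $y = X β + ϵ$ with independent normal errors. The sample size, coefficients, and error standard deviation are made-up values chosen only for illustration; they are not taken from the football data.

```r
# Simulate data from y = X beta + epsilon with independent normal errors.
# All values here (n, beta, sigma) are hypothetical, chosen for illustration.
set.seed(221)
n     <- 200
beta  <- c(2, 0.5)   # hypothetical intercept and slope
sigma <- 1.5         # hypothetical error standard deviation

x <- runif(n, min = 0, max = 10)
X <- cbind(1, x)     # design matrix: column of 1s for the intercept, then x

# Independent N(0, sigma^2) errors, one per observation
epsilon <- rnorm(n, mean = 0, sd = sigma)
y <- as.vector(X %*% beta + epsilon)

# The simulated errors should be centered near 0 with spread near sigma
c(mean = mean(epsilon), sd = sd(epsilon))
```

Because each error is drawn separately, knowing one value of `epsilon` says nothing about another, which is the independence condition described above.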

Visualizing distribution of $y | X$

$y | X \sim N (X β, σ_{ϵ}^{2} I)$

Image source: *Introduction to the Practice of Statistics (5th ed)*

Linear transformation of normal random variable

Suppose $z$ is a (multivariate) normal random variable such that $z \sim N (μ, Σ)$, $A$ is a matrix of constants, and $b$ is a vector of constants.

A linear transformation of $z$ is also multivariate normal, such that

$A z + b \sim N (A μ + b, A Σ A^{T})$

Explain why $y | X$ is normally distributed.
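
One possible line of reasoning, sketched with the linear transformation result: once we condition on $X$, the term $X β$ is a vector of constants, so we can take $z = ϵ$, $A = I$, and $b = X β$.

```latex
\begin{aligned}
y &= X β + ϵ, \qquad ϵ \sim N(0, σ_{ϵ}^{2} I) \\
y \mid X &= I ϵ + X β \sim N\big(I \cdot 0 + X β,\; I (σ_{ϵ}^{2} I) I^{T}\big) \\
&= N(X β,\; σ_{ϵ}^{2} I)
\end{aligned}
```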

Assumptions for regression

$y | X \sim N (X β, σ_{ϵ}^{2} I)$

Linearity: There is a linear relationship between the response and predictor variables.
Constant Variance: The variability about the least squares line is generally constant.
Normality: The distribution of the residuals is approximately normal.
Independence: The residuals are independent from one another.
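
A common way to examine these assumptions is with residual plots. The sketch below uses simulated data (all values are made up) and base R graphics. It is one possible set of checks, not a prescribed procedure from this lecture.

```r
# Simulated data that satisfies the assumptions; all values are hypothetical.
set.seed(221)
n <- 200
x <- runif(n, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

res         <- resid(fit)
fitted_vals <- fitted(fit)

op <- par(mfrow = c(1, 2))

# Linearity and constant variance: look for no pattern and even vertical spread
plot(fitted_vals, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)

# Normality: points should fall close to the reference line
qqnorm(res)
qqline(res)

par(op)
```

Independence is usually judged from how the data were collected rather than from a plot of this kind.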

Estimating $σ_{ϵ}^{2}$

Once we fit the model, we can use the residuals to estimate $σ_{ϵ}^{2}$
The estimated value ${\hat{σ}}_{ϵ}^{2}$ is needed for hypothesis testing and constructing confidence intervals for regression

${\hat{σ}}_{ϵ}^{2} = \frac{SSR}{n - p - 1} = \frac{e^{T} e}{n - p - 1}$

The regression standard error ${\hat{σ}}_{ϵ}$ is a measure of the average distance between the observations and regression line

${\hat{σ}}_{ϵ} = \sqrt{\frac{SSR}{n - p - 1}} = \sqrt{\frac{e^{T} e}{n - p - 1}}$
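
The sketch below computes the regression standard error directly from the residuals and compares it with the value reported by `summary()`. The data are simulated with made-up coefficients and $p = 2$ predictors, so the numbers are illustrative only.

```r
# Simulated data with p = 2 predictors; all values are hypothetical.
set.seed(221)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 3)
fit <- lm(y ~ x1 + x2)

e <- resid(fit)   # residual vector e
p <- 2            # number of predictors, not counting the intercept

# sigma_hat^2 = SSR / (n - p - 1) = e^T e / (n - p - 1)
sigma_sq_hat <- sum(e^2) / (n - p - 1)
sigma_hat    <- sqrt(sigma_sq_hat)

# Compare with the residual standard error stored in the model summary
c(by_hand = sigma_hat, from_summary = summary(fit)$sigma)
```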

Inference for a single coefficient

Inference for $β_{j}$

We often want to conduct inference on individual model coefficients

Hypothesis test: Is there a linear relationship between the response and $x_{j}$?
Confidence interval: What is a plausible range of values $β_{j}$ can take?

But first we need to understand the distribution of ${\hat{β}}_{j}$

Sampling distribution of $\hat{β}$

A sampling distribution is the probability distribution of a statistic for a large number of random samples of size $n$ from a population
The sampling distribution of $\hat{β}$ is the probability distribution of the estimated coefficients if we repeatedly took samples of size $n$ and fit the regression model

$\hat{β} \sim N (β, σ_{ϵ}^{2} (X^{T} X)^{- 1})$

The estimated coefficients $\hat{β}$ are normally distributed with

$E (\hat{β}) = β$ and $Var (\hat{β}) = σ_{ϵ}^{2} (X^{T} X)^{- 1}$
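
A small simulation can illustrate the sampling distribution. The sketch below holds a design matrix fixed, repeatedly generates new responses from a model with known (made-up) parameters, and compares the empirical mean and covariance of the estimates with $β$ and $σ_{ϵ}^{2} (X^{T} X)^{- 1}$.

```r
# Hypothetical true parameters; X is held fixed across simulated samples.
set.seed(221)
n      <- 50
beta   <- c(1, 2)
sigma  <- 2
x      <- runif(n, min = 0, max = 5)
X      <- cbind(1, x)
n_sims <- 2000

# For each simulated sample, compute beta_hat = (X^T X)^{-1} X^T y
beta_hats <- t(replicate(n_sims, {
  y <- as.vector(X %*% beta) + rnorm(n, sd = sigma)
  as.vector(solve(crossprod(X), crossprod(X, y)))
}))

empirical_mean  <- colMeans(beta_hats)            # should be close to beta
empirical_var   <- cov(beta_hats)                 # should be close to the line below
theoretical_var <- sigma^2 * solve(crossprod(X))  # sigma^2 (X^T X)^{-1}

empirical_mean
empirical_var
theoretical_var
```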

Expected value of $\hat{β}$

Show

$E (\hat{β}) = β$
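
One way to show this, sketched under the model assumptions and using the least-squares estimator $\hat{β} = (X^{T} X)^{- 1} X^{T} y$ with $X$ treated as fixed:

```latex
\begin{aligned}
E(\hat{β}) &= E\big((X^{T} X)^{-1} X^{T} y\big) \\
&= (X^{T} X)^{-1} X^{T} E(y) \\
&= (X^{T} X)^{-1} X^{T} X β \\
&= β
\end{aligned}
```

The second line uses the fact that $(X^{T} X)^{- 1} X^{T}$ is a matrix of constants, and the third uses $E (y | X) = X β$.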

Will show $Var (\hat{β})$ in homework

Sampling distribution of ${\hat{β}}_{j}$

$\hat{β} \sim N (β, σ_{ϵ}^{2} (X^{T} X)^{- 1})$

Let $C = (X^{T} X)^{- 1}$. Then, for each coefficient ${\hat{β}}_{j}$,

$E ({\hat{β}}_{j}) = β_{j}$, the $j^{th}$ element of $β$
$Var ({\hat{β}}_{j}) = σ_{ϵ}^{2} C_{j j}$
$Cov ({\hat{β}}_{i}, {\hat{β}}_{j}) = σ_{ϵ}^{2} C_{i j}$

$Var (\hat{β})$ for NCAA data

X <- model.matrix(total_exp_m ~ enrollment_th + type, 
                  data = football)
sigma_sq <- glance(exp_fit)$sigma^2

var_beta <- sigma_sq * solve(t(X) %*% X)
var_beta

              (Intercept) enrollment_th typePublic
(Intercept)     8.9054556   -0.13323338 -6.0899556
enrollment_th  -0.1332334    0.01216984 -0.1239408
typePublic     -6.0899556   -0.12394079  9.9388370

$SE (\hat{β})$ for NCAA data

term	estimate	std.error	statistic
(Intercept)	19.332	2.984	6.478
enrollment_th	0.780	0.110	7.074
typePublic	-13.226	3.153	-4.195

sqrt(diag(var_beta))

  (Intercept) enrollment_th    typePublic 
     2.984201      0.110317      3.152592
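
As a cross-check, base R's `vcov()` returns the same estimated variance-covariance matrix. The sketch below uses a simulated stand-in for the football data (the data frame `sim_football` and every value in it are made up) so that it runs without the course data file.

```r
# Simulated stand-in for the football data: every number below is hypothetical.
set.seed(221)
n <- 127
sim_football <- data.frame(
  enrollment_th = runif(n, min = 1, max = 60),
  type = sample(c("Private", "Public"), size = n, replace = TRUE)
)
sim_football$total_exp_m <- 19 + 0.8 * sim_football$enrollment_th -
  13 * (sim_football$type == "Public") + rnorm(n, sd = 12)

sim_fit <- lm(total_exp_m ~ enrollment_th + type, data = sim_football)

# Variance-covariance matrix by hand: sigma_hat^2 * (X^T X)^{-1}
X        <- model.matrix(sim_fit)
sigma_sq <- summary(sim_fit)$sigma^2
var_beta <- sigma_sq * solve(t(X) %*% X)

# Built-in equivalents: vcov() and the standard errors in summary()
all.equal(var_beta, vcov(sim_fit), check.attributes = FALSE)
all.equal(sqrt(diag(var_beta)),
          summary(sim_fit)$coefficients[, "Std. Error"],
          check.attributes = FALSE)
```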

Hypothesis test for $β_{j}$

Steps for a hypothesis test

State the null and alternative hypotheses.
Calculate a test statistic.
Calculate the p-value.
State the conclusion.

Hypothesis test for $β_{j}$ : Hypotheses

We will generally test the hypotheses:

$\begin{aligned} H_{0} : β_{j} = 0 \\ H_{a} : β_{j} \neq 0 \end{aligned}$

State these hypotheses in words.

Hypothesis test for $β_{j}$ : Test statistic

Test statistic: Number of standard errors the estimate is away from the null

$\text{Test statistic} = \frac{\text{Estimate} - \text{Null}}{\text{Standard error}}$

If $σ_{ϵ}^{2}$ were known, the test statistic would be

$Z = \frac{{\hat{β}}_{j} - 0}{SE ({\hat{β}}_{j})} = \frac{{\hat{β}}_{j} - 0}{\sqrt{σ_{ϵ}^{2} C_{j j}}} \sim N (0, 1)$

In general, $σ_{ϵ}^{2}$ is not known, so we use ${\hat{σ}}_{ϵ}^{2}$ to calculate $SE ({\hat{β}}_{j})$

$T = \frac{{\hat{β}}_{j} - 0}{SE ({\hat{β}}_{j})} = \frac{{\hat{β}}_{j} - 0}{\sqrt{{\hat{σ}}_{ϵ}^{2} C_{j j}}} \sim t_{n - p - 1}$

Hypothesis test for $β_{j}$ : Test statistic

The test statistic $T$ follows a $t$ distribution with $n - p - 1$ degrees of freedom.
We need to account for the additional variability introduced by calculating $SE ({\hat{β}}_{j})$ using an estimated value instead of a constant
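
The sketch below computes the test statistic $T$ for each coefficient from its definition and compares it with the `t value` column of `summary()`. The data are simulated with made-up coefficients and $p = 2$ predictors.

```r
# Simulated data; all values are hypothetical.
set.seed(221)
n  <- 80
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)
C <- solve(t(X) %*% X)                            # C = (X^T X)^{-1}
sigma_sq_hat <- sum(resid(fit)^2) / (n - 2 - 1)   # SSR / (n - p - 1)

# SE(beta_hat_j) = sqrt(sigma_hat^2 * C_jj); T = (beta_hat_j - 0) / SE
se_beta <- sqrt(sigma_sq_hat * diag(C))
t_stat  <- (coef(fit) - 0) / se_beta

cbind(by_hand = t_stat,
      from_summary = summary(fit)$coefficients[, "t value"])
```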

t vs. N(0,1)

Figure 1: Standard normal vs. t distributions
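
To see the difference numerically, the sketch below compares the 97.5th percentile of $t$ distributions with several degrees of freedom to that of the standard normal. The degrees of freedom are chosen for illustration; 124 matches $n - p - 1$ for the football model.

```r
# Degrees of freedom chosen for illustration; 124 = 127 - 2 - 1
dfs    <- c(2, 5, 10, 30, 124)
t_crit <- qt(0.975, df = dfs)
z_crit <- qnorm(0.975)

# The t quantiles are larger than the normal quantile (heavier tails)
# and move toward it as the degrees of freedom increase
data.frame(df = dfs,
           t_quantile = round(t_crit, 3),
           z_quantile = round(z_crit, 3))
```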

Hypothesis test for $β_{j}$ : P-value

The p-value is the probability of observing a test statistic at least as extreme (in the direction of the alternative hypothesis) as the one observed, assuming the null hypothesis is true

$\text{p-value} = P (| t | > | \text{test statistic} |),$

calculated from a $t$ distribution with $n - p - 1$ degrees of freedom

Why do we take into account “extreme” on both the high and low ends?
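
As a sketch, the two-sided p-value can be computed with `pt()` using the test statistics from the model output (7.074 for `enrollment_th`, -4.195 for `typePublic`) and $n - p - 1 = 127 - 2 - 1$ degrees of freedom. The statistics are rounded, so the results are approximate.

```r
# Test statistics copied (rounded) from the regression output
t_stats <- c(enrollment_th = 7.074, typePublic = -4.195)
df <- 127 - 2 - 1   # n - p - 1

# Two-sided p-value: area in both tails beyond |test statistic|
p_values <- 2 * pt(abs(t_stats), df = df, lower.tail = FALSE)
p_values
```

Taking the absolute value and doubling the upper-tail area counts statistics that are extreme in either direction, which matches the two-sided alternative $β_{j} \neq 0$.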

Understanding the p-value

Magnitude of p-value	Interpretation
p-value < 0.01	strong evidence against $H_{0}$
0.01 < p-value < 0.05	moderate evidence against $H_{0}$
0.05 < p-value < 0.1	weak evidence against $H_{0}$
p-value > 0.1	effectively no evidence against $H_{0}$

These are general guidelines. The strength of evidence depends on the context of the problem.

Hypothesis test for $β_{j}$ : Conclusion

There are two parts to the conclusion

Make a conclusion by comparing the p-value to a predetermined decision-making threshold called the significance level ($α$ level)
- If $\text{p-value} < α$: Reject $H_{0}$
- If $\text{p-value} \geq α$: Fail to reject $H_{0}$
State the conclusion in the context of the data

Application exercise

📋 sta221-sp25.netlify.app/ae/ae-03-inference

Confidence interval for $β_{j}$

A plausible range of values for a population parameter is called a confidence interval
Using only a single point estimate is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net
- We can throw a spear where we saw a fish, but we will probably miss. If we toss a net in that area, we have a good chance of catching the fish
- Similarly, if we report a point estimate, we probably will not hit the exact population parameter, but if we report a range of plausible values we have a good shot at capturing the parameter

What “confidence” means

We will construct $C\%$ confidence intervals.
- The confidence level impacts the width of the interval

“Confident” means if we were to take repeated samples of the same size as our data, fit regression lines using the same predictors, and calculate $C\%$ CIs for the coefficient of $x_{j}$, then $C\%$ of those intervals will contain the true value of the coefficient $β_{j}$

Balance precision and accuracy when selecting a confidence level
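
The repeated-sampling interpretation can be illustrated with a simulation. The sketch below generates many samples from a model with a known (made-up) slope, computes a 95% confidence interval for the slope in each, and records how often the interval contains the true value.

```r
# Hypothetical true model: y = 1 + 2x + error, with error SD of 2
set.seed(221)
n      <- 50
beta1  <- 2
n_sims <- 1000
x      <- runif(n, min = 0, max = 5)

covered <- replicate(n_sims, {
  y  <- 1 + beta1 * x + rnorm(n, sd = 2)
  ci <- confint(lm(y ~ x), parm = "x", level = 0.95)
  ci[1] <= beta1 && beta1 <= ci[2]   # TRUE if the interval captures beta1
})

# Proportion of intervals containing the true slope; expected to be near 0.95
mean(covered)
```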

Confidence interval for $β_{j}$

$\text{Estimate} \pm (\text{critical value}) \times SE$

${\hat{β}}_{j} \pm t^{*} \times SE ({\hat{β}}_{j})$

where $t^{*}$ is calculated from a $t$ distribution with $n - p - 1$ degrees of freedom

Confidence interval: Critical value

# confidence level: 95%
qt(0.975, df = nrow(football) - 2 - 1)

[1] 1.97928

# confidence level: 90%
qt(0.95, df = nrow(football) - 2 - 1)

[1] 1.657235

# confidence level: 99%
qt(0.995, df = nrow(football) - 2 - 1)

[1] 2.61606

95% CI for $β_{j}$ : Calculation

term	estimate	std.error	statistic
(Intercept)	19.332	2.984	6.478
enrollment_th	0.780	0.110	7.074
typePublic	-13.226	3.153	-4.195
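
As a sketch of the calculation for `enrollment_th`, the code below applies $\text{Estimate} \pm t^{*} \times SE$ to the rounded estimate and standard error from the table. Because the inputs are rounded to three digits, the endpoints may differ slightly in the last digit from those computed by software on the unrounded values.

```r
# Rounded values for enrollment_th from the regression output
estimate <- 0.780
se       <- 0.110
df       <- 127 - 2 - 1   # n - p - 1

# Critical value for a 95% confidence level
t_star <- qt(0.975, df = df)

# Estimate +/- t* x SE
ci <- estimate + c(-1, 1) * t_star * se
round(ci, 3)
```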

95% CI for $β_{j}$ in R

tidy(exp_fit, conf.int = TRUE, conf.level = 0.95) |> 
  kable(digits = 3)

term	estimate	std.error	statistic	conf.low	conf.high
(Intercept)	19.332	2.984	6.478	13.426	25.239
enrollment_th	0.780	0.110	7.074	0.562	0.999
typePublic	-13.226	3.153	-4.195	-19.466	-6.986

Interpretation: We are 95% confident that for each additional 1,000 students enrolled, the institution’s expenditures on football will be greater by $562,000 to $999,000, on average, holding institution type constant.

Recap

Introduced statistical inference in the context of regression
Described the assumptions for regression
Connected the distribution of residuals and inferential procedures
Conducted inference on a single coefficient

Next class

Hypothesis testing based on ANOVA
