Feb 04, 2025
Lab 03 due TODAY at 11:59pm
Statistics experience due Tuesday, April 22
Understand statistical inference in the context of regression
Describe the assumptions for regression
Understand connection between distribution of residuals and inferential procedures
Conduct inference on a single coefficient
Today’s data come from Equity in Athletics Data Analysis and include information about sports expenditures and revenues for colleges and universities in the United States. This data set was featured in a March 2022 Tidy Tuesday.
We will focus on the 2019-2020 season expenditures on football for institutions in the NCAA Division I FBS. The variables are:
total_exp_m: Total expenditures on football in the 2019-2020 academic year (in millions USD)
enrollment_th: Total student enrollment in the 2019-2020 academic year (in thousands)
type: Institution type (Public or Private)
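library(broom)  # provides tidy()
library(knitr)  # provides kable()
# Regress football expenditures on enrollment and institution type
exp_fit <- lm(total_exp_m ~ enrollment_th + type, data = football)
tidy(exp_fit) |>
  kable(digits = 3)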
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 19.332 | 2.984 | 6.478 | 0 |
enrollment_th | 0.780 | 0.110 | 7.074 | 0 |
typePublic | -13.226 | 3.153 | -4.195 | 0 |
For every additional 1,000 students, we expect an institution’s total expenditures on football to increase by $780,000, on average, holding institution type constant.
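The unit conversion behind this interpretation can be sketched in R. The slope is copied from the coefficient table; the variable names and the 5,000-student comparison are illustrative assumptions, not part of the original analysis.

```r
# Slope for enrollment_th, copied from the model output.
# Units: millions of USD per 1,000 students.
slope_enrollment <- 0.780

# Expected change in dollars for 1,000 additional students,
# holding institution type constant (about $780,000).
change_usd <- slope_enrollment * 1e6
change_usd

# Hypothetical comparison: 5,000 additional students,
# again holding institution type constant.
change_usd_5k <- slope_enrollment * 5 * 1e6
change_usd_5k
```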
Statistical inference provides methods and tools so we can use the single observed sample to make valid statements (inferences) about the population it comes from
For our inferences to be valid, the sample should be representative (ideally random) of the population we’re interested in
Inference based on ANOVA
Hypothesis test for the statistical significance of the overall regression model
Hypothesis test for a subset of coefficients
Inference for a single coefficient
Hypothesis test for a coefficient
Confidence interval for a coefficient
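As a rough sketch of inference for a single coefficient, the code below computes a t-test and a 95% confidence interval by hand and compares them with R's built-in results. It uses simulated stand-in data, not the football data; the names `sim_fit`, `x`, and `group` and all simulation settings are assumptions made for illustration. It uses base R (`summary()` and `confint()`) rather than `tidy()` so that it runs without extra packages.

```r
set.seed(221)

# Simulated stand-in data (NOT the football data; all values are assumptions)
n <- 100
x <- runif(n, min = 5, max = 50)
group <- factor(sample(c("Private", "Public"), n, replace = TRUE))
y <- 20 + 0.8 * x - 13 * (group == "Public") + rnorm(n, sd = 10)

sim_fit <- lm(y ~ x + group)

# Estimate and standard error for the coefficient of x
coefs <- summary(sim_fit)$coefficients
est <- coefs["x", "Estimate"]
se <- coefs["x", "Std. Error"]
df_resid <- df.residual(sim_fit)  # n minus number of estimated coefficients

# Hypothesis test of H0: beta_x = 0 vs. Ha: beta_x != 0
t_stat <- est / se
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# 95% confidence interval: estimate +/- critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci_manual <- c(est - t_crit * se, est + t_crit * se)

# Built-in interval for comparison
ci_builtin <- confint(sim_fit, "x", level = 0.95)
```

The hand-computed `t_stat` and `p_value` should agree with the `t value` and `Pr(>|t|)` columns of `coefs`, and `ci_manual` should agree with `ci_builtin`.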
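We have discussed multiple ways to find the least squares estimates of $\boldsymbol{\beta}$.
Now we will use statistical inference to draw conclusions about $\boldsymbol{\beta}$, which requires assumptions about the distribution of the errors.
The linear regression model is $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I})$, such that the errors are independent and normally distributed.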
There is some uncertainty in the error terms (and thus the response variable), so we use mathematical models to describe that uncertainty.
Some terminology:
Sample space: Set of all possible outcomes
Random variable: Function (mapping) from the sample space to the real numbers
Event: Subset of the sample space, i.e., a set of possible outcomes (possible values the random variable can take)
Probability density function: Mathematical function describing a continuous random variable; the area under the function over a set of values gives the probability that the random variable falls in that set
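To make the last definition concrete, here is a small sketch using the standard normal distribution. The choice of distribution and the interval (-1.96, 1.96) are assumptions made for illustration. For a continuous random variable, probabilities are areas under the density, so the height of the density at a single point is not itself a probability.

```r
# Probability that a standard normal random variable falls in (-1.96, 1.96),
# computed as the area under the density curve
area <- integrate(dnorm, lower = -1.96, upper = 1.96)$value

# The same probability from the cumulative distribution function
cdf_diff <- pnorm(1.96) - pnorm(-1.96)

area
cdf_diff

# Height of the density at 0: a density value, not a probability
dnorm(0)
```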
The error terms follow a (multivariate) normal distribution with mean $\mathbf{0}$ and variance $\sigma^2_{\epsilon}\mathbf{I}$
Image source: Introduction to the Practice of Statistics (5th ed)
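One way to get a feel for the assumption of independent, normally distributed errors with mean zero and constant variance is to simulate many error vectors and inspect their sample mean and covariance. This is a minimal sketch; the values of `n`, `n_draws`, and `sigma` are assumptions chosen for illustration.

```r
set.seed(1)

n <- 5            # number of observations in each error vector (assumed)
n_draws <- 20000  # number of simulated error vectors (assumed)
sigma <- 2        # error standard deviation (assumed)

# Each row is one draw of an error vector with independent N(0, sigma^2) entries
eps <- matrix(rnorm(n_draws * n, mean = 0, sd = sigma),
              nrow = n_draws, ncol = n)

# Sample means should be close to 0
eps_means <- colMeans(eps)
round(eps_means, 2)

# Sample covariance should be close to sigma^2 times the identity matrix:
# roughly sigma^2 on the diagonal and roughly 0 off the diagonal
emp_cov <- cov(eps)
round(emp_cov, 1)
```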
Let
Then
Let
Show
Let
Then
Let
Show
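Suppose $\mathbf{z} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is a multivariate normal random variable.
A linear transformation of $\mathbf{z}$ is also multivariate normal: $\mathbf{A}\mathbf{z} + \mathbf{b} \sim N(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\mathsf{T})$.
Explain why $\mathbf{y}$ is normally distributed under the regression model with normal errors.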
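A standard result is that a linear transformation of a multivariate normal vector is again multivariate normal: if z has mean mu and covariance Sigma, then A z + b has mean A mu + b and covariance A Sigma A^T. The simulation below is a sketch that checks the mean and covariance numerically; it does not by itself verify normality. The symbols `z`, `A`, `b`, `mu`, `Sigma` and all numeric values are assumptions made for illustration.

```r
set.seed(2)

# Assumed parameters of the multivariate normal vector z
mu <- c(1, -2)
Sigma <- matrix(c(2, 0.5,
                  0.5, 1), nrow = 2, byrow = TRUE)

# Assumed linear transformation v = A z + b
A <- matrix(c(1, 2,
              0, -1), nrow = 2, byrow = TRUE)
b <- c(0, 1)

n_draws <- 50000

# Draw z = mu + L w, where L %*% t(L) = Sigma and w is standard normal.
# chol() returns the upper-triangular factor, so transpose it to get L.
L <- t(chol(Sigma))
w <- matrix(rnorm(2 * n_draws), nrow = 2)
z <- mu + L %*% w   # each column is one draw of z
v <- A %*% z + b    # each column is one draw of A z + b

# Theoretical mean and covariance of the transformed vector
theory_mean <- as.numeric(A %*% mu + b)
theory_cov <- A %*% Sigma %*% t(A)

# Empirical counterparts from the simulation
emp_mean <- rowMeans(v)
emp_cov <- cov(t(v))

round(rbind(theory_mean, emp_mean), 2)
round(theory_cov, 2)
round(emp_cov, 2)
```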
Introduced statistical inference in the context of regression
Described the assumptions for regression
Connected the distribution of residuals and inferential procedures
Confidence intervals for coefficients
Hypothesis testing based on ANOVA