Logistic Regression: Assumptions + Estimation

Prof. Maria Tackett

Apr 10, 2025

Announcements

  • HW 04 due TODAY at 11:59pm

  • Project draft due at beginning of lab tomorrow

    • Peer review in lab
  • Exam 02 - April 17 (same format as Exam 01)

    • Exam 02 practice + lecture recordings available
  • Statistics experience due April 22

Questions from this week’s content?

Topics

  • Confidence intervals for logistic regression

  • Conditions for logistic regression

  • Estimating coefficients for logistic regression model

Computational setup

library(tidyverse)
library(tidymodels)
library(knitr)
library(kableExtra)
library(Stat2Data)  #empirical logit plots
library(patchwork)

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

COVID-19 infection prevention practices at food establishments

Researchers at Wollo University in Ethiopia conducted a study in July and August 2020 to understand factors associated with good COVID-19 infection prevention practices at food establishments. Their study is published in Andualem et al. (2022).

They were particularly interested in the understanding implementation of prevention practices at food establishments, given the workers’ increased risk due to daily contact with customers.

Access to personal protective equipment

We will use the data from Andualem et al. (2022) to explore the association between age, sex, years of service, and whether someone works at a food establishment with access to personal protective equipment (PPE) as of August 2020. We will use access to PPE as a proxy for wearing PPE.

The study participants were selected using a simple random sampling at the selected establishments.

age sex years ppe_access
34 Male 2 1
32 Female 3 1
32 Female 1 1
40 Male 4 1
32 Male 10 1

Full model results

Confidence interval for βj

We can calculate the C% confidence interval for βj as the following:

β^j±z∗×SE(β^j)

where z∗ is calculated from the N(0,1) distribution

This is an interval for the change in the log-odds for every one unit increase in xj

Interpretation in terms of the odds

The change in odds for every one unit increase in xj.

exp⁡{β^j±z∗×SE(β^j)}

Interpretation: We are C% confident that for every one unit increase in xj, the odds multiply by a factor of exp⁡{β^j−z∗×SE(β^j)} to exp⁡{β^j+z∗×SE(β^j)}, holding all else constant.

PPE Access: Interpret CI

Interpret the 95% confidence interval for > 5 years experience in terms of the odds of having access to PPE.

Visualizations

Bivariate EDA: categorical predictor

Bivariate EDA: quantitative predictor

EDA: Potential interaction effect

Empirical logit

The empirical logit is the log of the observed odds:

logit(p^)=log⁡(p^1−p^)=log⁡(#Yes#No)

Calculating empirical logit (categorical predictor)

If the predictor is categorical, we can calculate the empirical logit for each level of the predictor.

covid_df |>
  count(sex, ppe_access) |>
  group_by(sex) |>
  mutate(prop = n/sum(n)) |>
  mutate(emp_logit = log(prop/(1-prop)))
# A tibble: 4 × 5
# Groups:   sex [2]
  sex    ppe_access     n  prop emp_logit
  <fct>  <fct>      <int> <dbl>     <dbl>
1 Female 0            114 0.525     0.101
2 Female 1            103 0.475    -0.101
3 Male   0             65 0.353    -0.605
4 Male   1            119 0.647     0.605

Calculating empirical logit (quantitative predictor)

  1. Divide the range of the predictor into intervals with approximately equal number of cases. (If you have enough observations, use 5 - 10 intervals.)

  2. Compute the empirical logit for each interval

You can then calculate the mean value of the predictor in each interval and create a plot of the empirical logit versus the mean value of the predictor in each interval.

Empirical logit plot in R (quantitative predictor)

Created using dplyr and ggplot functions.

Empirical logit plot in R (quantitative predictor)

Created using dplyr and ggplot functions.

covid_df |> 
  mutate(age_bin = cut_number(age, n = 10)) |>
  group_by(age_bin) |>
  mutate(mean_age = mean(age)) |>
  count(mean_age, ppe_access) |>
  mutate(prop = n/sum(n)) |>
  filter(ppe_access == "1") |>
  mutate(emp_logit = log(prop/(1-prop))) |>
  ggplot(aes(x = mean_age, y = emp_logit)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Mean Age", 
       y = "Empirical logit", 
       title = "Empirical logit of PPE Access vs. Age")

Empirical logit plot in R (quantitative predictor)

Using the emplogitplot1 function from the Stat2Data R package

emplogitplot1(ppe_access ~ age,  data = covid_df, ngroups = 10)

Empirical logit plot in R (interactions)

Using the emplogitplot2 function from the Stat2Data R package

emplogitplot2(ppe_access ~ age + sex, data = covid_df, 
              ngroups = 10, 
              putlegend = "bottomright")

Logistic regression model

ppe_model <- glm(ppe_access ~ age + sex + years, 
                 data = covid_df, family = binomial)
tidy(ppe_model, conf.int = TRUE) |>
  kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -2.127 0.458 -4.641 0.000 -3.058 -1.257
age 0.056 0.017 3.210 0.001 0.023 0.091
sexMale 0.341 0.224 1.524 0.128 -0.098 0.780
years 0.264 0.066 4.010 0.000 0.143 0.401

Visualizing coefficient estimates

model_odds_ratios <- tidy(ppe_model, exponentiate = TRUE, conf.int = TRUE)
ggplot(data = model_odds_ratios, aes(x = term, y = estimate)) +
  geom_point() +
  geom_hline(yintercept = 1, lty = 2) + 
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high))+
  labs(title = "Adjusted odds ratios",
       x = "",
       y = "Estimated AOR") +
  coord_flip()

Assumptions for logistic regression

Assumptions for logistic regression

  • Linearity: The log-odds have a linear relationship with the predictors.

  • Randomness: The data were obtained from a random process

  • Independence: The observations are independent from one another.

Checking linearity

Check the empirical logit plots for the quantitative predictors

emplogitplot1(ppe_access ~ age, data = covid_df, 
              ngroups = 10)

emplogitplot1(ppe_access ~ years, data = covid_df, 
              ngroups = 5)

✅ The linearity condition is satisfied. There is generally a linear relationship between the empirical logit and the quantitative predictor variables

Checking randomness

We can check the randomness condition based on the context of the data and how the observations were collected.

  • Was the sample randomly selected?

  • If the sample was not randomly selected, the condition is still satisfied if the sample is representative of the population

✅ The randomness condition is satisfied. The paper states the participants were selected using simple random sampling at selected establishments. We can reasonably treat this sample as representative of the population.

Checking independence

  • We can check the independence condition based on the context of the data and how the observations were collected.

  • Independence is most often violated if the data were collected over time or there is a strong spatial relationship between the observations.

✅ We will treat this sample as independent. If given the data, we may want to further investigate potential correlation within an establishment.

Estimating β

Estimating β

Recall that the coefficients for logistic regression are estimated using maximum likelihood estimation.

logL(β|x1,…,xn,y1,…,yn)=∑i=1nyixiTβ−∑i=1nlog⁡(1+exp⁡{xiTβ})

Estimating β

Take the derivative and set it equal to 0 to solve for β. (Click here for the full derivation.)
∂log⁡L∂β=∑i=1nyixiT−∑i=1nexp⁡{xiTβ}xiT1+exp⁡{xiTβ}=0

There is no closed form solution. We can find the solution numerically using the Newton-Raphson method.

Newton-Raphson method

Newton-Raphson is a numerical method for finding solutions to f(x)=0 . Steps of the Newton-Raphson method:

  1. Start with an initial guess θ(0) .

  2. For each iteration,

    θ(t+1)=θ(t)−f(θ(t))f′(θ(t))

  3. Stop when the convergence criteria is satisfied.

Newton-Raphson method

Image source: LibreTexts-Mathematics

Example

Let’s find the solution (root) of the function f(x)=x3−20

Example

Suppose the convergence criteria be Δ=|θ(t+1)−θ(t)|<0.001 . We’ll start with an initial guess of θ(0)=3

θ(1)=3−33−203∗32=2.740741

Δ=|2.740741−3|=0.259259

θ(2)=2.740741−2.7407413−203∗2.7407412=2.71467

Δ=|2.71467−2.740741|=0.026071

Example

θ(3)=2.71467−2.714673−203∗2.714672=2.714418

Δ=|2.714418−2.71467|=0.000252<0.001

The solution is ≈2.714418

Score vector & Hessian

Given parameter θ=[θ1,…,θp]T and log-likelihood, log⁡L(θ|X) , then

Score vector =∇θlog⁡L=[∂log⁡L∂θ1⋮∂log⁡L∂θp]

Hessian=∇θ2log⁡L=[∂2log⁡L∂θ12∂2log⁡L∂θ1θ2…∂2log⁡L∂θ1θp∂2log⁡L∂θ2θ1∂2log⁡L∂θ22…∂2log⁡L∂θ2θp⋮⋮…⋮∂2log⁡L∂θpθ1∂2log⁡L∂θpθ2…∂2log⁡L∂θp2]

Newton-Raphson for logistic regression

  1. Start with an initial guess β(0) .

  2. For each iteration,

    β(t+1)=β(t)−(∇β2log⁡L(β(t)|X))−1(∇βlog⁡L(β(t)|X))

  3. Stop when the convergence criteria is satisfied.

Newton-Raphson for logistic regression

log⁡L=∑i=1nyixiTβ−∑i=1nlog⁡(1+exp⁡{xiTβ})


∇βlog⁡L=∑i=1n[yi−exp⁡{xiTβ}1+exp⁡{xiTβ}]xi


∇β2log⁡L= −∑i=1n(11+exp⁡{xiTβ})(exp⁡{xiTβ}1+exp⁡{xiTβ})xixiT

PPE access example

📋 https://sta221-sp25.netlify.app/ae/ae-06-newton-raphson.html

Questions from this week’s content?

References

Andualem, Atsedemariam, Belachew Tegegne, Sewunet Ademe, Tarikuwa Natnael, Gete Berihun, Masresha Abebe, Yeshiwork Alemnew, et al. 2022. “COVID-19 Infection Prevention Practices Among a Sample of Food Handlers of Food and Drink Establishments in Ethiopia.” PLoS One 17 (1): e0259851.

🔗 STA 221 - Spring 2025

1 / 44
Logistic Regression: Assumptions + Estimation Prof. Maria Tackett Apr 10, 2025

  1. Slides

  2. Tools

  3. Close
  • Logistic Regression: Assumptions + Estimation
  • Announcements
  • Questions from this week’s content?
  • Topics
  • Computational setup
  • COVID-19 infection prevention practices at food establishments
  • Access to personal protective equipment
  • Full model results
  • Confidence interval for βj
  • Interpretation in terms of the odds
  • PPE Access: Interpret CI
  • Visualizations
  • Bivariate EDA: categorical predictor
  • Bivariate EDA: quantitative predictor
  • EDA: Potential interaction effect
  • Empirical logit
  • Calculating empirical logit (categorical predictor)
  • Calculating empirical logit (quantitative predictor)
  • Empirical logit plot in R (quantitative predictor)
  • Empirical logit plot in R (quantitative predictor)
  • Empirical logit plot in R (quantitative predictor)
  • Empirical logit plot in R (interactions)
  • Logistic regression model
  • Visualizing coefficient estimates
  • Assumptions for logistic regression
  • Assumptions for logistic regression
  • Checking linearity
  • Checking randomness
  • Checking independence
  • Estimating β
  • Estimating β
  • Estimating β
  • Newton-Raphson method
  • Newton-Raphson method
  • Example
  • Example
  • Slide 37
  • Example
  • Score vector & Hessian
  • Newton-Raphson for logistic regression
  • Newton-Raphson for logistic regression
  • PPE access example
  • Questions from this week’s content?
  • References
  • f Fullscreen
  • s Speaker View
  • o Slide Overview
  • e PDF Export Mode
  • r Scroll View Mode
  • b Toggle Chalkboard
  • c Toggle Notes Canvas
  • d Download Drawings
  • ? Keyboard Help