Variable transformations cont’d

Author

Prof. Maria Tackett

Published

Mar 06, 2025

Announcements

  • HW 03 due March 20 at 11:59pm

  • Next project milestone: Exploratory data analysis due March 20

    • Work on it in lab March 7


Have a good spring break! 😎

Computing set up

# load packages
library(tidyverse)  
library(tidymodels)  
library(knitr)       
library(patchwork)

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Topics

  • Log-transformation on the predictor
  • Identify linear models

Variable transformations

Data: Life expectancy in 140 countries

The data set comes from Zarulli et al. (2021), who analyze the effects of a country’s healthcare expenditures and other factors on the country’s life expectancy. The data are originally from the Human Development Database and World Health Organization.

There are 140 countries (observations) in the data set.

The original research paper is Zarulli et al. (2021), listed in the References.

Variables

  • life_exp: The average number of years that a newborn could expect to live, if he or she were to pass through life exposed to the sex- and age-specific death rates prevailing at the time of his or her birth, for a specific year, in a given country, territory, or geographic area. (from the World Health Organization)

  • income_inequality: Measure of the deviation of the distribution of income among individuals or households within a country from a perfectly equal distribution. A value of 0 represents absolute equality, a value of 100 absolute inequality (based on Gini coefficient). (from Zarulli et al. (2021))

Variables

  • education: Indicator of whether a country’s education index is above (High) or below (Low) the median index for the 140 countries in the data set.

    • Education index: Average of mean years of schooling (of adults) and expected years of school (of children), both expressed as an index obtained by scaling with the corresponding maxima.
  • health_expenditure: Per capita current spending on healthcare goods and services, expressed in international Purchasing Power Parity (PPP) dollars (from the World Health Organization)

Log transformation on a predictor variable

Variability in life expectancy

Let’s consider a model using a country’s healthcare expenditure, income inequality, and education to predict its life expectancy.

Original model

life_exp_fit <- lm(life_exp ~ health_expenditure + income_inequality + education, 
                   data = health_data)
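
| term               | estimate | std.error | statistic | p.value |
|:-------------------|---------:|----------:|----------:|--------:|
| (Intercept)        |   78.575 |     1.775 |    44.274 |   0.000 |
| health_expenditure |    0.001 |     0.000 |     4.522 |   0.000 |
| income_inequality  |   -0.484 |     0.061 |    -7.900 |   0.000 |
| educationHigh      |    2.020 |     1.168 |     1.730 |   0.086 |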

Original model: Residuals

Look at the residuals vs. each predictor to determine which variable has a non-linear relationship with life expectancy.

Residuals vs. predictors


There is a non-linear relationship between health expenditure and life expectancy.
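The residual check described above can be sketched in base R. Because the lecture's `health_data` is not reproduced here, this sketch uses simulated data: the names `sim_data`, `x`, `z`, and `y` are hypothetical stand-ins, and the data-generating values are arbitrary assumptions chosen only so that `y` depends on `log(x)`.

```r
# Simulated stand-in for the lecture data (hypothetical; not health_data)
set.seed(210)
n <- 140
sim_data <- data.frame(
  x = exp(runif(n, min = 3, max = 9)),  # right-skewed predictor
  z = rnorm(n, mean = 20, sd = 8)       # second predictor
)
sim_data$y <- 60 + 3 * log(sim_data$x) - 0.3 * sim_data$z + rnorm(n, sd = 2)

# Fit a model that enters x on its original scale
fit_raw <- lm(y ~ x + z, data = sim_data)
sim_data$resid_raw <- resid(fit_raw)

# Plot residuals against each predictor; a curved pattern suggests
# a non-linear relationship with that predictor
par(mfrow = c(1, 2))
plot(sim_data$x, sim_data$resid_raw, xlab = "x", ylab = "Residual",
     main = "Residuals vs. x")
abline(h = 0, lty = 2)
plot(sim_data$z, sim_data$resid_raw, xlab = "z", ylab = "Residual",
     main = "Residuals vs. z")
abline(h = 0, lty = 2)
```

Base R graphics are used here only to keep the sketch dependency-free; the same plots could be drawn with ggplot2.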

Log Transformation on X

Try a transformation on X if the scatterplot in the EDA shows a non-linear relationship and the plot of residuals vs. fitted values looks parabolic.

EDA

Model with Transformation on Xj

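When we fit a model with predictor $\log(X_j)$, we fit a model of the form

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I})$$

such that $\mathbf{X}$ has a column for $\log(X_j)$.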


The estimated regression model is

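$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$$

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_j \log(x_{ij}) + \dots + \hat{\beta}_p x_{ip}$$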

Model interpretation

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_j \log(x_{ij}) + \dots + \hat{\beta}_p x_{ip}$$

  • Intercept: When $x_{i1} = \dots = \log(x_{ij}) = \dots = x_{ip} = 0$, $y_i$ is expected to be $\hat{\beta}_0$, on average.

    • $\log(x_{ij}) = 0$ when $x_{ij} = 1$
  • Coefficient of $X_j$: When $x_{ij}$ is multiplied by a factor of $C$, $y_i$ is expected to change by $\hat{\beta}_j \log(C)$ units, on average, holding all else constant.

    • Example: When $x_{ij}$ is multiplied by a factor of 2, $y_i$ is expected to increase by $\hat{\beta}_j \log(2)$ units, on average, holding all else constant.
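The multiplicative interpretation above can be checked numerically. The following is a minimal sketch on simulated data (the variables `x`, `z`, `y` and all numeric values are hypothetical assumptions, not the lecture data): it compares the change in the model's prediction when `x` doubles against the rule $\hat{\beta}_j \log(2)$.

```r
# Simulated data (hypothetical values) with a log relationship in x
set.seed(210)
n <- 200
x <- exp(runif(n, min = 2, max = 8))
z <- rnorm(n)
y <- 50 + 3 * log(x) - 0.5 * z + rnorm(n)

# Fit the model with a log-transformed predictor
fit_logx <- lm(y ~ log(x) + z)
b_logx <- coef(fit_logx)[["log(x)"]]

# Predicted change in y when x doubles (100 -> 200), holding z fixed
base_obs    <- data.frame(x = 100, z = 0)
doubled_obs <- data.frame(x = 200, z = 0)
pred_change <- unname(predict(fit_logx, doubled_obs) - predict(fit_logx, base_obs))

# Change implied by the interpretation rule: coefficient times log(C), C = 2
rule_change <- b_logx * log(2)

# The two quantities agree up to floating-point error
all.equal(pred_change, rule_change)
```

The starting value of `x` does not matter: doubling from 100 to 200 gives the same predicted change as doubling from 1,000 to 2,000, because only the ratio enters through `log(C)`.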

Model with log(X)

life_exp_logx_fit <- lm(life_exp ~ log(health_expenditure) + income_inequality 
                        + education, data = health_data)
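
| term                    | estimate | std.error | statistic | p.value |
|:------------------------|---------:|----------:|----------:|--------:|
| (Intercept)             |   59.151 |     3.184 |    18.576 |   0.000 |
| log(health_expenditure) |    3.092 |     0.396 |     7.814 |   0.000 |
| income_inequality       |   -0.362 |     0.058 |    -6.225 |   0.000 |
| educationHigh           |   -0.168 |     1.103 |    -0.152 |   0.879 |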


  • Interpret the intercept in the context of the data.

  • Interpret the effect of health expenditure in the context of the data.

  • Interpret the effect of education in the context of the data.
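As a starting point for the health expenditure question, the arithmetic behind the interpretation can be sketched in R using the coefficient reported in the model output above. The variable names are hypothetical, and the choice of multipliers (doubling, a 10% increase) is an assumption made for illustration. `log()` in R is the natural log, matching the `log(health_expenditure)` term in the model formula.

```r
# Coefficient of log(health_expenditure) as reported in the model output
b_log_health <- 3.092

# Expected change in life expectancy (years) when health expenditure is
# multiplied by C, holding income inequality and education constant
effect_doubling   <- b_log_health * log(2)    # C = 2
effect_10_percent <- b_log_health * log(1.1)  # C = 1.1

round(c(doubling = effect_doubling, ten_percent = effect_10_percent), 3)
```

Under this model, doubling health expenditure corresponds to an expected increase of roughly 2.1 years in life expectancy, on average, holding the other predictors constant.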

Model with log(X): Residuals

Comparing residual plots




Is a model with log-transformed response and/or predictor still a “linear” model?

“Linear” model

What does it mean for a model to be a “linear” model?

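  • Linear models are linear in the parameters, i.e., given an observation $y_i$,

    $$y_i = \beta_0 + \beta_1 f_1(x_{i1}) + \dots + \beta_p f_p(x_{ip}) + \epsilon_i$$

  • The functions $f_1, \dots, f_p$ can be non-linear as long as $y_i$ is linear in the parameters $\beta_0, \beta_1, \dots, \beta_p$.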
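One way to see what "linear in the parameters" buys us: when the predictors enter through non-linear functions, the model can still be written as $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ with the transformed values as columns of $\mathbf{X}$, so ordinary least squares applies. The sketch below uses simulated data (all names and values are hypothetical assumptions) and the standard least-squares normal equations, which are not derived in this section.

```r
# Simulated data (hypothetical values): y depends on log(x1) and x2^2
set.seed(210)
n <- 100
x1 <- runif(n, min = 1, max = 10)
x2 <- runif(n, min = -2, max = 2)
y <- 2 + 1.5 * log(x1) - 0.8 * x2^2 + rnorm(n, sd = 0.5)

# Non-linear functions of the predictors, but linear in the coefficients
fit <- lm(y ~ log(x1) + I(x2^2))

# The design matrix has one column per transformed predictor plus an intercept
X <- model.matrix(fit)

# Solve the normal equations (X'X) beta = X'y directly
beta_hat <- solve(crossprod(X), crossprod(X, y))

# The hand-computed estimates match those from lm()
all.equal(unname(coef(fit)), as.vector(beta_hat))
```

A model in which a parameter sits inside a non-linear function cannot be written with a fixed design matrix this way, so it would need a different fitting approach such as non-linear least squares.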

Identify the linear models

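  1. $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i1}^2 + \beta_3 x_{i2} + \epsilon_i$

  2. $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \epsilon_i$

  3. $y_i = \beta_0 + \beta_1 \sin(x_{i1} + \beta_2 x_{i2}) + \beta_3 x_{i3} + \epsilon_i$

  4. $y_i = \beta_0 + \beta_1 e^{x_{i1}} + \beta_2 e^{x_{i2}} + \epsilon_i$

  5. $y_i = e^{(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3})} + \epsilon_i$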

Learn more

See Log Transformations in Linear Regression for more details about interpreting regression models with log-transformed variables.

Recap

  • Introduced log-transformation on the predictor
  • Identified linear models

Remaining questions?

Please submit any questions you have about multicollinearity and variable transformations.

References

Zarulli, Virginia, Elizaveta Sopina, Veronica Toffolutti, and Adam Lenart. 2021. “Health Care System Efficiency and Life Expectancy: A 140-Country Study.” Edited by Srinivas Goli. PLOS ONE 16 (7): e0253450. https://doi.org/10.1371/journal.pone.0253450.