Variable transformations

Author

Prof. Maria Tackett

Published

Mar 04, 2025

Computing set up

# load packages
library(tidyverse)  
library(tidymodels)  
library(knitr)       
library(patchwork)

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Topics

Log-transformation on the response

Variable transformations

Data: Life expectancy in 140 countries

The data set comes from Zarulli et al. (2021), who analyze the effects of a country’s healthcare expenditures and other factors on the country’s life expectancy. The data are originally from the Human Development Database and the World Health Organization.

There are 140 countries (observations) in the data set.

See Zarulli et al. (2021) in the References for the original research paper.

Variables

life_exp: The average number of years that a newborn could expect to live, if he or she were to pass through life exposed to the sex- and age-specific death rates prevailing at the time of his or her birth, for a specific year, in a given country, territory, or geographic area. (from the World Health Organization)
income_inequality: Measure of the deviation of the distribution of income among individuals or households within a country from a perfectly equal distribution. A value of 0 represents absolute equality, a value of 100 absolute inequality (based on Gini coefficient). (from Zarulli et al. (2021))

Variables

education: Indicator of whether a country’s education index is above (High) or below (Low) the median index for the 140 countries in the data set.
- Education index: Average of mean years of schooling (of adults) and expected years of schooling (of children), both expressed as an index obtained by scaling with the corresponding maxima.
health_expend: Per capita current spending on healthcare goods and services, expressed in international Purchasing Power Parity (PPP) dollars. (from the World Health Organization)

Exploratory data analysis

Exploratory data analysis

The goal is to use income inequality and education to understand variability in health expenditure.

Original model

health_fit <- lm(health_expenditure ~ income_inequality + education,
                 data = health_data)

term	estimate	std.error	statistic	p.value
(Intercept)	2070.599	534.653	3.873	0.000
income_inequality	-64.346	18.626	-3.455	0.001
educationHigh	1039.298	359.736	2.889	0.004

Original model: Residuals vs. fitted

What model assumption(s) appear to be violated?

Consider different transformations…

Transformation on $Y$

Identifying a need to transform Y

Typically, a “fan-shaped” residual plot indicates the need for a transformation of the response variable $Y$.
- There are multiple ways to transform a variable, e.g., $Y^{1 / 2}$, $1 / Y$, $\log (Y)$. These are called variance stabilizing transformations.
- $\log (Y)$ is the most straightforward to interpret, so we use that transformation when possible.

When building a model:
- Choose a transformation and build the model on the transformed data
- Reassess the residual plots
- If the residual plots did not sufficiently improve, try a new transformation!
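The workflow above can be sketched in R as follows. This is a minimal illustration on simulated data, not the life expectancy data: the variables `x` and `y`, the simulation settings, and the helper `spread_ratio()` are all assumptions introduced here. The helper gives a rough numeric summary of the fan shape by comparing the residual spread for large versus small fitted values; it is meant to complement the residual plots, not to replace them.

```r
# Simulated data (an assumption for illustration only)
set.seed(221)
n <- 200
x <- runif(n, 0, 10)
# Multiplicative errors: the spread of y grows with its mean
y <- exp(1 + 0.3 * x + rnorm(n, sd = 0.5))

# Step 1: fit the model on the original scale
fit_raw <- lm(y ~ x)

# Step 2: choose a transformation and refit on the transformed response
fit_log <- lm(log(y) ~ x)

# Hypothetical helper: ratio of the residual SD for observations with
# above-median fitted values to the residual SD for the rest.
# Values far above 1 suggest a fan shape.
spread_ratio <- function(fit) {
  upper <- fitted(fit) > median(fitted(fit))
  sd(resid(fit)[upper]) / sd(resid(fit)[!upper])
}

spread_raw <- spread_ratio(fit_raw)
spread_log <- spread_ratio(fit_log)

# Step 3: reassess the residual plots side by side
op <- par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw),
     xlab = "Fitted values", ylab = "Residuals", main = "Response: y")
abline(h = 0, lty = 2)
plot(fitted(fit_log), resid(fit_log),
     xlab = "Fitted values", ylab = "Residuals", main = "Response: log(y)")
abline(h = 0, lty = 2)
par(op)
```

If the spread in the second plot still changes noticeably with the fitted values, that is the cue to try a different transformation.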

Log transformation on $Y$

If we apply a log transformation to the response variable, we want to estimate the parameters for the statistical model

$\log(Y) = X \beta + \epsilon, \quad \epsilon \sim N(0, \sigma_{\epsilon}^{2} I)$

The regression equation is

$\widehat{\log(Y)} = X \hat{\beta}$

Log transformation on $Y$

We fit the model in terms of $\log (Y)$ but want to interpret the model in terms of the original variable $Y$ , so we need to write the regression equation in terms of $Y$

$\begin{aligned} \widehat{\log(Y)} &= X \hat{\beta} \\ \Rightarrow \quad \hat{Y} &= e^{X \hat{\beta}} \end{aligned}$
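In R, this back-transformation amounts to exponentiating the predictions from the model fit on the log scale. The sketch below uses simulated data; the data frame `sim`, its columns, and the new observations in `new_obs` are assumptions made for illustration.

```r
# Hypothetical data for illustration only
set.seed(1)
sim <- data.frame(x = runif(50, 0, 5))
sim$y <- exp(0.5 + 0.4 * sim$x + rnorm(50, sd = 0.3))

# Fit the model with the log-transformed response
fit_log <- lm(log(y) ~ x, data = sim)

# Predictions on the log scale, then back on the original scale
new_obs  <- data.frame(x = c(1, 2, 3))
pred_log <- predict(fit_log, newdata = new_obs)
pred_y   <- exp(pred_log)

# The same calculation by hand: exp(X %*% beta-hat)
X_new        <- cbind(1, new_obs$x)
pred_by_hand <- exp(drop(X_new %*% coef(fit_log)))
```

Both `pred_y` and `pred_by_hand` are on the scale of the original response, so they are always positive.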

Model interpretation

$\begin{aligned} \hat{y}_{i} &= e^{x_{i} \hat{\beta}} \\ &= e^{(\hat{\beta}_{0} + \hat{\beta}_{1} x_{i 1} + \dots + \hat{\beta}_{p} x_{i p})} \\ &= e^{\hat{\beta}_{0}} e^{\hat{\beta}_{1} x_{i 1}} \dots e^{\hat{\beta}_{p} x_{i p}} \end{aligned}$

Intercept: When $x_{i 1} = \dots = x_{i p} = 0$ , $y_{i}$ is expected to be $e^{{\hat{β}}_{0}}$
Coefficient of $X_{j}$ : For every one unit increase in $x_{i j}$ , $y_{i}$ is expected to multiply by a factor of $e^{{\hat{β}}_{j}}$ , holding all else constant.
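To see where the multiplicative interpretation comes from, compare the predicted responses for two observations that differ by one unit in $x_{j}$ and agree on every other predictor. The notation $\hat{y}(x_{j})$ is shorthand introduced here for the predicted response as a function of $x_{j}$ with the other predictors held fixed.

$\begin{aligned} \frac{\hat{y}(x_{j} + 1)}{\hat{y}(x_{j})} &= \frac{e^{\hat{\beta}_{0}} \cdots e^{\hat{\beta}_{j} (x_{j} + 1)} \cdots e^{\hat{\beta}_{p} x_{p}}}{e^{\hat{\beta}_{0}} \cdots e^{\hat{\beta}_{j} x_{j}} \cdots e^{\hat{\beta}_{p} x_{p}}} \\ &= \frac{e^{\hat{\beta}_{j} (x_{j} + 1)}}{e^{\hat{\beta}_{j} x_{j}}} \\ &= e^{\hat{\beta}_{j}} \end{aligned}$

Every factor except the one involving $x_{j}$ cancels, so the ratio does not depend on the values of the other predictors or on the starting value of $x_{j}$.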

Model with log(Y)

term	estimate	std.error	statistic
(Intercept)	7.096	0.324	21.895
income_inequality	-0.065	0.011	-5.714
educationHigh	1.117	0.218	5.121

Interpret each of the following in terms of health expenditure

Intercept
income_inequality
education
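As a starting point for these interpretations, the estimates in the table can be exponentiated to move from the log scale to multiplicative factors. The sketch below copies the estimates from the table by hand rather than refitting the model; the object names `log_coefs`, `mult_factors`, and `pct_change` are assumptions introduced here.

```r
# Estimates copied from the "Model with log(Y)" table
log_coefs <- c(
  "(Intercept)"     = 7.096,
  income_inequality = -0.065,
  educationHigh     = 1.117
)

# Exponentiate to get factors on the original (health expenditure) scale
mult_factors <- exp(log_coefs)
round(mult_factors, 3)

# Percent change in expected health expenditure implied by each slope
pct_change <- 100 * (mult_factors[-1] - 1)
round(pct_change, 1)
```

The exponentiated intercept is on the scale of health expenditure and applies to a country with `income_inequality` equal to 0 and `education` equal to Low. The other two values are multiplicative factors: a factor below 1 corresponds to a decrease and a factor above 1 to an increase, holding the other variable constant.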

Model with log(Y): Residuals

Compare residual plots

Learn more

See Log Transformations in Linear Regression for more details about interpreting regression models with log-transformed variables.

Recap

Log-transformation on the response

References

Zarulli, Virginia, Elizaveta Sopina, Veronica Toffolutti, and Adam Lenart. 2021. “Health Care System Efficiency and Life Expectancy: A 140-Country Study.” Edited by Srinivas Goli. PLOS ONE 16 (7): e0253450. https://doi.org/10.1371/journal.pone.0253450.
