STA 221 - Spring 2025 – Variable transformations cont’d

Variability in life expectancy

Let’s consider a model that uses a country’s healthcare expenditure, income inequality, and education to predict its life expectancy.

Original model

life_exp_fit <- lm(life_exp ~ health_expenditure + income_inequality + education, 
                   data = health_data)

term	estimate	std.error	statistic	p.value
(Intercept)	78.575	1.775	44.274	0.000
health_expenditure	0.001	0.000	4.522	0.000
income_inequality	-0.484	0.061	-7.900	0.000
educationHigh	2.020	1.168	1.730	0.086

Original model: Residuals

Look at the residuals vs. each predictor to determine which variable has a non-linear relationship with life expectancy.

Residuals vs. predictors

There is a non-linear relationship between health expenditure and life expectancy.

Log Transformation on $X$

Try a transformation on $X$ if the scatterplot in EDA shows a non-linear relationship between $X$ and the response, and the residuals vs. fitted plot looks parabolic.

EDA

Model with Transformation on $X_{j}$

When we fit a model with predictor $\log (X_{j})$, we fit a model of the form

$y = X β + ϵ, ϵ \sim N (0, σ_{ϵ}^{2} I)$

such that $X$ has a column for $\log (X_{j})$.

The estimated regression model is

$\begin{aligned} \hat{y} & = X \hat{β} \\ \Rightarrow {\hat{y}}_{i} & = {\hat{β}}_{0} + {\hat{β}}_{1} x_{i 1} + \dots + {\hat{β}}_{j} \log (x_{i j}) + \dots + {\hat{β}}_{p} x_{i p} \end{aligned}$

Model interpretation

${\hat{y}}_{i} = {\hat{β}}_{0} + {\hat{β}}_{1} x_{i 1} + \dots + {\hat{β}}_{j} \log (x_{i j}) + \dots + {\hat{β}}_{p} x_{i p}$

Intercept: When $x_{i 1} = \dots = \log (x_{i j}) = \dots = x_{i p} = 0$, $y_{i}$ is expected to be ${\hat{β}}_{0}$, on average.
- $\log (x_{i j}) = 0$ when $x_{i j} = 1$
Coefficient of $X_{j}$: When $x_{i j}$ is multiplied by a factor of $C$, $y_{i}$ is expected to change by ${\hat{β}}_{j} \log (C)$ units, on average, holding all else constant. Here $\log$ is the natural log, which is what `log()` computes by default in R.
- Example: When $x_{i j}$ is multiplied by a factor of 2, $y_{i}$ is expected to change by ${\hat{β}}_{j} \log (2)$ units, on average, holding all else constant.
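
The factor-of-$C$ interpretation follows from the identity $\log (C x) = \log (C) + \log (x)$. As a short worked derivation, write ${\hat{y}}_{\text{old}}$ and ${\hat{y}}_{\text{new}}$ (labels introduced here only for illustration) for the predicted values before and after $x_{i j}$ is multiplied by $C$, with every other predictor held fixed so that all other terms cancel:

$\begin{aligned} {\hat{y}}_{\text{new}} - {\hat{y}}_{\text{old}} & = {\hat{β}}_{j} \log (C x_{i j}) - {\hat{β}}_{j} \log (x_{i j}) \\ & = {\hat{β}}_{j} [\log (C) + \log (x_{i j}) - \log (x_{i j})] \\ & = {\hat{β}}_{j} \log (C) \end{aligned}$

The change does not depend on the starting value $x_{i j}$, only on the multiplicative factor $C$.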

Model with log(X)

life_exp_logx_fit <- lm(life_exp ~ log(health_expenditure) + income_inequality 
                        + education, data = health_data)

term	estimate	std.error	statistic	p.value
(Intercept)	59.151	3.184	18.576	0.000
log(health_expenditure)	3.092	0.396	7.814	0.000
income_inequality	-0.362	0.058	-6.225	0.000
educationHigh	-0.168	1.103	-0.152	0.879

Interpret the intercept in the context of the data.
Interpret the effect of health expenditure in the context of the data.
Interpret the effect of education in the context of the data.
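
As a rough guide for the health expenditure question, the coefficient on `log(health_expenditure)` can be turned into a concrete number by evaluating ${\hat{β}}_{j} \log (C)$ for a chosen factor $C$. The sketch below uses the estimate 3.092 from the table above; the variable name `beta_log_he` and the choices $C = 2$ and $C = 1.1$ are illustrative assumptions, and the result is in the units of `life_exp` (presumably years).

```r
# Estimated coefficient on log(health_expenditure), copied from the model output
beta_log_he <- 3.092

# Expected change in life expectancy when health expenditure doubles (C = 2)
change_double <- beta_log_he * log(2)

# Expected change when health expenditure increases by 10% (C = 1.1)
change_10pct <- beta_log_he * log(1.1)

round(c(double = change_double, ten_percent = change_10pct), 3)
```

Under these assumptions, doubling health expenditure corresponds to an expected change of about 2.14 units of life expectancy, on average, holding income inequality and education constant.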

Model with log(X): Residuals

Comparing residual plots
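
A minimal sketch of how the residuals from the two models could be compared side by side. It uses simulated data (`sim_data`, a hypothetical stand-in for `health_data` with made-up coefficients), so the plots are illustrative only and do not reproduce the lecture's figures.

```r
# Simulated stand-in for health_data (hypothetical values, not the lecture's data)
set.seed(221)
n <- 140
sim_data <- data.frame(
  health_expenditure = exp(runif(n, log(50), log(8000))),
  income_inequality  = runif(n, 5, 40)
)
# The true relationship is linear in log(health_expenditure)
sim_data$life_exp <- 50 + 3 * log(sim_data$health_expenditure) -
  0.3 * sim_data$income_inequality + rnorm(n, sd = 2)

# Fit the model without and with the log transformation on the predictor
fit_raw  <- lm(life_exp ~ health_expenditure + income_inequality, data = sim_data)
fit_logx <- lm(life_exp ~ log(health_expenditure) + income_inequality, data = sim_data)

# Residuals vs. fitted values for each model, side by side
par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw),
     xlab = "Fitted values", ylab = "Residuals", main = "Original model")
abline(h = 0, lty = 2)
plot(fitted(fit_logx), resid(fit_logx),
     xlab = "Fitted values", ylab = "Residuals", main = "Model with log(X)")
abline(h = 0, lty = 2)
```

In this simulation, the left panel should show a curved pattern, while the right panel should scatter around zero with no clear shape.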

Is a model with log-transformed response and/or predictor still a “linear” model?

“Linear” model

What does it mean for a model to be a “linear” model?

Linear models are linear in the parameters, i.e. given an observation $y_{i}$

$y_{i} = β_{0} + β_{1} f_{1} (x_{i 1}) + \dots + β_{p} f_{p} (x_{i p}) + ϵ_{i}$
The functions $f_{1}, \dots, f_{p}$ can be non-linear as long as $y_{i}$ is a linear function of the parameters $β_{0}, β_{1}, \dots, β_{p}$.
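
To illustrate "linear in the parameters", the sketch below simulates a response that depends on a squared term and an exponential term and fits it with `lm()`. The data, variable names, and true coefficient values (1, 2, -0.5, 0.8) are all hypothetical and chosen only for this example.

```r
set.seed(1)
n  <- 200
x1 <- runif(n, 0, 3)
x2 <- runif(n, 0, 2)

# The predictors enter through non-linear functions (x1^2, exp(x2)),
# but the mean of y is still a linear combination of the betas
y <- 1 + 2 * x1 - 0.5 * x1^2 + 0.8 * exp(x2) + rnorm(n, sd = 0.1)

# I() protects x1^2 so that ^ is treated as arithmetic inside the formula
fit <- lm(y ~ x1 + I(x1^2) + exp(x2))
coef(fit)
```

Because each parameter multiplies a fixed, known function of the predictors, least squares should recover estimates close to the values used in the simulation. A model such as $y_{i} = e^{β_{0} + β_{1} x_{i 1}} + ϵ_{i}$ cannot be written in this form, so it cannot be fit directly with `lm()`.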

Identify the linear models

$y_{i} = β_{0} + β_{1} x_{i 1} + β_{2} x_{i 1}^{2} + β_{3} x_{i 2} + ϵ_{i}$
$y_{i} = β_{1} x_{i 1} + β_{2} x_{i 2} + β_{3} x_{i 1} x_{i 2} + ϵ_{i}$
$y_{i} = β_{0} + β_{1} \sin (x_{i 1} + β_{2} x_{i 2}) + β_{3} x_{i 3} + ϵ_{i}$
$y_{i} = β_{0} + β_{1} e^{x_{i 1}} + β_{2} e^{x_{i 2}} + ϵ_{i}$
$y_{i} = e^{(β_{0} + β_{1} x_{i 1} + β_{2} x_{i 2} + β_{3} x_{i 3})} + ϵ_{i}$

Learn more

See Log Transformations in Linear Regression for more details about interpreting regression models with log-transformed variables.

Recap

Introduced log-transformation on the predictor
Identified linear models

Remaining questions?

Please submit any questions you have about multicollinearity and variable transformations.

References

Zarulli, Virginia, Elizaveta Sopina, Veronica Toffolutti, and Adam Lenart. 2021. “Health Care System Efficiency and Life Expectancy: A 140-Country Study.” Edited by Srinivas Goli. PLOS ONE 16 (7): e0253450. https://doi.org/10.1371/journal.pone.0253450.
