SLR: Matrix representation

Author

Prof. Maria Tackett

Published

Jan 21, 2025

Announcements

  • Lab 01 due TODAY at 11:59pm

    • Push work to GitHub repo

    • Submit final PDF on Gradescope + mark pages for each question

  • HW 01 will be assigned on Thursday

Topics

  • Application exercise on model assessment
  • Matrix representation of simple linear regression
    • Model form
    • Least squares estimate
    • Predicted (fitted) values
    • Residuals

Model assessment

Two statistics

  • Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)

    \[ RMSE = \sqrt{\frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{n}} = \sqrt{\frac{\sum_{i=1}^ne_i^2}{n}} \]

  • R-squared, \(R^2\) : Percentage of variability in the outcome explained by the regression model (in the context of SLR, the predictor)

\[R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST}\]
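
As an illustration (not part of the lecture materials), here is a minimal Python/NumPy sketch that computes both statistics from a hypothetical set of observed and fitted values; the data and variable names are made up for the example.

```python
import numpy as np

# Hypothetical observed and fitted values (illustrative only)
y = np.array([3.1, 4.5, 5.0, 7.2, 8.1])
y_hat = np.array([3.0, 4.8, 5.3, 6.9, 8.4])
n = len(y)

e = y - y_hat                          # residuals e_i = y_i - y_hat_i

rmse = np.sqrt(np.sum(e**2) / n)       # RMSE = sqrt(sum(e_i^2) / n)

ssr = np.sum(e**2)                     # residual sum of squares (SSR)
sst = np.sum((y - y.mean())**2)        # total sum of squares (SST)
r_squared = 1 - ssr / sst              # R^2 = 1 - SSR/SST

print(round(rmse, 3), round(r_squared, 3))
```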

Application exercise

📋 sta221-sp25.netlify.app/ae/ae-01-model-assessment.html

Open ae-01 from last class. Complete Part 2.

Matrix representation of simple linear regression

SLR: Statistical model (population)

When we have a quantitative response, \(Y\), and a single quantitative predictor, \(X\), we can use a simple linear regression model to describe the relationship between \(Y\) and \(X\).

\[Y = \beta_0 + \beta_1 X + \epsilon\]


  • \(\beta_1\): Population (true) slope of the relationship between \(X\) and \(Y\)
  • \(\beta_0\): Population (true) intercept of the relationship between \(X\) and \(Y\)
  • \(\epsilon\): Error terms centered at 0 with variance \(\sigma^2_{\epsilon}\)

SLR in matrix form

The simple linear regression model can be represented using vectors and matrices as

\[ \large{\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}} \]

  • \(\mathbf{y}\) : Vector of responses

  • \(\mathbf{X}\): Design matrix (columns for predictors + intercept)

  • \(\boldsymbol{\beta}\): Vector of model coefficients

  • \(\boldsymbol{\epsilon}\): Vector of error terms centered at \(\mathbf{0}\) with variance \(\sigma^2_{\epsilon}\mathbf{I}\)

SLR in matrix form

\[ \underbrace{ \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} }_ {\mathbf{y}} \hspace{3mm} = \hspace{3mm} \underbrace{ \begin{bmatrix} 1 &x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} }_{\mathbf{X}} \hspace{2mm} \underbrace{ \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} }_{\boldsymbol{\beta}} \hspace{3mm} + \hspace{3mm} \underbrace{ \begin{bmatrix} \epsilon_1 \\ \vdots\\ \epsilon_n \end{bmatrix} }_\boldsymbol{\epsilon} \]


What are the dimensions of \(\mathbf{y}\), \(\mathbf{X}\), \(\boldsymbol{\beta}\), and \(\boldsymbol{\epsilon}\)?
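
To make the dimensions concrete, here is an illustrative NumPy sketch (hypothetical data, not from the lecture) that builds the SLR design matrix; the printed shapes reflect the answer to the question above: \(\mathbf{y}\) is \(n \times 1\), \(\mathbf{X}\) is \(n \times 2\), \(\boldsymbol{\beta}\) is \(2 \times 1\), and \(\boldsymbol{\epsilon}\) is \(n \times 1\).

```python
import numpy as np

# Hypothetical data with n = 5 observations (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix: a column of 1s for the intercept, then the predictor column
X = np.column_stack([np.ones_like(x), x])

print(X.shape)   # (5, 2): X is n x 2
print(y.shape)   # (5,):   y holds the n x 1 response vector
```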

Derive least squares estimator for \(\boldsymbol{\beta}\)

Goal: Find estimator \(\hat{\boldsymbol{\beta}}= \begin{bmatrix}\hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix}\) that minimizes the sum of squared errors \[ \sum_{i=1}^n \epsilon_i^2 = \boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\mathsf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]

Gradient

Let \(\mathbf{x} = \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\x_k\end{bmatrix}\) be a \(k \times 1\) vector and \(f(\mathbf{x})\) be a function of \(\mathbf{x}\).

Then \(\nabla_\mathbf{x}f\), the gradient of \(f\) with respect to \(\mathbf{x}\), is

\[ \nabla_\mathbf{x}f = \begin{bmatrix}\frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_k}\end{bmatrix} \]

Property 1

Let \(\mathbf{x}\) be a \(k \times 1\) vector and \(\mathbf{z}\) be a \(k \times 1\) vector, such that \(\mathbf{z}\) is not a function of \(\mathbf{x}\).


The gradient of \(\mathbf{x}^\mathsf{T}\mathbf{z}\) with respect to \(\mathbf{x}\) is

\[ \nabla_\mathbf{x} \hspace{1mm} \mathbf{x}^\mathsf{T}\mathbf{z} = \mathbf{z} \]

Side note: Property 1

\[ \begin{aligned} \mathbf{x}^\mathsf{T}\mathbf{z} &= \begin{bmatrix}x_1 & x_2 & \dots &x_k\end{bmatrix} \begin{bmatrix}z_1 \\ z_2 \\ \vdots \\z_k\end{bmatrix} \\[10pt] &= x_1z_1 + x_2z_2 + \dots + x_kz_k \\ &= \sum_{i=1}^k x_iz_i \end{aligned} \]

Side note: Property 1

\[ \nabla_\mathbf{x}\hspace{1mm}\mathbf{x}^\mathsf{T}\mathbf{z} = \begin{bmatrix}\frac{\partial \mathbf{x}^\mathsf{T}\mathbf{z}}{\partial x_1} \\ \frac{\partial \mathbf{x}^\mathsf{T}\mathbf{z}}{\partial x_2} \\ \vdots \\ \frac{\partial \mathbf{x}^\mathsf{T}\mathbf{z}}{\partial x_k}\end{bmatrix} = \begin{bmatrix}\frac{\partial}{\partial x_1} (x_1z_1 + x_2z_2 + \dots + x_kz_k) \\ \frac{\partial}{\partial x_2} (x_1z_1 + x_2z_2 + \dots + x_kz_k)\\ \vdots \\ \frac{\partial}{\partial x_k} (x_1z_1 + x_2z_2 + \dots + x_kz_k)\end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_k\end{bmatrix} = \mathbf{z} \]
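
Property 1 can also be checked numerically. The sketch below (illustrative, not from the slides) compares a central-difference approximation of the gradient of \(\mathbf{x}^\mathsf{T}\mathbf{z}\) to \(\mathbf{z}\).

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
x = rng.normal(size=k)
z = rng.normal(size=k)                  # z does not depend on x

f = lambda v: v @ z                     # f(x) = x'z

# Central-difference approximation of each partial derivative
eps = 1e-6
I = np.eye(k)
grad = np.array([(f(x + eps * I[i]) - f(x - eps * I[i])) / (2 * eps)
                 for i in range(k)])

print(np.allclose(grad, z))             # True: the gradient of x'z is z
```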

Property 2

Let \(\mathbf{x}\) be a \(k \times 1\) vector and \(\mathbf{A}\) be a \(k \times k\) matrix, such that \(\mathbf{A}\) is not a function of \(\mathbf{x}\).


Then the gradient of \(\mathbf{x}^\mathsf{T}\mathbf{A}\mathbf{x}\) with respect to \(\mathbf{x}\) is

\[ \nabla_\mathbf{x} \hspace{1mm} \mathbf{x}^\mathsf{T}\mathbf{A}\mathbf{x} = (\mathbf{A}\mathbf{x} + \mathbf{A}^\mathsf{T} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^\mathsf{T})\mathbf{x} \]


If \(\mathbf{A}\) is symmetric, then

\[ (\mathbf{A} + \mathbf{A}^\mathsf{T})\mathbf{x} = 2\mathbf{A}\mathbf{x} \]

Proof in HW 01.
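
The proof is left for HW 01. Purely as a numerical sanity check (illustrative only, not the proof), a finite-difference gradient of \(\mathbf{x}^\mathsf{T}\mathbf{A}\mathbf{x}\) matches \((\mathbf{A} + \mathbf{A}^\mathsf{T})\mathbf{x}\), and \(2\mathbf{A}\mathbf{x}\) when \(\mathbf{A}\) is symmetric.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3
x = rng.normal(size=k)
A = rng.normal(size=(k, k))             # A does not depend on x
A_sym = (A + A.T) / 2                   # a symmetric matrix for the second check

def num_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    I = np.eye(len(x))
    return np.array([(f(x + eps * I[i]) - f(x - eps * I[i])) / (2 * eps)
                     for i in range(len(x))])

print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x))      # True
print(np.allclose(num_grad(lambda v: v @ A_sym @ v, x), 2 * A_sym @ x))  # True
```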

Derive least squares estimator

Find \(\hat{\boldsymbol{\beta}}\) that minimizes

\[ \begin{aligned} \boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\mathsf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\[10pt] &= (\mathbf{y}^\mathsf{T} - \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T})(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\\[10pt] &=\mathbf{y}^\mathsf{T}\mathbf{y} - \mathbf{y}^\mathsf{T}\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} + \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}\\[10pt] &=\mathbf{y}^\mathsf{T}\mathbf{y} - 2\boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} + \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta} \end{aligned} \]

The last step combines the two middle terms: \(\mathbf{y}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}\) is a scalar, so it equals its transpose \(\boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y}\).

Derive least squares estimator

\[\begin{aligned} \nabla_{\boldsymbol{\beta}}\boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} &= \nabla_{\boldsymbol{\beta}}( \mathbf{y}^\mathsf{T}\mathbf{y} - 2\boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} + \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}) \\[10pt] & = -2\mathbf{X}^\mathsf{T}\mathbf{y} + 2\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta} \end{aligned} \]

(Property 1 gives the middle term; Property 2 with the symmetric matrix \(\mathbf{A} = \mathbf{X}^\mathsf{T}\mathbf{X}\) gives the last term.)

Set the gradient equal to \(\mathbf{0}\) and find the \(\hat{\boldsymbol{\beta}}\) that satisfies

\[ -2\mathbf{X}^\mathsf{T}\mathbf{y} + 2\mathbf{X}^\mathsf{T}\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0} \]

Solving for \(\hat{\boldsymbol{\beta}}\) (assuming \(\mathbf{X}^\mathsf{T}\mathbf{X}\) is invertible) gives

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y}\]
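
A quick illustrative sketch (hypothetical data, not from the lecture) that evaluates this estimator in NumPy; in practice, solving the normal equations is preferred to forming the inverse explicitly, and both give the same \(\hat{\boldsymbol{\beta}}\).

```python
import numpy as np

# Hypothetical data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])    # n x 2 design matrix

# Least squares estimator, written exactly as on the slide
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferable: solve the normal equations instead of inverting
beta_hat_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)                              # [intercept estimate, slope estimate]
print(np.allclose(beta_hat, beta_hat_solve)) # True
```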

Did we find a minimum?

Hessian matrix

The Hessian matrix, \(\nabla_\mathbf{x}^2f\), is a \(k \times k\) matrix of second partial derivatives

\[ \nabla_{\mathbf{x}}^2f = \begin{bmatrix} \frac{\partial^2f}{\partial x_1^2} & \frac{\partial^2f}{\partial x_1 \partial x_2} & \dots & \frac{\partial^2f}{\partial x_1\partial x_k} \\ \frac{\partial^2f}{\partial x_2 \partial x_1} & \frac{\partial^2f}{\partial x_2^2} & \dots & \frac{\partial^2f}{\partial x_2 \partial x_k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2f}{\partial x_k\partial x_1} & \frac{\partial^2f}{\partial x_k\partial x_2} & \dots & \frac{\partial^2f}{\partial x_k^2} \end{bmatrix} \]

Using the Hessian matrix

If the Hessian matrix is…

  • positive-definite, then we have found a minimum.

  • negative-definite, then we have found a maximum.

  • neither positive-definite nor negative-definite, then we have found a saddle point.

Did we find a minimum?

\[ \begin{aligned} \nabla^2_{\boldsymbol{\beta}} \boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} &= \nabla_{\boldsymbol{\beta}} (-2\mathbf{X}^\mathsf{T}\mathbf{y} + 2\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}) \\[10pt] &= -2\nabla_{\boldsymbol{\beta}}(\mathbf{X}^\mathsf{T}\mathbf{y}) + 2\nabla_{\boldsymbol{\beta}}(\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}) \\[10pt] &= \mathbf{0} + 2\mathbf{X}^\mathsf{T}\mathbf{X} \propto \mathbf{X}^\mathsf{T}\mathbf{X} \end{aligned} \]

Show that \(\mathbf{X}^\mathsf{T}\mathbf{X}\) is positive definite in HW 01.
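
The HW asks for a proof; purely as a numerical illustration (not the proof), the eigenvalues of \(\mathbf{X}^\mathsf{T}\mathbf{X}\) for a hypothetical design matrix are all positive, which is what positive definiteness requires.

```python
import numpy as np

# Hypothetical design matrix (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])

XtX = X.T @ X
eigenvalues = np.linalg.eigvalsh(XtX)   # eigvalsh: suitable because X'X is symmetric

# All eigenvalues > 0 => X'X is positive definite, so the critical point is a minimum
print(eigenvalues, np.all(eigenvalues > 0))
```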

Predicted values and residuals

Predicted (fitted) values

Now that we have \(\hat{\boldsymbol{\beta}}\), let’s predict values of \(\mathbf{y}\) using the model

\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \underbrace{\mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}}_{\mathbf{H}}\mathbf{y} = \mathbf{H}\mathbf{y} \]

Hat matrix: \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\)

  • \(\mathbf{H}\) is an \(n\times n\) matrix
  • Maps the vector of observed values \(\mathbf{y}\) to the vector of fitted values \(\hat{\mathbf{y}}\)
  • It is a function of \(\mathbf{X}\) only, not \(\mathbf{y}\)
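
A brief illustrative check (hypothetical data, not from the lecture) that \(\mathbf{H}\) is \(n \times n\) and that \(\mathbf{H}\mathbf{y}\) reproduces the fitted values \(\mathbf{X}\hat{\boldsymbol{\beta}}\):

```python
import numpy as np

# Hypothetical data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(H.shape)                               # (5, 5): H is n x n
print(np.allclose(H @ y, X @ beta_hat))      # True: Hy gives the fitted values
```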

Residuals

Recall that the residuals are the differences between the observed and predicted values

\[ \begin{aligned} \mathbf{e} &= \mathbf{y} - \hat{\mathbf{y}}\\[10pt] &= \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} \\[10pt] &= \mathbf{y} - \mathbf{H}\mathbf{y} \\[10pt] &= (\mathbf{I} - \mathbf{H})\mathbf{y} \end{aligned} \]
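
As a final illustrative check (same kind of hypothetical data as the earlier sketches), the two expressions for the residuals agree:

```python
import numpy as np

# Hypothetical data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix

e = y - H @ y                                # e = y - y_hat
e_alt = (np.eye(len(y)) - H) @ y             # e = (I - H) y

print(np.allclose(e, e_alt))                 # True: the two forms are identical
```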

Recap

  • Introduced matrix representation for simple linear regression

    • Model form
    • Least squares estimate
    • Predicted (fitted) values
    • Residuals

For next class