Jan 21, 2025
Lab 01 due TODAY at 11:59pm
Push work to GitHub repo
Submit final PDF on Gradescope + mark pages for each question
HW 01 will be assigned on Thursday
Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)
\[ RMSE = \sqrt{\frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{n}} = \sqrt{\frac{\sum_{i=1}^ne_i^2}{n}} \]
R-squared, \(R^2\) : Percentage of variability in the outcome explained by the regression model (in the context of SLR, the predictor)
\[R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST}\]
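Both metrics are straightforward to compute directly. A minimal NumPy sketch, using small hypothetical observed and predicted values purely for illustration:

```python
import numpy as np

# Hypothetical toy data, for illustration only
y = np.array([3.0, 5.0, 7.0, 9.0])       # observed responses
y_hat = np.array([2.5, 5.5, 6.5, 9.5])   # predicted responses

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean((y - y_hat) ** 2))

# R^2 = 1 - SSR / SST
ssr = np.sum((y - y_hat) ** 2)            # residual sum of squares
sst = np.sum((y - np.mean(y)) ** 2)       # total sum of squares
r_squared = 1 - ssr / sst

print(rmse, r_squared)
```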
📋 sta221-sp25.netlify.app/ae/ae-01-model-assessment.html
Open ae-01 from last class. Complete Part 2.
When we have a quantitative response, \(Y\), and a single quantitative predictor, \(X\), we can use a simple linear regression model to describe the relationship between \(Y\) and \(X\).
\[Y = \beta_0 + \beta_1 X + \epsilon\]
The simple linear regression model can be represented using vectors and matrices as
\[ \large{\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}} \]
\(\mathbf{y}\) : Vector of responses
\(\mathbf{X}\): Design matrix (columns for predictors + intercept)
\(\boldsymbol{\beta}\): Vector of model coefficients
\(\boldsymbol{\epsilon}\): Vector of error terms centered at \(\mathbf{0}\) with variance \(\sigma^2_{\epsilon}\mathbf{I}\)
\[ \underbrace{ \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} }_ {\mathbf{y}} \hspace{3mm} = \hspace{3mm} \underbrace{ \begin{bmatrix} 1 &x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} }_{\mathbf{X}} \hspace{2mm} \underbrace{ \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} }_{\boldsymbol{\beta}} \hspace{3mm} + \hspace{3mm} \underbrace{ \begin{bmatrix} \epsilon_1 \\ \vdots\\ \epsilon_n \end{bmatrix} }_\boldsymbol{\epsilon} \]
What are the dimensions of \(\mathbf{y}\), \(\mathbf{X}\), \(\boldsymbol{\beta}\), and \(\boldsymbol{\epsilon}\)?
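One way to check your answer is to build the design matrix for a few hypothetical observations and inspect its shape:

```python
import numpy as np

# Hypothetical predictor values for n = 5 observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = x.size

# Design matrix: a column of ones (intercept) next to the predictor
X = np.column_stack([np.ones(n), x])

print(X.shape)   # one row per observation, one column per coefficient
```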
Goal: Find estimator \(\hat{\boldsymbol{\beta}}= \begin{bmatrix}\hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix}\) that minimizes the sum of squared errors \[ \sum_{i=1}^n \epsilon_i^2 = \boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\mathsf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]
Let \(\mathbf{x} = \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\x_k\end{bmatrix}\) be a \(k \times 1\) vector and \(f(\mathbf{x})\) be a function of \(\mathbf{x}\).
Then \(\nabla_\mathbf{x}f\), the gradient of \(f\) with respect to \(\mathbf{x}\), is
\[ \nabla_\mathbf{x}f = \begin{bmatrix}\frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_k}\end{bmatrix} \]
Let \(\mathbf{x}\) be a \(k \times 1\) vector and \(\mathbf{z}\) be a \(k \times 1\) vector, such that \(\mathbf{z}\) is not a function of \(\mathbf{x}\).
The gradient of \(\mathbf{x}^\mathsf{T}\mathbf{z}\) with respect to \(\mathbf{x}\) is
\[ \nabla_\mathbf{x} \hspace{1mm} \mathbf{x}^\mathsf{T}\mathbf{z} = \mathbf{z} \]
\[ \begin{aligned} \mathbf{x}^\mathsf{T}\mathbf{z} &= \class{fragment}{\begin{bmatrix}x_1 & x_2 & \dots &x_k\end{bmatrix} \begin{bmatrix}z_1 \\ z_2 \\ \vdots \\z_k\end{bmatrix}} \\[10pt] &\class{fragment}{= x_1z_1 + x_2z_2 + \dots + x_kz_k} \\ &\class{fragment}{= \sum_{i=1}^k x_iz_i} \end{aligned} \]
\[ \nabla_\mathbf{x}\hspace{1mm}\mathbf{x}^\mathsf{T}\mathbf{z} = \class{fragment}{\begin{bmatrix}\frac{\partial \mathbf{x}^\mathsf{T}\mathbf{z}}{\partial x_1} \\ \frac{\partial \mathbf{x}^\mathsf{T}\mathbf{z}}{\partial x_2} \\ \vdots \\ \frac{\partial \mathbf{x}^\mathsf{T}\mathbf{z}}{\partial x_k}\end{bmatrix}} = \class{fragment}{\begin{bmatrix}\frac{\partial}{\partial x_1} (x_1z_1 + x_2z_2 + \dots + x_kz_k) \\ \frac{\partial}{\partial x_2} (x_1z_1 + x_2z_2 + \dots + x_kz_k)\\ \vdots \\ \frac{\partial}{\partial x_k} (x_1z_1 + x_2z_2 + \dots + x_kz_k)\end{bmatrix}} = \class{fragment}{\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_k\end{bmatrix} = \mathbf{z}} \]
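The identity \(\nabla_\mathbf{x}\,\mathbf{x}^\mathsf{T}\mathbf{z} = \mathbf{z}\) can also be checked numerically with a central finite-difference gradient. A sketch with randomly generated \(\mathbf{x}\) and \(\mathbf{z}\):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
x = rng.normal(size=k)
z = rng.normal(size=k)     # fixed vector, not a function of x

f = lambda v: v @ z        # f(x) = x^T z

# Central finite-difference approximation of the gradient
h = 1e-6
grad = np.array([
    (f(x + h * np.eye(k)[i]) - f(x - h * np.eye(k)[i])) / (2 * h)
    for i in range(k)
])

# Matches the identity grad(x^T z) = z
print(np.allclose(grad, z, atol=1e-6))
```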
Let \(\mathbf{x}\) be a \(k \times 1\) vector and \(\mathbf{A}\) be a \(k \times k\) matrix, such that \(\mathbf{A}\) is not a function of \(\mathbf{x}\).
Then the gradient of \(\mathbf{x}^\mathsf{T}\mathbf{A}\mathbf{x}\) with respect to \(\mathbf{x}\) is
\[ \nabla_\mathbf{x} \hspace{1mm} \mathbf{x}^\mathsf{T}\mathbf{A}\mathbf{x} = (\mathbf{A}\mathbf{x} + \mathbf{A}^\mathsf{T} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^\mathsf{T})\mathbf{x} \]
If \(\mathbf{A}\) is symmetric, then
\[ (\mathbf{A} + \mathbf{A}^\mathsf{T})\mathbf{x} = 2\mathbf{A}\mathbf{x} \]
Proof in HW 01.
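The same finite-difference check works for the quadratic form, using a randomly generated (not necessarily symmetric) \(\mathbf{A}\):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
x = rng.normal(size=k)
A = rng.normal(size=(k, k))          # not necessarily symmetric

f = lambda v: v @ A @ v              # f(x) = x^T A x

# Central finite-difference approximation of the gradient
h = 1e-5
grad = np.array([
    (f(x + h * np.eye(k)[i]) - f(x - h * np.eye(k)[i])) / (2 * h)
    for i in range(k)
])

# Matches (A + A^T) x
print(np.allclose(grad, (A + A.T) @ x, atol=1e-5))
```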
Find \(\hat{\boldsymbol{\beta}}\) that minimizes
\[ \begin{aligned} \boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\mathsf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\[10pt] &= (\mathbf{y}^\mathsf{T} - \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T})(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\\[10pt] &=\mathbf{y}^\mathsf{T}\mathbf{y} - \mathbf{y}^\mathsf{T}\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} + \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}\\[10pt] &=\mathbf{y}^\mathsf{T}\mathbf{y} - 2\boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} + \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta} \end{aligned} \]
\[\begin{aligned}
\nabla_{\boldsymbol{\beta}}\boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} &= \nabla_{\boldsymbol{\beta}}( \mathbf{y}^\mathsf{T}\mathbf{y} - 2\boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} + \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}) \\[10pt]
& = -2\mathbf{X}^\mathsf{T}\mathbf{y} + 2\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}
\end{aligned}
\]
Find \(\hat{\boldsymbol{\beta}}\) that satisfies
\[ -2\mathbf{X}^\mathsf{T}\mathbf{y} + 2\mathbf{X}^\mathsf{T}\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0} \]
Assuming \(\mathbf{X}^\mathsf{T}\mathbf{X}\) is invertible,
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y}\]
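A sketch of the closed-form estimator on hypothetical data simulated from a known line, cross-checked against NumPy's least squares routine:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=n)   # hypothetical data

X = np.column_stack([np.ones(n), x])

# Solve the normal equations X^T X beta = X^T y
# (np.linalg.solve is preferred over forming an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least squares
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))
```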
The Hessian matrix, \(\nabla_\mathbf{x}^2f\), is a \(k \times k\) matrix of second-order partial derivatives
\[ \nabla_{\mathbf{x}}^2f = \begin{bmatrix} \frac{\partial^2f}{\partial x_1^2} & \frac{\partial^2f}{\partial x_1 \partial x_2} & \dots & \frac{\partial^2f}{\partial x_1\partial x_k} \\ \frac{\partial^2f}{\partial x_2 \partial x_1} & \frac{\partial^2f}{\partial x_2^2} & \dots & \frac{\partial^2f}{\partial x_2 \partial x_k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2f}{\partial x_k\partial x_1} & \frac{\partial^2f}{\partial x_k\partial x_2} & \dots & \frac{\partial^2f}{\partial x_k^2} \end{bmatrix} \]
If the Hessian matrix is…
positive-definite, then we have found a minimum.
negative-definite, then we have found a maximum.
neither positive-definite nor negative-definite, then we have found a saddle point.
\[ \begin{aligned} \nabla^2_{\boldsymbol{\beta}} \boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} &= \nabla_{\boldsymbol{\beta}} (-2\mathbf{X}^\mathsf{T}\mathbf{y} + 2\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}) \\[10pt] &{=-2\nabla_{\boldsymbol{\beta}}(\mathbf{X}^\mathsf{T}\mathbf{y}) + 2\nabla_{\boldsymbol{\beta}}(\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta})} \\[10pt] &{\propto \mathbf{X}^\mathsf{T}\mathbf{X}} \end{aligned} \]
Show that \(\mathbf{X}^\mathsf{T}\mathbf{X}\) is positive definite in HW 01.
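Without spoiling the proof, a numerical sanity check on a hypothetical design matrix: a symmetric matrix is positive definite iff all its eigenvalues are positive.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = rng.uniform(size=n)
X = np.column_stack([np.ones(n), x])

XtX = X.T @ X

# Eigenvalues of the symmetric matrix X^T X
eigvals = np.linalg.eigvalsh(XtX)
print(np.all(eigvals > 0))
```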
Now that we have \(\hat{\boldsymbol{\beta}}\), let’s predict values of \(\mathbf{y}\) using the model
\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \underbrace{\mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}}_{\mathbf{H}}\mathbf{y} = \mathbf{H}\mathbf{y} \]
Hat matrix: \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\)
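Two useful facts to verify numerically (again on hypothetical data): \(\mathbf{H}\) is symmetric and idempotent, and it leaves the columns of \(\mathbf{X}\) unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
x = rng.uniform(size=n)
X = np.column_stack([np.ones(n), x])

# Hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T

# H is symmetric, idempotent (H H = H), and H X = X
print(np.allclose(H, H.T), np.allclose(H @ H, H), np.allclose(H @ X, X))
```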
Recall that the residuals are the difference between the observed and predicted values
\[ \begin{aligned} \mathbf{e} &= \mathbf{y} - \hat{\mathbf{y}}\\[10pt] &\class{fragment}{ = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}} \\[10pt] &\class{fragment}{ = \mathbf{y} - \mathbf{H}\mathbf{y}} \\[20pt] \class{fragment}{\color{#993399}{\mathbf{e}}} &\class{fragment}{\color{#993399}{=(\mathbf{I} - \mathbf{H})\mathbf{y}}} \\[10pt] \end{aligned} \]
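The identity \(\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}\) can be confirmed on simulated data, along with the fact that the residuals are orthogonal to the columns of \(\mathbf{X}\):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25
x = rng.uniform(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)   # hypothetical data
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = (np.eye(n) - H) @ y                 # residuals via the hat matrix

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(e, y - X @ beta_hat))  # same as y - y_hat

# Residuals are orthogonal to the columns of X
print(np.allclose(X.T @ e, 0))
```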
Introduced matrix representation for simple linear regression