Class 19: Evaluating linear models

Methodology of Scientific Research

Andrés Aravena, PhD

April 12, 2023

Squared error

We have experimental values \(y_i\) that we want to approximate with a straight line \(β_0 + β_1 x_i\)

The line is not a perfect model. The error it makes is \[y_i - β_0 - β_1 x_i\] This can be positive or negative. We want a quantity that is never negative, so we square it \[(y_i - β_0 - β_1 x_i)^2\]

Mean Square Error

There are \(n\) experimental values. The Mean Squared Error is \[MSE=\frac{1}{n}\sum_i (y_i - β_0 - β_1 x_i)^2\]

We can calculate it directly in the spreadsheet

Please calculate it for the Brain Cancer data

Compare it with the variance of time
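The same calculation can be sketched outside the spreadsheet. The block below is a minimal illustration in Python, not the course's own method: the data are made-up toy numbers (not the Brain Cancer data), and the coefficients `beta0` and `beta1` are assumed to have been fitted beforehand.

```python
# Toy data, invented for illustration only (NOT the Brain Cancer data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Assumed coefficients of a previously fitted line (hypothetical values).
beta0 = 2.2
beta1 = 0.6

# Mean Squared Error: average of the squared errors of the line.
mse = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys)) / n

# Variance of y, dividing by n to match the 1/n in the MSE formula.
mean_y = sum(ys) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n

print("MSE of the line:", mse)      # about 0.48
print("variance of y:  ", var_y)    # about 1.2
```

For these toy numbers the MSE of the line is smaller than the variance of \(y\), which is the comparison the exercise asks for.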

Evaluating the Mean Square Error

\[MSE=\frac{1}{n}\sum_i (y_i - β_0 - β_1 x_i)^2\] Replacing \(β_0 = \overline{𝐲} - β_1 \overline{𝐱}\) we have \[MSE=\frac{1}{n}\sum_i (y_i - \overline{𝐲} + β_1 \overline{𝐱} - β_1 x_i)^2\] which we rewrite as \[MSE=\frac{1}{n}\sum_i ((y_i - \overline{𝐲}) - β_1 (x_i- \overline{𝐱}))^2\]

Expanding the parenthesis

\[\begin{aligned} MSE &=\frac{1}{n}\sum_i ((y_i - \overline{𝐲})^2 - 2 β_1(y_i - \overline{𝐲})(x_i- \overline{𝐱}) + β_1^2 (x_i- \overline{𝐱})^2)\\ &=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2 - 2 \frac{β_1}{n}\sum_i(y_i - \overline{𝐲})(x_i- \overline{𝐱}) + \frac{β_1^2}{n}\sum_i(x_i- \overline{𝐱})^2\\ \end{aligned}\] We recognize some terms \[MSE=\text{var}(𝐲)-2 β_1\text{cov}(𝐱,𝐲) + β_1^2 \text{var}(𝐱)\]

Replacing \(β_1\)

We have \(MSE=\text{var}(𝐲)-2 β_1\text{cov}(𝐱,𝐲) + β_1^2 \text{var}(𝐱)\)

Replacing \(β_1=\frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\) we get \[\begin{aligned} MSE&=\text{var}(𝐲)-2\frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\text{cov}(𝐱,𝐲) + \left(\frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\right)^2 \text{var}(𝐱)\\ &=\text{var}(𝐲)-2\frac{\text{cov}^2(𝐱, 𝐲)}{\text{var}(𝐱)} + \frac{\text{cov}^2 (𝐱, 𝐲)}{\text{var}(𝐱)}\\ &=\text{var}(𝐲) - \frac{\text{cov}^2 (𝐱, 𝐲)}{\text{var}(𝐱)}\\ \end{aligned}\]
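The result of this derivation can be checked numerically. The sketch below uses made-up toy data (not the Brain Cancer data) and computes the MSE twice: once directly from the squared errors, and once as \(\text{var}(𝐲) - \text{cov}^2(𝐱,𝐲)/\text{var}(𝐱)\). It assumes that variance and covariance are both divided by \(n\), matching the \(1/n\) in the MSE.

```python
# Toy data, invented for illustration only (NOT the Brain Cancer data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Population versions (divide by n), consistent with the MSE definition.
var_x = sum((x - mean_x) ** 2 for x in xs) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Coefficients used in the derivation.
beta1 = cov_xy / var_x
beta0 = mean_y - beta1 * mean_x

# MSE computed directly from the squared errors.
mse_direct = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys)) / n

# MSE computed from the closed formula.
mse_formula = var_y - cov_xy ** 2 / var_x

print(mse_direct, mse_formula)  # both about 0.48
```

If the variance is divided by \(n-1\) instead of \(n\) while the MSE is divided by \(n\), the two numbers no longer match exactly.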

Interpretation

We have two ways to summarize our dataset

  • Model 0. With a single value: the average
  • Model 1. With two values: a straight line

Each one has its own Mean Squared Error

  • \(\text{MSE}_0\) for model 0
  • \(\text{MSE}_1\) for model 1

What are the values of \(\text{MSE}_0\) and \(\text{MSE}_1\)?

This time instead of

\[\text{MSE}_0=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2=\text{var}(𝐲)\] we have \[\text{MSE}_1=\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}\]

Since \(\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}\) is never negative, \(\text{MSE}_1 \le \text{MSE}_0\). The mean squared error of model 1 is never worse than that of model 0

Relative improvement

The mean squared error of the new model is smaller, but by how much? \[\frac{\text{MSE}_0-\text{MSE}_1}{\text{MSE}_0}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)}\] This number represents the fraction of the original variance that is explained by the new model

The name of this number is \(R^2\)

Does it sound familiar?

Correlation coefficient

The Pearson correlation coefficient between two variables is \[r=\frac{\text{cov}(𝐱,𝐲)}{\text{sdev}(𝐱)\text{sdev}(𝐲)}\] so we have in this case that \[R^2 = r^2\] This is valid for linear models with a single independent variable. It will not be valid for larger models
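The equality \(R^2 = r^2\) can be illustrated with a short Python sketch. The data are made-up toy numbers (not the Brain Cancer data), and all names are chosen here for illustration. \(R^2\) is computed from the two mean squared errors and from variance and covariance, then compared with the square of \(r\).

```python
import math

# Toy data, invented for illustration only (NOT the Brain Cancer data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Model 0: the average. Its mean squared error is the variance of y.
mse0 = var_y

# Model 1: the straight line with the least-squares coefficients.
beta1 = cov_xy / var_x
beta0 = mean_y - beta1 * mean_x
mse1 = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys)) / n

# R^2 computed in two ways.
r2_from_mse = (mse0 - mse1) / mse0
r2_from_cov = cov_xy ** 2 / (var_x * var_y)

# Pearson correlation coefficient.
r = cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))

print(r2_from_mse, r2_from_cov, r ** 2)  # all about 0.6
```

The three printed values agree up to rounding, as expected for a model with a single independent variable.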

Interpretation of \(R^2\)

\[R^2=\frac{\text{MSE}_0-\text{MSE}_1}{\text{MSE}_0}=\frac{\text{var}(𝐲)-\text{MSE}_1}{\text{var}(𝐲)}\]

Here \(\text{MSE}_0\) is the mean squared error of the simple model (the average), which is the variance of \(𝐲\)

\(\text{MSE}_1\) is the mean squared error of the advanced model (the straight line)

\(R^2\) is the fraction of the variance that is removed by the advanced model

Doing the math

\[\begin{aligned} R^2 &=\frac{\text{MSE}_0-\text{MSE}_1}{\text{MSE}_0}\\ &=\frac{\text{var}(𝐲)-(\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)})}{\text{var}(𝐲)}\\ &=\frac{\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}}{\text{var}(𝐲)}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)} \end{aligned}\]

Practice

Calculate these values using the Brain Cancer data

  • \(R^2\) using sum of squared errors
  • \(R^2\) using variance and covariance
  • \(r\) using the CORREL() function
  • Compare \(R^2\) against \(r^2\)