We have experimental values \(y_i\) that we want to approximate with a straight line \(β_0 + β_1 x_i\)

The line is not a perfect model. The *error* it makes is \[y_i - β_0 - β_1 x_i\] This can be positive
or negative. We want every term to be non-negative, so that errors do not cancel each other. So we square it \[(y_i - β_0 - β_1 x_i)^2\]

There are \(n\) experimental values.
The *Mean Squared Error* is \[MSE=\frac{1}{n}\sum_i (y_i - β_0 - β_1
x_i)^2\]
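The definition above can also be checked outside the spreadsheet. The following is a minimal Python sketch; the function name `mean_squared_error` and the four data points are made up for illustration and are not the *Brain Cancer* data.

```python
def mean_squared_error(x, y, beta0, beta1):
    """Average of the squared errors y_i - beta0 - beta1 * x_i."""
    n = len(y)
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y)) / n

# Hypothetical toy data lying exactly on the line y = 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

mse_exact = mean_squared_error(x, y, 0.0, 2.0)  # line through every point
mse_flat = mean_squared_error(x, y, 5.0, 0.0)   # horizontal line at the mean of y

print(mse_exact, mse_flat)
```

The horizontal line at the mean has a larger error than the line through the points, which is the comparison developed in the rest of this section.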

We can calculate it directly in the spreadsheet

Please calculate it for the *Brain Cancer* data

Compare it with the variance of *time*

\[MSE=\frac{1}{n}\sum_i (y_i - β_0 - β_1 x_i)^2\] Replacing \(β_0 = \overline{𝐲} - β_1 \overline{𝐱}\) we have \[MSE=\frac{1}{n}\sum_i (y_i - \overline{𝐲} + β_1 \overline{𝐱} - β_1 x_i)^2\] which we rewrite as \[MSE=\frac{1}{n}\sum_i ((y_i - \overline{𝐲}) - β_1 (x_i- \overline{𝐱}))^2\]

\[\begin{aligned} MSE &=\frac{1}{n}\sum_i ((y_i - \overline{𝐲})^2 - 2 β_1(y_i - \overline{𝐲})(x_i- \overline{𝐱}) + β_1^2 (x_i- \overline{𝐱})^2)\\ &=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2 - 2 \frac{β_1}{n}\sum_i(y_i - \overline{𝐲})(x_i- \overline{𝐱}) + \frac{β_1^2}{n}\sum_i(x_i- \overline{𝐱})^2\\ \end{aligned}\] We recognize some terms \[MSE=\text{var}(𝐲)-2 β_1\text{cov}(𝐱,𝐲) + β_1^2 \text{var}(𝐱)\]

We have \(MSE=\text{var}(𝐲)-2 β_1\text{cov}(𝐱,𝐲) + β_1^2 \text{var}(𝐱)\)

Replacing \(β_1=\frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\) we get \[\begin{aligned} MSE&=\text{var}(𝐲)-2\frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\text{cov}(𝐱,𝐲) + \left(\frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\right)^2 \text{var}(𝐱)\\ &=\text{var}(𝐲)-2\frac{\text{cov}^2(𝐱, 𝐲)}{\text{var}(𝐱)} + \frac{\text{cov}^2 (𝐱, 𝐲)}{\text{var}(𝐱)}\\ &=\text{var}(𝐲) - \frac{\text{cov}^2 (𝐱, 𝐲)}{\text{var}(𝐱)}\\ \end{aligned}\]
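The closed form can be checked numerically. The sketch below assumes the same \(\frac{1}{n}\) convention as the formulas above (population variance and covariance); the helper names `mean` and `cov` and the data are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Population covariance (divides by n); cov(u, u) is the variance of u."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

# Coefficients used in the derivation
beta1 = cov(x, y) / cov(x, x)
beta0 = mean(y) - beta1 * mean(x)

# MSE from its definition, and from the closed form var(y) - cov^2 / var(x)
mse_direct = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y)) / len(y)
mse_formula = cov(y, y) - cov(x, y) ** 2 / cov(x, x)

print(mse_direct, mse_formula)
```

Both numbers agree up to floating-point rounding. If a tool divides by \(n-1\) instead of \(n\), the same divisor has to be used in the variance, the covariance and the MSE for the identity to hold.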

We have two ways to summarize our dataset

- Model 0. With a single value: the average
- Model 1. With two values: a straight line

Each one has its own Mean Squared Error

- \(\text{MSE}_0\) for model 0
- \(\text{MSE}_1\) for model 1

**What are the values of \(\text{MSE}_0\) and \(\text{MSE}_1\)?**

\[\text{MSE}_0=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2=\text{var}(𝐲)\] we have \[\text{MSE}_1=\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}\]

This already shows that the mean squared error of model 1 is never larger than that of model 0, because the term \(\text{cov}^2(𝐱,𝐲)/\text{var}(𝐱)\) cannot be negative
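A small Python sketch of this comparison, fitting both models to hypothetical toy data (not the *Brain Cancer* data). The variable names are illustrative only.

```python
# Hypothetical toy data
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 6.0]
n = len(y)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Model 0: every value is predicted by the average of y
mse0 = sum((yi - mean_y) ** 2 for yi in y) / n

# Model 1: straight line with beta1 = cov(x, y) / var(x)
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
var_x = sum((xi - mean_x) ** 2 for xi in x) / n
beta1 = cov_xy / var_x
beta0 = mean_y - beta1 * mean_x
mse1 = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y)) / n

print(mse0, mse1)
```

Here `mse0` equals the variance of `y`, and `mse1` is smaller by `cov_xy ** 2 / var_x`.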

The mean squared error of the new model is smaller, but by how much? \[\frac{\text{MSE}_0-\text{MSE}_1}{\text{MSE}_0}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)}\] This number represents the fraction of the original variance that is explained by the new model

The name of this number is \(R^2\)

**Does it sound familiar?**

The *Pearson correlation coefficient* between two variables is
\[r=\frac{\text{cov}(𝐱,𝐲)}{\text{sdev}(𝐱)\text{sdev}(𝐲)}\]
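Since \(\text{sdev}^2(𝐱)=\text{var}(𝐱)\) and \(\text{sdev}^2(𝐲)=\text{var}(𝐲)\), squaring \(r\) gives \(\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)}\),
so we have **in this case** that \[R^2 = r^2\] This is valid for linear
models with a single independent variable. It will not be valid for
larger models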
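The equality \(R^2 = r^2\) for a single independent variable can be illustrated with a short Python sketch. The data and helper names are hypothetical, and the \(\frac{1}{n}\) convention is assumed throughout.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Population covariance (divides by n); cov(u, u) is the variance of u."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]

# R^2 from the two mean squared errors
beta1 = cov(x, y) / cov(x, x)
beta0 = mean(y) - beta1 * mean(x)
mse0 = cov(y, y)
mse1 = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y)) / len(y)
r_squared = (mse0 - mse1) / mse0

# Pearson correlation coefficient from covariance and standard deviations
r = cov(x, y) / (sqrt(cov(x, x)) * sqrt(cov(y, y)))

print(r_squared, r ** 2)
```

The two printed values match up to rounding, which is the same check requested later in the exercise with `CORREL()`.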

\[R^2=\frac{\text{MSE}_0-\text{MSE}_1}{\text{MSE}_0}=\frac{\text{var}(𝐲)-\text{MSE}_1}{\text{var}(𝐲)}\]

Here \(\text{MSE}_0\) is the variance with a simple model

\(\text{MSE}_1\) is the variance with an advanced model

\(R^2\) is the fraction of the variance removed by the advanced model

\[\begin{aligned} R^2 &=\frac{\text{MSE}_0-\text{MSE}_1}{MSE_0}\\ &=\frac{\text{var}(𝐲)-(\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)})}{\text{var}(𝐲)}\\ &=\frac{\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}}{\text{var}(𝐲)}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)} \end{aligned}\]

Calculate these values using *Brain Cancer* data

- \(R^2\) using sum of squared errors
- \(R^2\) using variance and covariance
- \(r\) using the `CORREL()` function
- Compare \(R^2\) against \(r^2\)