We have experimental values \(y_i\) that we want to approximate with a straight line \(β_0 + β_1 x_i\)
The line is not a perfect model. The error it makes for observation \(i\) is \[y_i - β_0 - β_1 x_i\] This can be positive or negative. We want every term to be non-negative, so that errors of opposite sign do not cancel. So we square it \[(y_i - β_0 - β_1 x_i)^2\]
There are \(n\) experimental values. The Mean Squared Error is \[MSE=\frac{1}{n}\sum_i (y_i - β_0 - β_1 x_i)^2\]
We can calculate it directly in the spreadsheet
Please calculate it for the Brain Cancer data
Compare it with the variance of time
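Outside a spreadsheet, the same calculation can be sketched in a few lines of Python. This is only an illustration: the data are invented toy values (not the Brain Cancer data), and the helper names `mse` and `variance` are hypothetical.

```python
# Toy data, invented for illustration (NOT the Brain Cancer data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def mse(beta0, beta1, x, y):
    """Mean of the squared errors (y_i - beta0 - beta1 * x_i)^2."""
    n = len(y)
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y)) / n

def variance(v):
    """Population variance (divides by n, like the MSE)."""
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

# Any candidate line can be scored; beta0 = 0, beta1 = 2 is an arbitrary guess.
mse_guess = mse(0.0, 2.0, x, y)

# A flat line at the mean of y (beta1 = 0) has an MSE equal to var(y).
mean_y = sum(y) / len(y)
mse_flat = mse(mean_y, 0.0, x, y)
var_y = variance(y)

print(mse_guess, mse_flat, var_y)
```

Comparing `mse_guess` with `var_y` mirrors the exercise above: a useful line should have an MSE well below the variance of the values it approximates.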
\[MSE=\frac{1}{n}\sum_i (y_i - β_0 - β_1 x_i)^2\] Replacing \(β_0 = \overline{𝐲} - β_1 \overline{𝐱}\) we have \[MSE=\frac{1}{n}\sum_i (y_i - \overline{𝐲} + β_1 \overline{𝐱} - β_1 x_i)^2\] which we rewrite as \[MSE=\frac{1}{n}\sum_i ((y_i - \overline{𝐲}) - β_1 (x_i- \overline{𝐱}))^2\]
\[\begin{aligned} MSE &=\frac{1}{n}\sum_i ((y_i - \overline{𝐲})^2 - 2 β_1(y_i - \overline{𝐲})(x_i- \overline{𝐱}) + β_1^2 (x_i- \overline{𝐱})^2)\\ &=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2 - 2 \frac{β_1}{n}\sum_i(y_i - \overline{𝐲})(x_i- \overline{𝐱}) + \frac{β_1^2}{n}\sum_i(x_i- \overline{𝐱})^2\\ \end{aligned}\] We recognize some terms \[MSE=\text{var}(𝐲)-2 β_1\text{cov}(𝐱,𝐲) + β_1^2 \text{var}(𝐱)\]
We have \(MSE=\text{var}(𝐲)-2 β_1\text{cov}(𝐱,𝐲) + β_1^2 \text{var}(𝐱)\)
Replacing \(β_1=\frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\) we get \[\begin{aligned} MSE&=\text{var}(𝐲)-2\frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\text{cov}(𝐱,𝐲) + \left(\frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\right)^2 \text{var}(𝐱)\\ &=\text{var}(𝐲)-2\frac{\text{cov}^2(𝐱, 𝐲)}{\text{var}(𝐱)} + \frac{\text{cov}^2 (𝐱, 𝐲)}{\text{var}(𝐱)}\\ &=\text{var}(𝐲) - \frac{\text{cov}^2 (𝐱, 𝐲)}{\text{var}(𝐱)}\\ \end{aligned}\]
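The closed form can be checked numerically. The sketch below assumes invented toy data (not the Brain Cancer data) and hypothetical helpers `mean` and `cov`, with variance and covariance both divided by \(n\) to match the MSE definition.

```python
# Toy data, invented for illustration (NOT the Brain Cancer data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Population covariance (divides by n); cov(a, a) is the variance."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# Coefficients used in the derivation.
beta1 = cov(x, y) / cov(x, x)
beta0 = mean(y) - beta1 * mean(x)

# MSE computed directly from the residuals ...
mse_direct = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y)) / n
# ... and from the closed form var(y) - cov^2(x, y) / var(x).
mse_formula = cov(y, y) - cov(x, y) ** 2 / cov(x, x)

print(mse_direct, mse_formula)  # the two values should agree up to rounding
```

Note that spreadsheet and library functions often divide by \(n-1\) instead of \(n\); the identity holds as long as variance and covariance use the same divisor as the MSE.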
We have two ways to summarize our dataset: model 0, the mean \(\overline{𝐲}\) alone, and model 1, the line \(β_0 + β_1 x_i\)
Each one has its own Mean Squared Error
What are the values of \(\text{MSE}_0\) and \(\text{MSE}_1\)?
\[\text{MSE}_0=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2=\text{var}(𝐲)\] and, for the line, \[\text{MSE}_1=\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}\]
Since \(\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}\) is never negative, this already shows that the mean squared error of model 1 is never larger than that of model 0
The mean squared error of the new model is smaller, but by how much? \[\frac{\text{MSE}_0-\text{MSE}_1}{\text{MSE}_0}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)}\] This number represents the fraction of the original variance that is explained by the new model
The name of this number is \(R^2\)
Does it sound familiar?
The Pearson correlation coefficient between two variables is \[r=\frac{\text{cov}(𝐱,𝐲)}{\text{sdev}(𝐱)\text{sdev}(𝐲)}\] so we have in this case that \[R^2 = r^2\] This is valid for linear models with a single independent variable. It will not be valid for larger models
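A quick numerical check of \(R^2 = r^2\) for a single independent variable is sketched below. The data are invented toy values (not the Brain Cancer data) and the helper names are hypothetical; variance, covariance and standard deviation all divide by \(n\).

```python
import math

# Toy data, invented for illustration (NOT the Brain Cancer data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Population covariance (divides by n); cov(a, a) is the variance."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# MSE of model 0 (mean only) and model 1 (fitted line), from the closed forms.
mse0 = cov(y, y)
mse1 = cov(y, y) - cov(x, y) ** 2 / cov(x, x)

# Fraction of the variance explained by the line.
r_squared = (mse0 - mse1) / mse0

# Pearson correlation: covariance divided by both standard deviations.
r = cov(x, y) / (math.sqrt(cov(x, x)) * math.sqrt(cov(y, y)))

print(r_squared, r ** 2)  # the two values should agree up to rounding
```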
\[R^2=\frac{\text{MSE}_0-\text{MSE}_1}{\text{MSE}_0}=\frac{\text{var}(𝐲)-\text{MSE}_1}{\text{var}(𝐲)}\]
Here \(\text{MSE}_0\) is the mean squared error of the simple model (the mean alone), which equals \(\text{var}(𝐲)\)
\(\text{MSE}_1\) is the mean squared error of the advanced model (the line)
\(R^2\) is the fraction of the variance removed by the advanced model
\[\begin{aligned} R^2 &=\frac{\text{MSE}_0-\text{MSE}_1}{MSE_0}\\ &=\frac{\text{var}(𝐲)-(\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)})}{\text{var}(𝐲)}\\ &=\frac{\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}}{\text{var}(𝐲)}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)} \end{aligned}\]
Calculate these values using Brain Cancer data
The spreadsheet function CORREL() gives the Pearson correlation \(r\) directly