# Methodology of Scientific Research

## Squared error

We have experimental values $$y_i$$ that we want to approximate with a straight line $$β_0 + β_1 x_i$$

The line is not a perfect model. The error it makes for each point is $y_i - β_0 - β_1 x_i$ This error can be positive or negative. We want every error to count as a positive quantity, so we square it $(y_i - β_0 - β_1 x_i)^2$

## Mean Square Error

There are $$n$$ experimental values. The Mean Squared Error is $MSE=\frac{1}{n}\sum_i (y_i - β_0 - β_1 x_i)^2$

We can calculate it directly in the spreadsheet

Please calculate it for the Brain Cancer data

Compare it with the variance of time
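Outside the spreadsheet, the same calculation can be sketched in Python. This is a minimal sketch on made-up toy numbers, not the Brain Cancer data. The function names `mse_line` and `variance`, and the coefficients `beta0 = 2.2` and `beta1 = 0.6`, are assumptions chosen only for illustration.

```python
def mse_line(x, y, beta0, beta1):
    """Mean squared error of the line beta0 + beta1 * x over the data."""
    n = len(x)
    squared_errors = [(yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y)]
    return sum(squared_errors) / n


def variance(values):
    """Population variance (divides by n, like VAR.P in a spreadsheet)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


# Hypothetical toy data, used only to illustrate the formula
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

mse = mse_line(x, y, beta0=2.2, beta1=0.6)
var_y = variance(y)
print(f"MSE = {mse:.3f}, var(y) = {var_y:.3f}")
```

For these toy numbers the MSE of the line (0.48) is smaller than the variance of the $$y$$ values (1.2).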

## Evaluating the Mean Square Error

$MSE=\frac{1}{n}\sum_i (y_i - β_0 - β_1 x_i)^2$ Replacing $$β_0 = \overline{𝐲} - β_1 \overline{𝐱}$$ we have $MSE=\frac{1}{n}\sum_i (y_i - \overline{𝐲} + β_1 \overline{𝐱} - β_1 x_i)^2$ which we rewrite as $MSE=\frac{1}{n}\sum_i ((y_i - \overline{𝐲}) - β_1 (x_i- \overline{𝐱}))^2$

## Expanding the parenthesis

\begin{aligned} MSE &=\frac{1}{n}\sum_i ((y_i - \overline{𝐲})^2 - 2 β_1(y_i - \overline{𝐲})(x_i- \overline{𝐱}) + β_1^2 (x_i- \overline{𝐱})^2)\\ &=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2 - 2 \frac{β_1}{n}\sum_i(y_i - \overline{𝐲})(x_i- \overline{𝐱}) + \frac{β_1^2}{n}\sum_i(x_i- \overline{𝐱})^2\\ \end{aligned} We recognize some terms $MSE=\text{var}(𝐲)-2 β_1\text{cov}(𝐱,𝐲) + β_1^2 \text{var}(𝐱)$

## Replacing $$β_1$$

We have $$MSE=\text{var}(𝐲)-2 β_1\text{cov}(𝐱,𝐲) + β_1^2 \text{var}(𝐱)$$

Replacing $$β_1=\frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}$$ we get \begin{aligned} MSE&=\text{var}(𝐲)-2\frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\text{cov}(𝐱,𝐲) + \left(\frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\right)^2 \text{var}(𝐱)\\ &=\text{var}(𝐲)-2\frac{\text{cov}^2(𝐱, 𝐲)}{\text{var}(𝐱)} + \frac{\text{cov}^2 (𝐱, 𝐲)}{\text{var}(𝐱)}\\ &=\text{var}(𝐲) - \frac{\text{cov}^2 (𝐱, 𝐲)}{\text{var}(𝐱)}\\ \end{aligned}
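The final formula can be checked numerically. The sketch below uses hypothetical toy data (not the Brain Cancer data) and helper names `mean`, `cov` and `var` that are assumptions of this example. It computes the MSE once from the squared errors and once from $$\text{var}(𝐲) - \text{cov}^2(𝐱,𝐲)/\text{var}(𝐱)$$, using population formulas that divide by $$n$$.

```python
def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance (divides by n)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


def var(a):
    """Population variance, the covariance of a variable with itself."""
    return cov(a, a)


# Hypothetical toy data, used only to illustrate the identity
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Coefficients used in the derivation
beta1 = cov(x, y) / var(x)
beta0 = mean(y) - beta1 * mean(x)

# MSE from the definition: average of the squared errors
mse_direct = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y)) / len(x)

# MSE from the closed form
mse_formula = var(y) - cov(x, y) ** 2 / var(x)

print(mse_direct, mse_formula)
```

Both values should agree up to floating-point rounding.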

## Interpretation

We have two ways to summarize our dataset

• Model 0. With a single value: the average
• Model 1. With two values: a straight line

Each one has its own Mean Squared Error

• $$\text{MSE}_0$$ for model 0
• $$\text{MSE}_1$$ for model 1

What are the values of $$\text{MSE}_0$$ and $$\text{MSE}_1$$?

## This time instead of

$\text{MSE}_0=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2=\text{var}(𝐲)$ we have $\text{MSE}_1=\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}$

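This already shows that the mean square error of model 1 is never larger than that of model 0, because the subtracted term $$\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}$$ cannot be negative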
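The comparison between the two models can also be seen from their predictions. In this sketch, model 0 predicts the average for every point and model 1 predicts the fitted line. The toy data and the helper name `mse` are hypothetical, chosen only for illustration.

```python
def mse(y, predictions):
    """Average squared difference between observed and predicted values."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)


# Hypothetical toy data, not the Brain Cancer data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Slope and intercept of the straight line (cov / var; the 1/n factors cancel)
sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sum_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sum_xy / sum_xx
beta0 = y_bar - beta1 * x_bar

# Model 0: one value, the average. Model 1: two values, a straight line.
predictions_0 = [y_bar for _ in y]
predictions_1 = [beta0 + beta1 * xi for xi in x]

mse_0 = mse(y, predictions_0)
mse_1 = mse(y, predictions_1)
print(f"MSE_0 = {mse_0:.3f}, MSE_1 = {mse_1:.3f}")
```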

## Relative improvement

The mean squared error of the new model is smaller, but by how much? $\frac{\text{MSE}_0-\text{MSE}_1}{\text{MSE}_0}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)}$ This number represents the fraction of the original variance that is explained by the new model (often quoted as a percentage)

The name of this number is $$R^2$$

Does it sound familiar?

## Correlation coefficient

The Pearson correlation coefficient between two variables is $r=\frac{\text{cov}(𝐱,𝐲)}{\text{sdev}(𝐱)\text{sdev}(𝐲)}$ so we have in this case that $R^2 = r^2$ This is valid for linear models with a single independent variable. With more independent variables, $$R^2$$ is generally not the square of any single pairwise correlation coefficient
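The equality $$R^2 = r^2$$ can be checked on a small example. This is a sketch under stated assumptions: the toy data are made up, the helper names `mean`, `cov` and `var` are hypothetical, and variance, covariance and standard deviation all divide by $$n$$. The value of $$r$$ would be the same with $$n-1$$, as long as the same divisor is used everywhere.

```python
import math


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance (divides by n)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


def var(a):
    """Population variance, the covariance of a variable with itself."""
    return cov(a, a)


# Hypothetical toy data, not the Brain Cancer data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Pearson correlation coefficient
r = cov(x, y) / (math.sqrt(var(x)) * math.sqrt(var(y)))

# R^2 as the relative reduction of the mean squared error
mse_0 = var(y)
mse_1 = var(y) - cov(x, y) ** 2 / var(x)
r_squared = (mse_0 - mse_1) / mse_0

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, R^2 = {r_squared:.4f}")
```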

## Interpretation of $$R^2$$

$R^2=\frac{\text{MSE}_0-\text{MSE}_1}{\text{MSE}_0}=\frac{\text{var}(𝐲)-\text{MSE}_1}{\text{var}(𝐲)}$

Here $$\text{MSE}_0$$ is the mean squared error of the simple model (the average), which equals the variance of $$𝐲$$

$$\text{MSE}_1$$ is the mean squared error of the advanced model (the straight line)

$$R^2$$ is the fraction of the variance that is removed by the advanced model

## Doing the math

\begin{aligned} R^2 &=\frac{\text{MSE}_0-\text{MSE}_1}{\text{MSE}_0}\\ &=\frac{\text{var}(𝐲)-\left(\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}\right)}{\text{var}(𝐲)}\\ &=\frac{\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}}{\text{var}(𝐲)}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)} \end{aligned}

## Practice

Calculate these values using the Brain Cancer data

• $$R^2$$ using sum of squared errors
• $$R^2$$ using variance and covariance
• $$r$$ using CORREL() function
• Compare $$R^2$$ against $$r^2$$