Methodology of Scientific Research

Summary of last class

Choosing “the best” representative depends on how we measure “how bad it is”

Once we choose an error function, we look for the value that gives the smallest error

(we say it minimizes the error function)

Median minimizes the absolute error

Mean minimizes the squared error
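As a numerical sanity check (a sketch with made-up data, not from the slides), we can scan candidate values of $$β$$ on a grid and confirm which one minimizes each error function:

```python
# Sketch with made-up data: scan candidate values of beta on a grid and
# check which one minimizes each error function.
from statistics import mean, median

y = [1, 2, 2, 3, 10]

def abs_error(beta):          # sum of absolute errors
    return sum(abs(v - beta) for v in y)

def sq_error(beta):           # sum of squared errors
    return sum((v - beta) ** 2 for v in y)

candidates = [b / 100 for b in range(0, 1201)]   # grid 0.00 .. 12.00
best_abs = min(candidates, key=abs_error)
best_sq = min(candidates, key=sq_error)

print(best_abs, median(y))    # 2.0 2 -- the median minimizes absolute error
print(best_sq, mean(y))       # 3.6 3.6 -- the mean minimizes squared error
```

Note how the outlier 10 pulls the mean up to 3.6 while the median stays at 2.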

Dispersion

How good is the average?

We found that the average $$\bar{𝐲}$$ is the value $$β$$ that minimizes the squared error $\mathrm{SE}_𝐲 (β)=\sum_i (y_i-β)^2$ This is our initial measure of “quality of representative”

Larger values of squared error are bad

What makes the squared error large?

SE can grow for two reasons

• Data values $$y_i$$ become more spread out
• There are more values $$y_i$$ in the set

The first reason is good: spread is exactly what we want to measure

But the second is unfortunate: the error grows just because there are more values

How can we correct it?

Mean Squared Error

To compensate, we divide by the number of values $\mathrm{MSE}_𝐲 (β)=\frac 1 n \sum_i (y_i-β)^2$

The smallest MSE is achieved when $$β$$ is the mean $$\bar{𝐲}$$ $\text{Smallest } \mathrm{MSE}_𝐲 (\bar{𝐲})=\frac 1 n \sum_i (y_i-\bar{𝐲})^2$

This value is called the variance of $$𝐲$$, written $$\mathrm{var}(𝐲)$$
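A quick check on invented data (a sketch): Python's `statistics.pvariance` computes exactly this $$n$$-divided variance, and any $$β$$ other than the mean gives a larger MSE.

```python
# Sketch: MSE evaluated at the mean equals the (population) variance;
# any other beta gives a strictly larger MSE.
from statistics import mean, pvariance

y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # invented data, mean 5.0

def mse(beta):
    return sum((v - beta) ** 2 for v in y) / len(y)

print(mse(mean(y)))       # 4.0 -- the variance of y
print(pvariance(y))       # 4.0 -- same value, divided by n
print(mse(mean(y) + 1))   # 5.0 -- moving away from the mean increases MSE
```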

Alternative variance formula

\begin{aligned} \mathrm{var}(𝐲)&=\frac 1 n \sum_i (y_i-\bar{𝐲})^2=\frac 1 n \sum_i (y_i^2-2\bar{𝐲}y_i+ \bar{𝐲}^2)\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}\frac 1 n \sum_i y_i+ \bar{𝐲}^2\frac 1 n \sum_i 1\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}\bar{𝐲}+ \bar{𝐲}^2\frac 1 n n\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}^2+ \bar{𝐲}^2\\ &=\frac 1 n \sum_i y_i^2-\bar{𝐲}^2\\ \end{aligned}

To remember

$\mathrm{var}(𝐲)=\frac 1 n \sum_i (y_i-\bar{𝐲})^2=\frac 1 n \sum_i y_i^2-\bar{𝐲}^2$

“The average of the squares minus the square of the average”
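The two formulas can be checked against each other numerically (a minimal sketch with invented data):

```python
# Sketch: both variance formulas give the same number.
from statistics import mean

y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(y)
ybar = mean(y)

var_direct = sum((v - ybar) ** 2 for v in y) / n       # definition
var_shortcut = sum(v * v for v in y) / n - ybar ** 2   # avg of squares - square of avg
print(var_direct, var_shortcut)   # 4.0 4.0
```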

From last class

We saw that $\frac 1 n \sum_i y_i^2≥\bar{𝐲}^2$ Therefore we always have $\frac 1 n \sum_i y_i^2-\bar{𝐲}^2≥0$

Standard deviation

The units of the variance are squared

If $$𝐲$$ is in meters, then $$\mathrm{var}(𝐲)$$ is in squared meters

Often it is better to use the original units

In that case we use the standard deviation

$\mathrm{sdev}(𝐲)=\sqrt{\mathrm{var}(𝐲)}$

Change of units

Values change when we change units

All values $$y_i$$ are multiplied by a fixed constant $$k$$

\begin{aligned} \mathrm{var}(k⋅𝐲) &= k^2⋅\mathrm{var}(𝐲)\\ \mathrm{sdev}(k⋅𝐲) &= |k|⋅\mathrm{sdev}(𝐲) \end{aligned}

Multiplicative constants scale the variance quadratically

The standard deviation scales in direct proportion
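These scaling rules are easy to verify numerically (a sketch; the data and the unit conversion are made up):

```python
# Sketch: converting meters to centimeters (k = 100) scales the variance
# by k^2 and the standard deviation by k.
from statistics import pstdev, pvariance

y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # say, meters
k = 100.0                                       # meters -> centimeters
ky = [k * v for v in y]

print(pvariance(ky), k ** 2 * pvariance(y))   # 40000.0 40000.0
print(pstdev(ky), k * pstdev(y))              # 200.0 200.0
```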

Sum of two vectors

\begin{aligned} \mathrm{var}(𝐱+𝐲)&=\frac 1 n \sum_i (x_i+ y_i-\bar{𝐱}-\bar{𝐲})^2\\ &=\frac 1 n \sum_i ((x_i-\bar{𝐱})+ (y_i-\bar{𝐲}))^2\\ &=\frac 1 n \sum_i \left((x_i-\bar{𝐱})^2 +(y_i-\bar{𝐲})^2+ 2(x_i-\bar{𝐱})(y_i-\bar{𝐲})\right)\\ &=\frac 1 n \sum_i (x_i-\bar{𝐱})^2 +\frac 1 n \sum_i (y_i-\bar{𝐲})^2+ 2\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})\\ &=\mathrm{var}(𝐱) +\mathrm{var}(𝐲)+ 2\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲}) \end{aligned}

What is this extra term?

The expression $\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})$ is called covariance of $$𝐱$$ and $$𝐲$$

We write it as $\mathrm{cov}(𝐱,𝐲)$

Then the variance of the sum is

$\mathrm{var}(𝐱+𝐲)=\mathrm{var}(𝐱) +\mathrm{var}(𝐲)+ 2\mathrm{cov}(𝐱,𝐲)$

The variance of the sum is the sum of the variances plus twice the covariance
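The identity can be confirmed on a small example (made-up vectors; a sketch, not part of the slides):

```python
# Sketch: var(x + y) equals var(x) + var(y) + 2 cov(x, y) on made-up data.
from statistics import mean, pvariance

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
n = len(x)

xbar, ybar = mean(x), mean(y)
cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n

s = [a + b for a, b in zip(x, y)]             # the vector x + y
print(pvariance(s))                           # 4.0
print(pvariance(x) + pvariance(y) + 2 * cov)  # 4.0
```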

Alternative expression

\begin{aligned} \frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})&=\frac 1 n \sum_i (x_i y_i-\bar{𝐱}y_i-x_i\bar{𝐲}+\bar{𝐱}\bar{𝐲})\\ &=\frac 1 n \sum_i x_i y_i-\frac 1 n \sum_i\bar{𝐱}y_i-\frac 1 n \sum_i x_i\bar{𝐲}+\frac 1 n \sum_i\bar{𝐱}\bar{𝐲}\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\frac 1 n \sum_i y_i - \bar{𝐲}\frac 1 n \sum_i x_i + \bar{𝐱}\bar{𝐲}\frac 1 n \sum_i 1\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}- \bar{𝐱}\bar{𝐲}+\bar{𝐱}\bar{𝐲}\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}\\ \end{aligned}

Covariance

$\mathrm{cov}(𝐱,𝐲)=\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}$

The second formula is easier to calculate

“The average of the products minus the product of the averages”

Interpretation of Covariance

If $$𝐱$$ and $$𝐲$$ go in the same direction,
then the covariance is positive

If $$𝐱$$ and $$𝐲$$ go in opposite directions,
then the covariance is negative
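A small illustration with made-up vectors, one moving with $$𝐱$$ and one moving against it:

```python
# Sketch with made-up vectors: one that moves with x and one that moves
# against it, giving covariances of opposite sign.
x = [1.0, 2.0, 3.0, 4.0]
up = [10.0, 20.0, 30.0, 40.0]     # same direction as x
down = [40.0, 30.0, 20.0, 10.0]   # opposite direction

def cov(a, b):
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    return sum((u - abar) * (v - bbar) for u, v in zip(a, b)) / n

print(cov(x, up))     # 12.5 (positive)
print(cov(x, down))   # -12.5 (negative)
```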

Covariance under change of scale

It is easy to see that, for any constants $$a$$ and $$b$$, we have \begin{aligned} \mathrm{cov}(a\, 𝐱,𝐲)&=a\, \mathrm{cov}(𝐱,𝐲)\\ \mathrm{cov}(𝐱, b\,𝐲)&=b\, \mathrm{cov}(𝐱,𝐲)\\ \mathrm{cov}(a\, 𝐱, b\,𝐲)&=ab\, \mathrm{cov}(𝐱,𝐲)\\ \end{aligned} It would be nice to have a “covariance” value that is independent of the scale

Correlation

One way to be independent of the scale is to use $\mathrm{corr}(𝐱,𝐲)=\frac{\mathrm{cov}(𝐱,𝐲)}{\mathrm{sdev}(𝐱)\mathrm{sdev}(𝐲)}$ This is the correlation between $$𝐱$$ and $$𝐲$$

It is always a value between $$-1$$ and $$1$$

(The proof is long and we do not need it in this course)
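We can still check both properties numerically on invented data (a sketch): the correlation lies in $$[-1, 1]$$ and does not change when one vector is rescaled.

```python
# Sketch: correlation on made-up data lies in [-1, 1] and is unchanged
# when one of the vectors is rescaled.
from statistics import mean, pstdev

def corr(a, b):
    n = len(a)
    abar, bbar = mean(a), mean(b)
    cov = sum((u - abar) * (v - bbar) for u, v in zip(a, b)) / n
    return cov / (pstdev(a) * pstdev(b))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

print(corr(x, y))                     # 0.6 (up to rounding)
print(corr([10 * v for v in x], y))   # same value: the scale does not matter
```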