Choosing “the best” representative depends on how we measure “how bad it is”

Once we choose an *error function*, we look for the value that
gives the smallest error

(we say it *minimizes* the error function)

Median minimizes the absolute error

Mean minimizes the squared error
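These two claims can be checked numerically. A minimal sketch in Python, using a small made-up data set and a grid search over candidate representatives:

```python
import statistics

# Made-up data set (for illustration only)
y = [1.0, 2.0, 2.0, 3.0, 10.0]

def abs_error(beta):
    # total absolute error of representative beta
    return sum(abs(v - beta) for v in y)

def sq_error(beta):
    # total squared error of representative beta
    return sum((v - beta) ** 2 for v in y)

# Grid search over candidate representatives between 0 and 11
candidates = [i / 100 for i in range(0, 1101)]
best_abs = min(candidates, key=abs_error)
best_sq = min(candidates, key=sq_error)

print(best_abs, statistics.median(y))  # absolute error is smallest at the median
print(best_sq, statistics.mean(y))     # squared error is smallest at the mean
```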

We found that the average \(\bar{𝐲}\) is the value \(β\) that minimizes the squared error \[\mathrm{SE}_𝐲 (β)=\sum_i (y_i-β)^2\] This is our initial measure of “quality of representative”

Larger values of squared error are bad

What makes the squared error large?

- The data values \(y_i\) are more spread out
- There are more values \(y_i\) in the set

The first cause is good: it is what we want to measure

But the second is unfortunate

How can we correct it?

To compensate, we divide by the number of values \[\mathrm{MSE}_𝐲 (β)=\frac 1 n \sum_i (y_i-β)^2\]

The smallest MSE is achieved when \(β\) is the mean \(\bar{𝐲}\) \[\text{Smallest } \mathrm{MSE}_𝐲 (\bar{𝐲})=\frac 1 n \sum_i (y_i-\bar{𝐲})^2\]

This value is called the *variance* of \(𝐲\)

\[\begin{aligned} \mathrm{var}(𝐲)&=\frac 1 n \sum_i (y_i-\bar{𝐲})^2=\frac 1 n \sum_i (y_i^2-2\bar{𝐲}y_i+ \bar{𝐲}^2)\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}\frac 1 n \sum_i y_i+ \bar{𝐲}^2\frac 1 n \sum_i 1\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}\bar{𝐲}+ \bar{𝐲}^2\frac 1 n n\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}^2+ \bar{𝐲}^2\\ &=\frac 1 n \sum_i y_i^2-\bar{𝐲}^2\\ \end{aligned}\]

\[\mathrm{var}(𝐲)=\frac 1 n \sum_i (y_i-\bar{𝐲})^2=\frac 1 n \sum_i y_i^2-\bar{𝐲}^2\]

“The average of the squares minus the square of the average”
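The shortcut formula can be checked numerically; a minimal sketch with a small made-up data set:

```python
y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
n = len(y)
mean = sum(y) / n

# Definition: average squared deviation from the mean
var_def = sum((v - mean) ** 2 for v in y) / n

# Shortcut: average of the squares minus the square of the average
var_short = sum(v * v for v in y) / n - mean ** 2

print(var_def, var_short)  # both give the same value
```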

We saw that \[\frac 1 n \sum_i y_i^2≥\bar{𝐲}^2\] Therefore we always have \[\frac 1 n \sum_i y_i^2-\bar{𝐲}^2≥0\]

The units of the variance are the square of the original units

If \(𝐲\) is in meters, then \(\mathrm{var}(𝐲)\) is in squared meters

Often it is better to use the original units

In that case we use the *standard deviation*

\[\mathrm{sdev}(𝐲)=\sqrt{\mathrm{var}(𝐲)}\]
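A minimal sketch in Python of the variance/standard-deviation relation, with made-up lengths in meters:

```python
import math

y = [1.2, 1.5, 1.8, 2.1]  # made-up lengths, in meters
n = len(y)
mean = sum(y) / n

var = sum((v - mean) ** 2 for v in y) / n  # in square meters
sdev = math.sqrt(var)                      # back in meters
print(var, sdev)
```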

Suppose all values \(y_i\) are multiplied by a fixed constant \(k\)

\[\begin{aligned} \mathrm{var}(k⋅𝐲) &= k^2⋅\mathrm{var}(𝐲)\\ \mathrm{sdev}(k⋅𝐲) &= k⋅\mathrm{sdev}(𝐲) \end{aligned}\]

Multiplicative constants increase the variance quadratically

Standard deviation increases in direct proportion
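Both scaling rules can be verified numerically; a sketch with a made-up data set and constant:

```python
import math

def var(ys):
    # population variance: mean squared deviation from the mean
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys) / len(ys)

y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
k = 3.0
scaled = [k * v for v in y]

v_scaled = var(scaled)          # variance grows by k**2
s_scaled = math.sqrt(v_scaled)  # standard deviation grows by k
print(v_scaled, k ** 2 * var(y))
print(s_scaled, k * math.sqrt(var(y)))
```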

\[\begin{aligned} \mathrm{var}(𝐱+𝐲)&=\frac 1 n \sum_i (x_i+ y_i-\bar{𝐱}-\bar{𝐲})^2\\ &=\frac 1 n \sum_i ((x_i-\bar{𝐱})+ (y_i-\bar{𝐲}))^2\\ &=\frac 1 n \sum_i \left((x_i-\bar{𝐱})^2 +(y_i-\bar{𝐲})^2+ 2(x_i-\bar{𝐱})(y_i-\bar{𝐲})\right)\\ &=\frac 1 n \sum_i (x_i-\bar{𝐱})^2 +\frac 1 n \sum_i (y_i-\bar{𝐲})^2+ 2\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})\\ &=\mathrm{var}(𝐱) +\mathrm{var}(𝐲)+ 2\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲}) \end{aligned}\]

The expression \[\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})\] is called the *covariance* of \(𝐱\) and \(𝐲\)

We write it as \[\mathrm{cov}(𝐱,𝐲)\]

\[ \mathrm{var}(𝐱+𝐲)=\mathrm{var}(𝐱) +\mathrm{var}(𝐲)+ 2\mathrm{cov}(𝐱,𝐲) \]

The variance of the sum is the sum of the variances plus twice the covariance
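The identity above can be checked on a small made-up pair of data sets:

```python
def mean(v): return sum(v) / len(v)

def var(v):
    # population variance
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def cov(x, y):
    # average product of deviations from the means
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

x = [1.0, 2.0, 3.0, 4.0]  # made-up paired data
y = [2.0, 1.0, 4.0, 3.0]
s = [a + b for a, b in zip(x, y)]

lhs = var(s)
rhs = var(x) + var(y) + 2 * cov(x, y)
print(lhs, rhs)  # equal
```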

\[\begin{aligned} \frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})&=\frac 1 n \sum_i (x_i y_i-\bar{𝐱}y_i+x_i\bar{𝐲}-\bar{𝐱}\bar{𝐲})\\ &=\frac 1 n \sum_i x_i y_i-\frac 1 n \sum_i\bar{𝐱}y_i-\frac 1 n \sum_i x_i\bar{𝐲}+\frac 1 n \sum_i\bar{𝐱}\bar{𝐲}\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\frac 1 n \sum_i y_i - \bar{𝐲}\frac 1 n \sum_i x_i + \bar{𝐱}\bar{𝐲}\frac 1 n \sum_i 1\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}- \bar{𝐱}\bar{𝐲}+\bar{𝐱}\bar{𝐲}\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}\\ \end{aligned}\]

\[\mathrm{cov}(𝐱,𝐲)=\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}\]

The second formula is easier to calculate

“The average of the products minus the product of the averages”
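A numerical check of the shortcut formula for the covariance, on made-up paired data:

```python
x = [1.0, 2.0, 3.0, 4.0]  # made-up paired data
y = [2.0, 1.0, 4.0, 3.0]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Definition: average product of deviations from the means
cov_def = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

# Shortcut: average of the products minus the product of the averages
cov_short = sum(a * b for a, b in zip(x, y)) / n - mx * my

print(cov_def, cov_short)  # both give the same value
```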

If \(𝐱\) and \(𝐲\) go in the same direction,

then the covariance is positive

If \(𝐱\) and \(𝐲\) go in opposite directions,

then the covariance is negative

It is easy to see that, for any constants \(a\) and \(b\), we have \[\begin{aligned} \mathrm{cov}(a\, 𝐱,𝐲)&=a\, \mathrm{cov}(𝐱,𝐲)\\ \mathrm{cov}(𝐱, b\,𝐲)&=b\, \mathrm{cov}(𝐱,𝐲)\\ \mathrm{cov}(a\, 𝐱, b\,𝐲)&=ab\, \mathrm{cov}(𝐱,𝐲)\\ \end{aligned}\] It would be nice to have a “covariance” value that is independent of the scale
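The scaling behavior of the covariance can be verified directly; a sketch with made-up data and constants:

```python
def cov(x, y):
    # average product of deviations from the means
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

x = [1.0, 2.0, 3.0, 4.0]  # made-up paired data
y = [2.0, 1.0, 4.0, 3.0]
a, b = 2.0, 5.0

c = cov(x, y)
c_scaled = cov([a * v for v in x], [b * v for v in y])
print(c_scaled, a * b * c)  # rescaling both variables rescales the covariance
```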

One way to be independent of the scale is to use \[\mathrm{corr}(𝐱,𝐲)=\frac{\mathrm{cov}(𝐱,𝐲)}{\mathrm{sdev}(𝐱)\mathrm{sdev}(𝐲)}\]
This is the *correlation* between \(𝐱\) and \(𝐲\)

It is always a value between \(-1\) and \(1\)

(The proof is long and we do not need it in this course)
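Even without the proof, the bound can be checked numerically; a sketch with made-up data, including the two extreme cases:

```python
import math

def mean(v): return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def corr(x, y):
    # covariance rescaled by both standard deviations
    return cov(x, y) / (math.sqrt(var(x)) * math.sqrt(var(y)))

x = [1.0, 2.0, 3.0, 4.0]              # made-up data
c1 = corr(x, [2 * v + 1 for v in x])  # perfectly aligned: 1
c2 = corr(x, [-v for v in x])         # perfectly opposed: -1
c3 = corr(x, [2.0, 1.0, 4.0, 3.0])    # somewhere in between
print(c1, c2, c3)
```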