Assume we have a vector of \(n\) values \[𝐲=\{y_1, y_2, …, y_n \}\] If we want to describe the set \(𝐲\) with a single number \(x\), which would it be?

If we have to replace every \(y_i\) with a single number, which number is “the best”?

It is better to choose the one that is the “least wrong”

How can \(x\) be wrong?

There are many ways to measure the error:

- Number of times that \(x≠y_i\)
- Sum of absolute value of error
- Sum of the square of error

and maybe other ways
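The three measures listed above can be compared on a small example. This is a minimal sketch in Python; the data values and the function names are hypothetical, chosen only for illustration.

```python
# Hypothetical sample data, chosen only for illustration.
y = [2, 3, 3, 5, 12]

def mismatch_count(x, y):
    """Number of times that x differs from y_i."""
    return sum(1 for yi in y if yi != x)

def absolute_error(x, y):
    """Sum of the absolute values of the errors."""
    return sum(abs(yi - x) for yi in y)

def squared_error(x, y):
    """Sum of the squares of the errors."""
    return sum((yi - x) ** 2 for yi in y)

# Compare two candidate values of x under each measure.
for x in (3, 5):
    print(x, mismatch_count(x, y), absolute_error(x, y), squared_error(x, y))
```

Between these two candidates, the first two measures are smaller for \(x=3\) while the squared error is smaller for \(x=5\), so “the best” number depends on how the error is measured.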

Today we will use the square of the error

The squared error when \(x\) represents \(𝐲\) is \[\mathrm{SE}(x)=\sum_i (y_i-x)^2\] Which \(x\) minimizes the squared error?

We can write \[\begin{aligned} \mathrm{SE}(x)&=\sum_i (y_i-x)^2 =\sum_i (y_i^2 - 2y_ix + x^2)\\ &=\sum_i y_i^2 - \sum_i 2 y_ix + \sum_i x^2\\ &=\sum_i y_i^2 - x\sum_i 2 y_i + n x^2\\ \end{aligned}\]

This is a second degree expression, corresponding to a parabola

We have \[\mathrm{SE}(x) =\underbrace{n}_a x^2 + \underbrace{\left(-\sum_i 2 y_i\right)}_b \, x+ \underbrace{\sum_i y_i^2}_c\] which has the form of \(ax^2+ bx + c\)

Let’s explore it in GeoGebra

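When we have \(ax^2+ bx + c =0\) then the two roots are \[\begin{aligned} x_1 &= \frac{-b-\sqrt{b^2-4ac} }{2a}\\ x_2 &= \frac{-b+\sqrt{b^2-4ac} }{2a} \end{aligned}\] and the midpoint between them is \[\frac{x_1 + x_2}{2} = \frac{-b}{2a}\] A parabola is symmetric, so this midpoint is also the position of its vertex, even when the roots are not real numbers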

We have \[\mathrm{SE}(x) =\underbrace{n}_a x^2 + \underbrace{\left(-\sum_i 2 y_i\right)}_b \, x+ \underbrace{\sum_i y_i^2}_c\] so the center point is \[\frac{-b}{2a}=\frac{\sum_i 2 y_i}{2n}=\frac{\sum_i y_i}{n}\]

Since \(a=n\) is positive, the parabola opens upward and its center point is the lowest point. We get the minimum squared error when \(x\) is the mean
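The parabola argument can be checked numerically. The following is a minimal sketch in Python with hypothetical data; here \(b\) is taken to include the minus sign, that is \(b=-2\sum_i y_i\), so that the expression has exactly the form \(ax^2+bx+c\).

```python
# Hypothetical sample data, for illustration only.
y = [2, 3, 3, 5, 12]
n = len(y)

# Coefficients of SE(x) = a*x**2 + b*x + c
a = n
b = -2 * sum(y)               # the minus sign is part of b
c = sum(yi ** 2 for yi in y)

def squared_error(x):
    """Squared error computed directly from the definition."""
    return sum((yi - x) ** 2 for yi in y)

def parabola(x):
    """The same squared error, written as a second-degree expression."""
    return a * x ** 2 + b * x + c

vertex = -b / (2 * a)         # center point of the parabola
mean = sum(y) / n

print(vertex, mean)
print(squared_error(vertex), parabola(vertex))
```

Both printed pairs should agree: the center point coincides with the mean, and the two ways of computing the squared error give the same value.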

The *arithmetic mean* of \(𝐲\) is \[\text{mean}(𝐲) = \frac{1}{n}\sum_{i=1}^n y_i\] where \(n\) is the size of the set \(𝐲\).

Sometimes it is written as \(\bar{𝐲}\)

This value is usually called the *mean*, sometimes the *average*

Using calculus

A function is a rule that takes a number and gives another number

In this case \(\mathrm{SE}(β)\) takes \(β\) and returns the squared error

The red and blue lines correspond to equations like \[y=ax+b\] where

- \(a\) is the *slope*
- \(b\) is the place where the line *intercepts* the y-axis

This is called the *equation of the straight line* or *linear equation*

For any value \(β\) we can find the slope of \(\mathrm{SE}\) at position \(β\)

This is called the *derivative* of \(\mathrm{SE}\)

- derivative of \(a⋅β\) is \(a\)
- derivative of a constant is 0
- derivative of \(β^2\) is \(2β\)
- derivative of \(β^n\) is \(n⋅β^{n-1}\)
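The rules above can be checked numerically by measuring the slope directly. This is a minimal sketch in Python; the helper `slope` and the step size `h` are assumptions made for illustration, and the result is an approximation, not an exact derivative.

```python
def slope(f, beta, h=1e-6):
    """Approximate the slope of f at beta with a central difference."""
    return (f(beta + h) - f(beta - h)) / (2 * h)

beta = 2.0

# derivative of a*beta is a (here a = 3, so the rule predicts 3)
print(slope(lambda t: 3 * t, beta))

# derivative of a constant is 0
print(slope(lambda t: 7.0, beta))

# derivative of beta**2 is 2*beta (the rule predicts 4 at beta = 2)
print(slope(lambda t: t ** 2, beta))

# derivative of beta**n is n*beta**(n-1) (with n = 5 the rule predicts 80)
print(slope(lambda t: t ** 5, beta))
```

Each printed value should be very close to the value predicted by the corresponding rule.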

In general, we can use Wolfram Alpha (https://www.wolframalpha.com/)

We focus on the idea, not on the technique

To find the value of \(β\) that minimizes \(\mathrm{SE}(β)\) we follow two steps:

Calculate the derivative of \(\mathrm{SE}(β)\), written as \[\frac{d\mathrm{SE}}{dβ}(β)\]

Find \(β\) such that the derivative is zero. That is, solve \[\frac{d\mathrm{SE}}{dβ}(β)=0\]

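We have \(\mathrm{SE}(β)=\sum_i (y_i-β)^2\). Each term \((y_i-β)^2\) has derivative \(-2(y_i-β)\), so the derivative is \[\frac{d}{dβ} \mathrm{SE}(β)= -2\sum_i (y_i - β)= -2\sum_i y_i + 2nβ\] Then we need to find \(β\) such that \[-2\sum_i y_i + 2nβ = 0\] or, multiplying by \(-1\), \[2\sum_i y_i - 2nβ = 0\]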

The equation we want to solve is \[2\sum_i y_i - 2nβ = 0\]

Dividing by 2 and moving \(nβ\) to the other side gives \(nβ=\sum_i y_i\). The smallest squared error is obtained when \[β = \frac{1}{n} \sum_i y_i\] which is again the mean
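The solution of the equation can be verified on a small example. This is a minimal sketch in Python; the data values are hypothetical and the variable names are chosen only for illustration.

```python
# Hypothetical sample data, for illustration only.
y = [1.5, 2.0, 4.0, 4.5]
n = len(y)

def squared_error(beta):
    """Squared error when beta represents every value in y."""
    return sum((yi - beta) ** 2 for yi in y)

# Solve 2*sum(y) - 2*n*beta = 0 for beta.
beta_best = (2 * sum(y)) / (2 * n)

# Plug the solution back into the equation; the result should be zero.
residual = 2 * sum(y) - 2 * n * beta_best

print(beta_best, residual)

# Moving away from beta_best in either direction increases the error.
print(squared_error(beta_best - 0.1),
      squared_error(beta_best),
      squared_error(beta_best + 0.1))
```

The middle value of the last line should be the smallest of the three.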

The *variance* of \(𝐲\) is the average squared error when \(𝐲\) is represented by its mean \(\bar{𝐲}\). It can be simplified as follows \[\begin{aligned} \mathrm{var}(𝐲)&=\frac 1 n \sum_i (y_i-\bar{𝐲})^2=\frac 1 n \sum_i (y_i^2-2\bar{𝐲}y_i+ \bar{𝐲}^2)\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}\frac 1 n \sum_i y_i+ \bar{𝐲}^2\frac 1 n \sum_i 1\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}\bar{𝐲}+ \bar{𝐲}^2\frac 1 n n\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}^2+ \bar{𝐲}^2\\ &=\frac 1 n \sum_i y_i^2-\bar{𝐲}^2\\ \end{aligned}\]

\[\mathrm{var}(𝐲)=\frac 1 n \sum_i (y_i-\bar{𝐲})^2=\frac 1 n \sum_i y_i^2-\bar{𝐲}^2\]

“The average of the squares minus the square of the average”
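Both expressions for the variance can be computed and compared. This is a minimal sketch in Python with hypothetical data; the variable names are assumptions made for illustration.

```python
import math

# Hypothetical sample data, for illustration only.
y = [2, 3, 3, 5, 12]
n = len(y)
mean = sum(y) / n

# Definition: average of the squared differences to the mean.
var_definition = sum((yi - mean) ** 2 for yi in y) / n

# Shortcut: average of the squares minus the square of the average.
var_shortcut = sum(yi ** 2 for yi in y) / n - mean ** 2

print(var_definition, var_shortcut)
print(math.isclose(var_definition, var_shortcut))
```

The two values are compared with `math.isclose` because floating-point rounding can make them differ in the last digits. When the values are very large compared to their spread, the shortcut can lose precision, so the definition may be the safer choice in code.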

\[\begin{aligned} \mathrm{var}(𝐱+𝐲)&=\frac 1 n \sum_i (x_i+ y_i-\bar{𝐱}-\bar{𝐲})^2\\ &=\frac 1 n \sum_i ((x_i-\bar{𝐱})+ (y_i-\bar{𝐲}))^2\\ &=\frac 1 n \sum_i \left((x_i-\bar{𝐱})^2 +(y_i-\bar{𝐲})^2+ 2(x_i-\bar{𝐱})(y_i-\bar{𝐲})\right)\\ &=\frac 1 n \sum_i (x_i-\bar{𝐱})^2 +\frac 1 n \sum_i (y_i-\bar{𝐲})^2+ 2\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})\\ &=\mathrm{var}(𝐱) +\mathrm{var}(𝐲)+ 2\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲}) \end{aligned}\]

The expression \[\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})\] is called the *covariance* of \(𝐱\) and \(𝐲\)

We write it as \[\mathrm{cov}(𝐱,𝐲)\]

\[ \mathrm{var}(𝐱+𝐲)=\mathrm{var}(𝐱) +\mathrm{var}(𝐲)+ 2\mathrm{cov}(𝐱,𝐲) \]

The variance of the sum is the sum of the variances plus twice the covariance
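The identity for the variance of a sum can be checked on a small example. This is a minimal sketch in Python; the helper functions `mean`, `var` and `cov` and the data are hypothetical, written only to mirror the formulas above.

```python
import math

def mean(v):
    return sum(v) / len(v)

def var(v):
    """Variance with the 1/n convention used in the text."""
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

def cov(u, v):
    """Covariance with the 1/n convention used in the text."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

# Hypothetical sample data, for illustration only.
x = [1, 2, 3, 4]
y = [2, 1, 4, 3]
total = [xi + yi for xi, yi in zip(x, y)]

left = var(total)
right = var(x) + var(y) + 2 * cov(x, y)

print(left, right)
print(math.isclose(left, right))
```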

\[\begin{aligned} \frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})&=\frac 1 n \sum_i (x_i y_i-\bar{𝐱}y_i+x_i\bar{𝐲}-\bar{𝐱}\bar{𝐲})\\ &=\frac 1 n \sum_i x_i y_i-\frac 1 n \sum_i\bar{𝐱}y_i-\frac 1 n \sum_i x_i\bar{𝐲}+\frac 1 n \sum_i\bar{𝐱}\bar{𝐲}\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\frac 1 n \sum_i y_i - \bar{𝐲}\frac 1 n \sum_i x_i + \bar{𝐱}\bar{𝐲}\frac 1 n \sum_i 1\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}- \bar{𝐱}\bar{𝐲}+\bar{𝐱}\bar{𝐲}\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}\\ \end{aligned}\]

\[\mathrm{cov}(𝐱,𝐲)=\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}\]

The second formula is easier to calculate

“The average of the products minus the product of the averages”
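The two formulas for the covariance can be compared directly. This is a minimal sketch in Python; the function names and the data are hypothetical.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov_definition(u, v):
    """Average of the products of the differences to the means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

def cov_shortcut(u, v):
    """Average of the products minus the product of the averages."""
    products = [ui * vi for ui, vi in zip(u, v)]
    return mean(products) - mean(u) * mean(v)

# Hypothetical sample data, for illustration only.
x = [1, 2, 3, 4]
y = [2, 4, 5, 9]

print(cov_definition(x, y), cov_shortcut(x, y))
print(math.isclose(cov_definition(x, y), cov_shortcut(x, y)))
```

The shortcut needs only one pass over the data, since the means do not have to be known before the products are accumulated.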

If \(𝐱\) and \(𝐲\) go in the same direction (large \(x_i\) tend to come with large \(y_i\)), then the covariance is positive

If \(𝐱\) and \(𝐲\) go in opposite directions (large \(x_i\) tend to come with small \(y_i\)), then the covariance is negative
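The sign of the covariance can be seen on two small examples. This is a minimal sketch in Python; the data and the names `y_up` and `y_down` are hypothetical, chosen so that one vector grows with \(𝐱\) and the other decreases.

```python
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Covariance with the 1/n convention used in the text."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

# Hypothetical sample data, for illustration only.
x = [1, 2, 3, 4]
y_up = [2, 4, 5, 9]     # grows when x grows
y_down = [9, 5, 4, 2]   # decreases when x grows

print(cov(x, y_up))     # positive
print(cov(x, y_down))   # negative
```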

It is easy to see that, for any constants \(a\) and \(b\), we have \[\begin{aligned} \mathrm{cov}(a\, 𝐱,𝐲)&=a\, \mathrm{cov}(𝐱,𝐲)\\ \mathrm{cov}(𝐱, b\,𝐲)&=b\, \mathrm{cov}(𝐱,𝐲)\\ \mathrm{cov}(a\, 𝐱, b\,𝐲)&=ab\, \mathrm{cov}(𝐱,𝐲)\\ \end{aligned}\] It would be nice to have a “covariance” value that is independent of the scale

One way to be independent of the scale is to use \[\mathrm{corr}(𝐱,𝐲)=\frac{\mathrm{cov}(𝐱,𝐲)}{\mathrm{sdev}(𝐱)\mathrm{sdev}(𝐲)}\] where \(\mathrm{sdev}(𝐱)=\sqrt{\mathrm{var}(𝐱)}\) is the standard deviation. This is the *correlation* between \(𝐱\) and \(𝐲\)

It is always a value between -1 and 1
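The effect of the scale on covariance and correlation can be compared numerically. This is a minimal sketch in Python; the helper functions, the data and the constants \(a=10\) and \(b=0.5\) are hypothetical. Both constants are positive here; a negative constant would flip the sign of the correlation.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Covariance with the 1/n convention used in the text."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

def sdev(v):
    """Standard deviation: square root of the variance, cov(v, v)."""
    return math.sqrt(cov(v, v))

def corr(u, v):
    return cov(u, v) / (sdev(u) * sdev(v))

# Hypothetical sample data and scale factors, for illustration only.
x = [1, 2, 3, 4]
y = [2, 4, 5, 9]
a, b = 10, 0.5
ax = [a * xi for xi in x]
by = [b * yi for yi in y]

print(cov(x, y), cov(ax, by))     # the covariance changes by a factor a*b
print(corr(x, y), corr(ax, by))   # the correlation stays the same
```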

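Now we want to represent \(𝐲\) with a straight line \(β_0 + β_1 x_i\) instead of a single number. The squared error is \[\mathrm{SE}(β_0, β_1) = \sum_i (y_i - β_0 - β_1 x_i)^2\] This time we need two derivatives \[\begin{aligned} \frac{d}{dβ_0} \mathrm{SE}(β_0, β_1) &= -2\sum_i (y_i - β_0 - β_1 x_i)\\ \frac{d}{dβ_1} \mathrm{SE}(β_0, β_1) &= -2\sum_i (y_i - β_0 - β_1 x_i)⋅x_i \end{aligned}\] Each one must be equal to 0. Multiplying an equation by \(-1\) does not change its solutions, so the factor \(-2\) can be written as \(2\) when solving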

The first equation to solve is \(\frac{d}{dβ_0} \mathrm{SE}(β_0, β_1) = 0\)

That is, we look for \(β_0\) such that \[2\sum_i (y_i - β_0 - β_1 x_i) = 0\] We can divide by 2 and expand the parentheses \[\sum_i y_i - \sum_i β_0 - \sum_i β_1 x_i = 0\]

If \(\sum_i y_i - \sum_i β_0 - \sum_i β_1 x_i = 0\) then \[\sum_i y_i = n\cdot β_0 + β_1\sum_i x_i\] Therefore, dividing by \(n\), we have \[\overline{𝐲} =β_0 + β_1 \overline{𝐱}\] In other words, we have \[β_0 = \overline{𝐲} - β_1 \overline{𝐱}\]

The second equation to solve is \(\frac{d}{dβ_1} \mathrm{SE}(β_0, β_1) = 0\)

That is, we want to find \(β_1\) such that \[2\sum_i (y_i - β_0 - β_1 x_i)⋅x_i = 0\] Dropping the 2 and expanding the parentheses we have \[\sum_i x_i y_i - \sum_i β_0 x_i - \sum_i β_1 x_i^2 = 0\]

We have \[\sum_i x_iy_i - β_0\sum_i x_i - β_1\sum_i x_i^2 = 0\] It is convenient to divide everything by \(n\) \[\begin{aligned} \frac 1 n \sum_i x_iy_i - β_0\frac 1 n \sum_i x_i - β_1\frac 1 n \sum_i x_i^2 &= 0\\ \frac 1 n \sum_i x_iy_i - β_0\overline{𝐱}- β_1 \frac 1 n \sum_i x_i^2 &=0\\ \end{aligned}\]

Since \(β_0 = \overline{𝐲} - β_1 \overline{𝐱}\) we have \[\begin{aligned} \frac 1 n \sum_i x_iy_i - (\overline{𝐲} - β_1 \overline{𝐱}) \overline{𝐱} - β_1\frac 1 n \sum_i x_i^2 &=0\\ \frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲} + β_1 \overline{𝐱}^2 - β_1\frac 1 n \sum_i x_i^2 &=0\\ \left(\frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲}\right) - β_1 \left( \frac 1 n \sum_i x_i^2 - \overline{𝐱}^2\right) &=0\\ \end{aligned}\]

The best \(β_1\) is the solution of \[\left(\frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲}\right) - β_1 \left( \frac 1 n \sum_i x_i^2 - \overline{𝐱}^2\right) =0\] We saw these formulas last class: the first parenthesis is the covariance and the second is the variance \[\text{cov}(𝐱, 𝐲) - β_1 \text{var}(𝐱) =0\]

If \[\text{cov}(𝐱, 𝐲) - β_1 \text{var}(𝐱) =0\] then the best \(β_1\) is \[β_1 = \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\]

The best straight line is

\[y = β_0 + β_1 x\] where \[\begin{aligned} β_0 &= \overline{𝐲} - β_1 \overline{𝐱}\\ β_1 &= \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)} \end{aligned}\]
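The two formulas can be turned into a short program. This is a minimal sketch in Python; the function names `fit_line` and `squared_error` and the data are hypothetical, and the code assumes that the values in \(𝐱\) are not all equal, so that \(\text{var}(𝐱)\) is not zero.

```python
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Covariance with the 1/n convention used in the text."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

def var(v):
    return cov(v, v)

def fit_line(x, y):
    """Return (beta0, beta1) for the line y = beta0 + beta1 * x."""
    beta1 = cov(x, y) / var(x)           # assumes var(x) > 0
    beta0 = mean(y) - beta1 * mean(x)
    return beta0, beta1

def squared_error(beta0, beta1, x, y):
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical sample data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.2, 3.9, 6.1, 8.0, 9.8]

beta0, beta1 = fit_line(x, y)
residuals = [yi - beta0 - beta1 * xi for xi, yi in zip(x, y)]

print(beta0, beta1)
# Both sums come from the two derivative equations and should be close to zero.
print(sum(residuals), sum(r * xi for r, xi in zip(residuals, x)))
```

As a quick check, data that lies exactly on a line, such as `fit_line([0, 1, 2, 3], [1, 3, 5, 7])`, should give back an intercept close to 1 and a slope close to 2.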