Our observations and experiments give us data
We want to say something about them
What can we say about this set of numbers?
How can we make a summary of all the values in a few numbers?
Assume we have a vector of \(n\) values \[𝐲=\{y_1, y_2, …, y_n \}\] If we want to describe the set \(𝐲\) with a single number \(x\), which would it be?
If we have to replace every \(y_i\) with a single number, which number is “the best”?
We should choose the one that is the “least wrong”
How can \(x\) be wrong?
Many alternatives to measure the error
and maybe others
Absolute error when \(x\) represents \(𝐲\) \[\mathrm{AE}(x)=\sum_i |y_i-x|\]
Which \(x\) minimizes absolute error?
Let’s make a spreadsheet to find which value of \(x\) minimizes the absolute error for the set
\[\{3,5,8\}\]
Let’s go to Google Sheets
We get the minimum absolute error when \(x=5\)
In other words, \(x\) is the median of \(𝐲\)
The median minimizes the absolute error
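The spreadsheet search can also be sketched in a few lines of Python. This is only an illustration: the function name `absolute_error` and the grid of candidates (0 to 10 in steps of 0.5) are choices made here, not part of the original exercise.

```python
def absolute_error(x, ys):
    """Sum of absolute differences between x and every value in ys."""
    return sum(abs(y - x) for y in ys)


ys = [3, 5, 8]

# Candidate values of x from 0 to 10 in steps of 0.5 (an arbitrary grid),
# playing the role of the rows of the spreadsheet.
candidates = [k / 2 for k in range(21)]

# One absolute error per candidate, like one spreadsheet column.
errors = {x: absolute_error(x, ys) for x in candidates}

# The candidate with the smallest error.
best_x = min(errors, key=errors.get)
print(best_x, errors[best_x])  # 5.0 5.0
```

On this grid the smallest error is reached at the middle value of the sorted set.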
We must sort all values, from smallest to largest, and pick the one in the middle
If there is an even number of values, two values (let’s say \(y_a\) and \(y_b\)) share the center
In that case the median is \[\frac{y_a + y_b}{2}\]
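The sorting rule, including the even case, could be written as the following sketch. The function name `median` is chosen here for illustration, and the code assumes the list is not empty.

```python
def median(ys):
    """Median by sorting; assumes ys has at least one value."""
    ordered = sorted(ys)          # smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: a single value sits in the middle.
        return ordered[mid]
    # Even number of values: average the two central values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([8, 3, 5]))      # 5
print(median([8, 3, 5, 10]))  # 6.5
```

Python's standard library offers `statistics.median`, which follows the same rule.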
Since we have to sort all values, this can take a lot of time
Before electronic computers, people had to sort things manually
It was impossible to do if you had too many values
Instead, people used methods that did not require sorting
The squared error when \(x\) represents \(𝐲\) is \[\mathrm{SE}(x)=\sum_i (y_i-x)^2\] Which \(x\) minimizes the squared error?
Let’s make a spreadsheet to find which value of \(x\) minimizes the squared error for the set
\[\{3,5,8\}\]
We can write \[\begin{aligned} \mathrm{SE}(x)&=\sum_i (y_i-x)^2 =\sum_i (y_i^2 - 2y_ix + x^2)\\ &=\sum_i y_i^2 - \sum_i 2 y_ix + \sum_i x^2\\ &=\sum_i y_i^2 - x\sum_i 2 y_i + n x^2\\ \end{aligned}\]
This is a second degree expression, corresponding to a parabola
We have \[\mathrm{SE}(x) =\underbrace{n}_a x^2 + \underbrace{\left(-2\sum_i y_i\right)}_b \, x+ \underbrace{\sum_i y_i^2}_c\] which has the form of \(ax^2+ bx + c\)
Let’s explore it in GeoGebra
When we have \(ax^2+ bx + c =0\) then the two roots are \[\begin{aligned} x_1 &= \frac{-b-\sqrt{b^2-4ac} }{2a}\\ x_2 &= \frac{-b+\sqrt{b^2-4ac} }{2a} \end{aligned}\] and the midpoint between them is \[\frac{x_1 + x_2}{2} = \frac{-b}{2a}\] The parabola is symmetric around this point, so when \(a>0\) this is where it reaches its minimum
We have \[\mathrm{SE}(x) =\underbrace{n}_a x^2 + \underbrace{\left(-2\sum_i y_i\right)}_b \, x+ \underbrace{\sum_i y_i^2}_c\] so the center point is \[\frac{-b}{2a}=\frac{2\sum_i y_i}{2n}=\frac{\sum_i y_i}{n}\]
We get the minimum squared error when \(x\) is the mean
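As a quick numerical check on the set \(\{3,5,8\}\), the sketch below builds the coefficients of the parabola and compares the center point \(-b/(2a)\) with the mean. Here `b` includes the minus sign, that is \(b=-2\sum_i y_i\); the variable names are chosen only for this illustration.

```python
import math

ys = [3, 5, 8]
n = len(ys)

# Coefficients of SE(x) = a*x^2 + b*x + c, with the sign carried by b.
a = n
b = -2 * sum(ys)
c = sum(y * y for y in ys)


def squared_error(x):
    """Squared error computed directly from its definition."""
    return sum((y - x) ** 2 for y in ys)


vertex = -b / (2 * a)   # center point of the parabola
mean = sum(ys) / n      # arithmetic mean of the set

print(a, b, c)                      # 3 -32 98
print(math.isclose(vertex, mean))   # True

# Moving a little to either side of the vertex makes the error larger.
print(squared_error(vertex) < squared_error(vertex - 0.1))  # True
print(squared_error(vertex) < squared_error(vertex + 0.1))  # True
```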
The arithmetic mean of \(𝐲\) is \[\text{mean}(𝐲) = \frac{1}{n}\sum_{i=1}^n y_i\] where \(n\) is the size of the set \(𝐲\).
Sometimes it is written as \(\bar{𝐲}\)
This value is usually called the mean, and sometimes the average
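In the squared error formula every term \((y_i-x)^2\) is non-negative, so \(\mathrm{SE}(x)≥0\) for every \(x\)
The parabola never goes below the horizontal axis, and it touches the axis only when all \(y_i\) are equal
Therefore, it cannot have two distinct real roots
That happens when \[b^2-4ac≤0\]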
We will use this result later
Replacing the values in \(b^2-4ac≤0\) we have \[\left(\sum_i 2 y_i\right)^2 - 4 n \sum_i y_i^2 ≤ 0\]
In other words, we must remember that \[\left(\sum_i y_i\right)^2 ≤ n\sum_i y_i^2\]
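The inequality can be checked on concrete numbers. The sketch below uses the set \(\{3,5,8\}\) from the earlier slides and, as an extra illustration chosen here, a set of identical values, where both sides become equal. The helper name `both_sides` is hypothetical.

```python
def both_sides(ys):
    """Return ((sum of y)^2, n * sum of y^2) for the values in ys."""
    n = len(ys)
    left = sum(ys) ** 2
    right = n * sum(y * y for y in ys)
    return left, right


left, right = both_sides([3, 5, 8])
print(left, right, left <= right)   # 256 294 True

# When every value is the same, the two sides coincide.
left_eq, right_eq = both_sides([4, 4, 4])
print(left_eq, right_eq)            # 144 144
```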
The error is \[\mathrm{SE}(x)=\sum_i (y_i-x)^2\]
To find the minimal value we take the derivative of \(\mathrm{SE}\)
\[\frac{d}{dx} \mathrm{SE}(x)= -2\sum_i (y_i - x)= 2nx - 2\sum_i y_i\]
A smooth function can only have a minimum where its derivative is zero
Now we find the value of \(x\) that makes the derivative equal to zero.
\[\frac{d}{dx} \mathrm{SE}(x)= 2nx - 2\sum_i y_i\]
Making this last formula equal to zero and solving for \(x\) we find that the best one is
\[x = \frac{1}{n} \sum_i y_i\]
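A small numerical experiment can confirm this result. The sketch below compares the analytic derivative of \(\mathrm{SE}\) with a finite-difference estimate at the mean of \(\{3,5,8\}\). The function names and the step `h` are choices made for this illustration.

```python
def squared_error(x, ys):
    """Squared error when x represents all values in ys."""
    return sum((y - x) ** 2 for y in ys)


def se_derivative(x, ys):
    """Analytic derivative of the squared error: -2 * sum(y_i - x)."""
    return -2 * sum(y - x for y in ys)


ys = [3, 5, 8]
mean = sum(ys) / len(ys)

# Central finite difference as an independent estimate of the slope.
h = 1e-6
numeric = (squared_error(mean + h, ys) - squared_error(mean - h, ys)) / (2 * h)

print(abs(se_derivative(mean, ys)) < 1e-9)  # True: slope is zero at the mean
print(abs(numeric) < 1e-6)                  # True: the estimate agrees

# The slope is negative to the left of the mean and positive to the right,
# so the mean is a minimum and not a maximum.
print(se_derivative(4, ys), se_derivative(6, ys))  # -8 4
```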
We do not need a lot of calculus
We show just some of the reasons why calculus is useful
All that, after the midterms
All values \(y_i\) are multiplied by a fixed constant \(k\)
\[\begin{aligned} \mathrm{mean}(k⋅𝐲) &= \frac{1}{n}\sum_{i=1}^n k⋅y_i\\ &= k⋅\frac{1}{n}\sum_{i=1}^n y_i\\ &= k⋅\mathrm{mean}(𝐲)\\ \end{aligned}\]
\[\begin{aligned} \mathrm{mean}(𝐱+𝐲) &= \frac{1}{n}\sum_{i=1}^n (x_i+y_i)\\ &= \frac{1}{n}\sum_{i=1}^n x_i + \frac{1}{n}\sum_{i=1}^n y_i\\ &= \mathrm{mean}(𝐱)+\mathrm{mean}(𝐲)\\ \end{aligned}\]
For any numbers \(a\) and \(b\) we have \[\mathrm{mean}(a 𝐱 + b𝐲) = a⋅\mathrm{mean}(𝐱)+b⋅\mathrm{mean}(𝐲)\]
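The identity can be tested on a small example. In the sketch below the vector `xs`, the constants `a` and `b`, and the helper `mean` are made-up values and names used only for illustration; `ys` is the set from the earlier slides.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


xs = [1, 2, 6]   # illustrative values, not from the slides
ys = [3, 5, 8]
a, b = 2, -3     # arbitrary constants

# Left side: mean of the element-wise combination a*x_i + b*y_i.
combined = [a * x + b * y for x, y in zip(xs, ys)]
left = mean(combined)

# Right side: the same combination applied to the two means.
right = a * mean(xs) + b * mean(ys)

print(math.isclose(left, right))  # True
```

The comparison uses `math.isclose` because floating-point division may introduce tiny rounding differences between the two sides.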
We say that the mean is linear (official name)
but a better name is additive
Comment about notation
We will work with sets of values, like \[\{y_1, y_2, …, y_n \}\] When we speak about the whole set, we write \(𝐲\) in boldface
Sometimes the order is important.
In that case we write it as a vector or tuple \[(y_1, y_2, …, y_n)\] With \(\{…\}\) the order doesn’t matter. With \((…)\) it matters