Last week we saw that the “average” depends on how we measure “how bad it is”
This measurement is done with an “error function”, and the average is the value that makes the error smallest
We discussed two error functions: the absolute error and the squared error. Here we focus on the squared error
A function is a rule that takes a number and gives another number
In this case \(\mathrm{SE}(β)\) takes \(β\) and returns the squared error
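To make this concrete, here is a minimal Python sketch of \(\mathrm{SE}\) as a function, using the definition \(\mathrm{SE}(β)=\sum_i (y_i-β)^2\) that we work with below; the data values are made up for illustration:

```python
# Minimal sketch: SE as a function of beta, for a made-up list of values y
y = [3.0, 5.0, 4.0, 6.0]

def squared_error(beta):
    # SE(beta) = sum over i of (y_i - beta)^2
    return sum((yi - beta) ** 2 for yi in y)

print(squared_error(4.0))  # one input number -> one output number: 6.0
print(squared_error(4.5))  # a different beta gives a different error: 5.0
```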
The red and blue lines correspond to equations of the form \[y=ax+b\] where \(a\) is the slope and \(b\) is the intercept
This is called the equation of a straight line, or a linear equation
For any value \(β\) we can find the slope of \(\mathrm{SE}\) at position \(β\)
This is called the derivative of \(\mathrm{SE}\)
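A derivative can be checked numerically. The sketch below estimates the slope of \(\mathrm{SE}\) at a given \(β\) with a symmetric finite difference; the data and the step size h are assumptions for illustration:

```python
# Minimal sketch: numerical slope of SE at a point beta
y = [3.0, 5.0, 4.0, 6.0]  # made-up data

def squared_error(beta):
    return sum((yi - beta) ** 2 for yi in y)

def slope_of_SE(beta, h=1e-6):
    # symmetric finite difference: (SE(beta + h) - SE(beta - h)) / (2h)
    return (squared_error(beta + h) - squared_error(beta - h)) / (2 * h)

print(slope_of_SE(4.0))  # negative: SE is decreasing at beta = 4
print(slope_of_SE(4.5))  # about 0: the slope vanishes at the mean of y
```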
To find the value of \(β\) that minimizes \(\mathrm{SE}(β)\) we proceed in two steps
First, calculate the derivative of \(\mathrm{SE}(β)\), written as \[\frac{d\mathrm{SE}}{dβ}(β)\]
Then, find \(β\) such that the derivative is zero. That is, solve \[\frac{d\mathrm{SE}}{dβ}(β)=0\]
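As a warm-up (an illustrative case, not from our data): with a single observation \(y_1\) the squared error is \(\mathrm{SE}(β) = (y_1 - β)^2\), and the recipe gives \[\frac{d\mathrm{SE}}{dβ}(β) = -2(y_1 - β), \qquad -2(y_1 - β) = 0 \iff β = y_1\] so with one observation the best value is the observation itself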
We have \(\mathrm{SE}(β)=\sum_i (y_i-β)^2\). The derivative is \[\frac{d}{dβ} \mathrm{SE}(β)= -2\sum_i (y_i - β)= 2nβ - 2\sum_i y_i\] Then we need to find \(β\) such that \[2nβ - 2\sum_i y_i = 0\]
The equation we want to solve is \[2nβ - 2\sum_i y_i = 0\]
Dividing by \(2n\), we get that the smallest squared error is obtained when \[β = \frac{1}{n} \sum_i y_i\] that is, when \(β\) is the mean of \(𝐲\)
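For example, with the made-up values \(𝐲 = (3, 5, 4, 6)\) the minimizer is \[β = \frac{3+5+4+6}{4} = 4.5\] and indeed \(\mathrm{SE}(4.5) = 5\) is smaller than, say, \(\mathrm{SE}(4) = 6\)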
(Figure: the solid line marks the mean)
We will use the tumor volume to get a better description of the survival time
\[\mathrm{SE}(β_0, β_1) = \sum_i (y_i - β_0 - β_1 x_i)^2\] This time we need two derivatives \[\begin{aligned} \frac{d}{dβ_0} \mathrm{SE}(β_0, β_1) &= -2\sum_i (y_i - β_0 - β_1 x_i)\\ \frac{d}{dβ_1} \mathrm{SE}(β_0, β_1) &= -2\sum_i (y_i - β_0 - β_1 x_i)⋅x_i \end{aligned}\] Each one must be equal to 0
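Before solving these equations, we can sanity-check the two derivative formulas numerically. This is a minimal sketch with made-up \(x\), \(y\) values and an assumed step size h; each formula should agree with its finite-difference estimate:

```python
# Minimal sketch: finite-difference check of the two derivative formulas
x = [1.0, 2.0, 3.0, 4.0]  # made-up data
y = [2.1, 3.9, 6.2, 7.8]

def SE(b0, b1):
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

def d_b0(b0, b1):  # formula: -2 * sum(y_i - b0 - b1 * x_i)
    return -2 * sum(yi - b0 - b1 * xi for xi, yi in zip(x, y))

def d_b1(b0, b1):  # formula: -2 * sum((y_i - b0 - b1 * x_i) * x_i)
    return -2 * sum((yi - b0 - b1 * xi) * xi for xi, yi in zip(x, y))

h, b0, b1 = 1e-6, 0.5, 1.5
print(d_b0(b0, b1), (SE(b0 + h, b1) - SE(b0 - h, b1)) / (2 * h))  # should match
print(d_b1(b0, b1), (SE(b0, b1 + h) - SE(b0, b1 - h)) / (2 * h))  # should match
```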
The first equation to solve is \(\frac{d}{dβ_0} \mathrm{SE}(β_0, β_1) = 0\)
That is, we look for \(β_0\) such that \[-2\sum_i (y_i - β_0 - β_1 x_i) = 0\] We can divide by \(-2\) and expand the parentheses \[\sum_i y_i - \sum_i β_0 - \sum_i β_1 x_i = 0\]
If \(\sum_i y_i - \sum_i β_0 - \sum_i β_1 x_i = 0\) then \[\sum_i y_i = n\cdot β_0 + β_1\sum_i x_i\] Therefore, dividing by \(n\), we have \[\overline{𝐲} =β_0 + β_1 \overline{𝐱}\] In other words, we have \[β_0 = \overline{𝐲} - β_1 \overline{𝐱}\]
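In particular, since \(β_0 = \overline{𝐲} - β_1 \overline{𝐱}\), the best-fitting line always passes through the point of means \((\overline{𝐱}, \overline{𝐲})\)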
We want to solve \(\frac{d}{dβ_1} \mathrm{SE}(β_0, β_1) = 0\)
That is, we want to find \(β_1\) such that \[-2\sum_i (y_i - β_0 - β_1 x_i)⋅x_i = 0\] Dropping the \(-2\) and expanding the parentheses we have \[\sum_i x_i y_i - \sum_i β_0 x_i - \sum_i β_1 x_i^2 = 0\]
We have \[\sum_i x_iy_i - β_0\sum_i x_i - β_1\sum_i x_i^2 = 0\] It is convenient to divide everything by \(n\) \[\begin{aligned} \frac 1 n \sum_i x_iy_i - β_0\frac 1 n \sum_i x_i - β_1\frac 1 n \sum_i x_i^2 &= 0\\ \frac 1 n \sum_i x_iy_i - β_0\overline{𝐱}- β_1 \frac 1 n \sum_i x_i^2 &=0\\ \end{aligned}\]
Since \(β_0 = \overline{𝐲} - β_1 \overline{𝐱}\) we have \[\begin{aligned} \frac 1 n \sum_i x_iy_i - (\overline{𝐲} - β_1 \overline{𝐱}) \overline{𝐱} - β_1\frac 1 n \sum_i x_i^2 &=0\\ \frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲} + β_1 \overline{𝐱}^2 - β_1\frac 1 n \sum_i x_i^2 &=0\\ \left(\frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲}\right) - β_1 \left( \frac 1 n \sum_i x_i^2 - \overline{𝐱}^2\right) &=0\\ \end{aligned}\]
The best \(β_1\) is the solution of \[\left(\frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲}\right) - β_1 \left( \frac 1 n \sum_i x_i^2 - \overline{𝐱}^2\right) =0\] We saw these formulas last class: they are the covariance of \(𝐱\) and \(𝐲\) and the variance of \(𝐱\), so the equation becomes \[\text{cov}(𝐱, 𝐲) - β_1 \text{var}(𝐱) =0\]
If \[\text{cov}(𝐱, 𝐲) - β_1 \text{var}(𝐱) =0\] Then the best \(β_1\) is \[β_1 = \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\]
The best straight line is
\[y = β_0 + β_1 x\] where \[\begin{aligned} β_0 &= \overline{𝐲} - β_1 \overline{𝐱}\\ β_1 &= \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)} \end{aligned}\]
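To close, here is a minimal Python sketch that computes \(β_0\) and \(β_1\) with these formulas; the data are made up, standing in for tumor volume \(𝐱\) and survival time \(𝐲\), and the result is compared with numpy.polyfit as an independent check:

```python
import numpy as np

# Made-up example data (stand-ins for tumor volume x and survival time y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Covariance and variance with the same 1/n formulas used in the derivation
cov_xy = (x * y).mean() - x_bar * y_bar
var_x = (x ** 2).mean() - x_bar ** 2

beta1 = cov_xy / var_x         # slope: cov(x, y) / var(x)
beta0 = y_bar - beta1 * x_bar  # intercept: line passes through (x_bar, y_bar)

print(beta0, beta1)
print(np.polyfit(x, y, 1))     # [slope, intercept]; should agree
```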