Last week we saw that “average” depends on how we measure “how bad is it”

This measurement is done using an “error function”, and we find the value that makes it smallest

We discussed two error functions:

- absolute error, which is minimized by the median
- squared error, which is minimized by the mean
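As a small numeric check of the two claims above, the sketch below tries many candidate values on a grid and keeps the one with the smallest error. The sample `ys`, the grid `candidates`, and the function names are assumptions made only for this illustration.

```python
from statistics import mean, median

# Hypothetical sample, chosen only for illustration.
ys = [1.0, 2.0, 3.0, 4.0, 10.0]

def absolute_error(beta, ys):
    """Sum of absolute differences between each value and beta."""
    return sum(abs(y - beta) for y in ys)

def squared_error(beta, ys):
    """Sum of squared differences between each value and beta."""
    return sum((y - beta) ** 2 for y in ys)

# Candidate values 0.00, 0.01, ..., 12.00
candidates = [k / 100 for k in range(0, 1201)]

best_abs = min(candidates, key=lambda b: absolute_error(b, ys))
best_sq = min(candidates, key=lambda b: squared_error(b, ys))

print("minimizer of absolute error:", best_abs, "median:", median(ys))
print("minimizer of squared error: ", best_sq, "mean:  ", mean(ys))
```

For this sample the two pairs should agree. A grid search only finds the best value among the candidates, so it is a check, not a proof.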

A function is a rule that takes a number and gives another number

In this case \(\mathrm{SE}(β)\) takes \(β\) and returns the squared error

The red and blue lines correspond to equations like \[y=ax+b\] where

- \(a\) is the *slope*
- \(b\) is the place where the line *intercepts* the y-axis

This is called the *equation of the straight line* or *linear equation*

For any value \(β\) we can find the slope of \(\mathrm{SE}\) at position \(β\)

This is called the *derivative* of \(\mathrm{SE}\)

Some simple cases

- derivative of \(a⋅β\) is \(a\)
- derivative of a constant is 0
- derivative of \(β^2\) is \(2β\)
- derivative of \(β^n\) is \(n⋅β^{n-1}\)
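The rules in this list can be checked numerically by estimating the slope from two nearby points. This is a minimal sketch: the helper `slope_at`, the step `h`, and the values of `beta`, `a` and `n` are arbitrary choices for the illustration.

```python
def slope_at(f, beta, h=1e-6):
    """Central-difference estimate of the slope of f at position beta."""
    return (f(beta + h) - f(beta - h)) / (2 * h)

beta = 1.5  # arbitrary position
a = 3.0     # arbitrary constant
n = 4       # arbitrary exponent

# Each entry pairs the numeric estimate with the value given by the rule.
checks = {
    "a*beta": (slope_at(lambda b: a * b, beta), a),
    "constant": (slope_at(lambda b: 7.0, beta), 0.0),
    "beta^2": (slope_at(lambda b: b ** 2, beta), 2 * beta),
    "beta^n": (slope_at(lambda b: b ** n, beta), n * beta ** (n - 1)),
}

for name, (numeric, rule) in checks.items():
    print(f"{name}: numeric {numeric:.6f}, rule {rule:.6f}")
```

The numeric estimate and the rule should agree up to a small rounding error.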

To find the value of \(β\) that minimizes \(\mathrm{SE}(β)\) we

- calculate the derivative of \(\mathrm{SE}(β)\), written as \[\frac{d\mathrm{SE}}{dβ}(β)\]
- find \(β\) such that the derivative is zero. That is, solve \[\frac{d\mathrm{SE}}{dβ}(β)=0\]

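We have \(\mathrm{SE}(β)=\sum_i (y_i-β)^2\). Each term \((y_i-β)^2\) has derivative \(-2(y_i-β)\), so the derivative is \[\frac{d}{dβ} \mathrm{SE}(β)= -2\sum_i (y_i - β)= -2\sum_i y_i + 2nβ\] Then we need to find \(β\) such that \(-2\sum_i y_i + 2nβ = 0\) or, multiplying by \(-1\), \[2\sum_i y_i - 2nβ = 0\]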

The equation we want to solve is \[2\sum_i y_i - 2nβ = 0\]

The smallest squared error is obtained when \[β = \frac{1}{n} \sum_i y_i\]
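A quick check of this result on a hypothetical sample: the derivative of \(\mathrm{SE}\) is a constant multiple of \(\sum_i (y_i - β)\), so the sketch below verifies that this sum is zero at the mean and that the squared error grows when \(β\) moves away from it. The data and the function names are assumptions for the illustration.

```python
# Hypothetical sample, chosen only for illustration.
ys = [1.0, 2.0, 3.0, 4.0, 10.0]
n = len(ys)

def se(beta):
    """Squared error of the single value beta."""
    return sum((y - beta) ** 2 for y in ys)

def residual_sum(beta):
    """Sum of (y_i - beta); the derivative of SE is a constant times this."""
    return sum(y - beta for y in ys)

beta_hat = sum(ys) / n  # the mean

print("mean:", beta_hat)
print("sum of (y_i - mean):", residual_sum(beta_hat))
print("SE at mean - 0.1:", se(beta_hat - 0.1))
print("SE at mean:      ", se(beta_hat))
print("SE at mean + 0.1:", se(beta_hat + 0.1))
```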

Solid line marks the mean

We will use the tumor volume \(x_i\) to give a better description of the survival time \(y_i\)

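\[\mathrm{SE}(β_0, β_1) = \sum_i (y_i - β_0 - β_1 x_i)^2\] This time we need two derivatives, one with respect to each parameter \[\begin{aligned} \frac{d}{dβ_0} \mathrm{SE}(β_0, β_1) &= -2\sum_i (y_i - β_0 - β_1 x_i)\\ \frac{d}{dβ_1} \mathrm{SE}(β_0, β_1) &= -2\sum_i (y_i - β_0 - β_1 x_i)⋅x_i \end{aligned}\] Each one must be equal to 0. Multiplying an equation by \(-1\) does not change its solutions, so the sign of the factor 2 does not matter when we set each derivative to 0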

The first equation to solve is \(\frac{d}{dβ_0} \mathrm{SE}(β_0, β_1) = 0\)

That is, we look for \(β_0\) such that \[2\sum_i (y_i - β_0 - β_1 x_i) = 0\] We can divide by 2 and expand the parenthesis \[\sum_i y_i - \sum_i β_0 - \sum_i β_1 x_i = 0\]

If \(\sum_i y_i - \sum_i β_0 - \sum_i β_1 x_i = 0\) then \[\sum_i y_i = n\cdot β_0 + β_1\sum_i x_i\] Therefore, dividing by \(n\), we have \[\overline{𝐲} =β_0 + β_1 \overline{𝐱}\] In other words, we have \[β_0 = \overline{𝐲} - β_1 \overline{𝐱}\]
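The condition solved here says that the differences \(y_i - β_0 - β_1 x_i\) add up to zero. The sketch below checks this for several slopes: whatever \(β_1\) is, choosing \(β_0 = \overline{𝐲} - β_1 \overline{𝐱}\) makes that sum vanish. The data are invented for the illustration and `intercept_for` is a hypothetical helper name.

```python
# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # e.g. tumor volumes
ys = [9.0, 7.0, 6.0, 4.0, 2.0]  # e.g. survival times
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

def intercept_for(beta1):
    """Best intercept for a given slope: y_bar - beta1 * x_bar."""
    return y_bar - beta1 * x_bar

residual_sums = []
for beta1 in (-2.0, 0.0, 1.0):  # arbitrary slopes
    beta0 = intercept_for(beta1)
    total = sum(y - beta0 - beta1 * x for x, y in zip(xs, ys))
    residual_sums.append(total)
    print(f"slope {beta1}: intercept {beta0:.3f}, sum of differences {total:.2e}")
```

Each printed sum should be zero up to rounding error. This also shows that the line always passes through the point \((\overline{𝐱}, \overline{𝐲})\).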

We want to solve \(\frac{d}{dβ_1} \mathrm{SE}(β_0, β_1) = 0\)

That is, we want to find \(β_1\) such that \[2\sum_i (y_i - β_0 - β_1 x_i)⋅x_i = 0\] Dropping the 2 and expanding the parenthesis we have \[\sum_i x_i y_i - \sum_i β_0 x_i - \sum_i β_1 x_i^2 = 0\]

We have \[\sum_i x_iy_i - β_0\sum_i x_i - β_1\sum_i x_i^2 = 0\] It is convenient to divide everything by \(n\) \[\begin{aligned} \frac 1 n \sum_i x_iy_i - β_0\frac 1 n \sum_i x_i - β_1\frac 1 n \sum_i x_i^2 &= 0\\ \frac 1 n \sum_i x_iy_i - β_0\overline{𝐱}- β_1 \frac 1 n \sum_i x_i^2 &=0\\ \end{aligned}\]

Since \(β_0 = \overline{𝐲} - β_1 \overline{𝐱}\) we have \[\begin{aligned} \frac 1 n \sum_i x_iy_i - (\overline{𝐲} - β_1 \overline{𝐱}) \overline{𝐱} - β_1\frac 1 n \sum_i x_i^2 &=0\\ \frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲} + β_1 \overline{𝐱}^2 - β_1\frac 1 n \sum_i x_i^2 &=0\\ \left(\frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲}\right) - β_1 \left( \frac 1 n \sum_i x_i^2 - \overline{𝐱}^2\right) &=0\\ \end{aligned}\]

The best \(β_1\) is the solution of \[\left(\frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲}\right) - β_1 \left( \frac 1 n \sum_i x_i^2 - \overline{𝐱}^2\right) =0\] We saw these formulas last class: the first parenthesis is the covariance of \(𝐱\) and \(𝐲\), and the second is the variance of \(𝐱\) \[\text{cov}(𝐱, 𝐲) - β_1 \text{var}(𝐱) =0\]

If \[\text{cov}(𝐱, 𝐲) - β_1 \text{var}(𝐱) =0\] then, provided \(\text{var}(𝐱)\) is not zero, the best \(β_1\) is \[β_1 = \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\]

The best straight line is

\[y = β_0 + β_1 x\] where \[\begin{aligned} β_0 &= \overline{𝐲} - β_1 \overline{𝐱}\\ β_1 &= \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)} \end{aligned}\]
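The two formulas can be turned into a few lines of code. This is a minimal sketch using only plain Python: `fit_line` and `se` are hypothetical names, the data are invented for the illustration, and the covariance and variance are computed with the \(\frac 1 n\) formulas from the derivation.

```python
def fit_line(xs, ys):
    """Return (beta0, beta1) of the least-squares line y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Covariance and variance with the 1/n formulas used in the derivation.
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    var_x = sum(x * x for x in xs) / n - x_bar ** 2
    beta1 = cov_xy / var_x  # requires that the xs are not all equal
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def se(beta0, beta1, xs, ys):
    """Squared error of the line y = beta0 + beta1 * x."""
    return sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # e.g. tumor volumes
ys = [9.0, 7.0, 6.0, 4.0, 2.0]  # e.g. survival times

beta0, beta1 = fit_line(xs, ys)
print(f"intercept {beta0:.3f}, slope {beta1:.3f}")
print(f"SE of fitted line:      {se(beta0, beta1, xs, ys):.3f}")
print(f"SE with shifted slope:  {se(beta0, beta1 + 0.1, xs, ys):.3f}")
```

If the covariance and the variance are both computed with \(\frac{1}{n-1}\) instead of \(\frac 1 n\), the slope does not change, because the factor cancels in the ratio.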