Class 15: Linear Models

Methodology of Scientific Research

Andrés Aravena, PhD

April 4, 2023

We have some data

Representative value

Last week we saw that the best “average” depends on how we measure “how bad” a representative value is

This measurement is done using an “error function”, and we find the value that makes it smallest

We discussed two error functions:

  • absolute error, which is minimized by the median
  • squared error, which is minimized by the mean
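The two claims above can be checked numerically. The following is a minimal sketch, assuming a small made-up sample (not data from the class) and a simple search over a grid of candidate values; the function names are illustrative only.

```python
from statistics import mean, median

# Hypothetical sample, chosen only for illustration
y = [2.0, 3.0, 5.0, 7.0, 13.0]

def absolute_error(beta, ys):
    """Sum of absolute differences between each value and beta."""
    return sum(abs(yi - beta) for yi in ys)

def squared_error(beta, ys):
    """Sum of squared differences between each value and beta."""
    return sum((yi - beta) ** 2 for yi in ys)

# Candidate values from 0.00 to 15.00 in steps of 0.01
candidates = [k / 100 for k in range(0, 1501)]

# Pick the candidate that makes each error function smallest
best_abs = min(candidates, key=lambda b: absolute_error(b, y))
best_sq = min(candidates, key=lambda b: squared_error(b, y))

print(best_abs, median(y))  # the two numbers should agree
print(best_sq, mean(y))     # the two numbers should agree
```

A grid search only finds the best value among the candidates tried, so it is a check and not a proof.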

Squared Error is a function

A function is a rule that takes a number and gives another number

In this case \(\mathrm{SE}(β)\) takes \(β\) and returns the squared error

Viewing functions as plots

\(\mathrm{SE}(β)\) has a different slope at each position

Straight tangent lines

The red and blue lines correspond to equations like \[y=ax+b\] where

  • \(a\) is the slope
  • \(b\) is the place where the line intercepts the y-axis

This is called the equation of a straight line, or linear equation

Each position has a slope

For any value \(β\) we can find the slope of \(\mathrm{SE}\) at position \(β\)

This is called the derivative of \(\mathrm{SE}\)

Some simple cases

  • derivative of \(a⋅β\) is \(a\)
  • derivative of a constant is 0
  • derivative of \(β^2\) is \(2β\)
  • derivative of \(β^n\) is \(n⋅β^{n-1}\)
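These rules can be compared against slopes measured numerically. The sketch below assumes a central-difference approximation (a standard technique, not something introduced in the class); the helper `slope`, the coefficient `a`, and the position `beta` are hypothetical choices.

```python
def slope(f, x, h=1e-6):
    """Approximate the slope of f at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

a = 3.0      # hypothetical coefficient
beta = 2.0   # position where the slopes are measured

slope_linear = slope(lambda b: a * b, beta)    # rule says: a
slope_const = slope(lambda b: 7.0, beta)       # rule says: 0
slope_square = slope(lambda b: b ** 2, beta)   # rule says: 2*beta
slope_cube = slope(lambda b: b ** 3, beta)     # rule says: 3*beta**2

print(slope_linear, slope_const, slope_square, slope_cube)
```

The printed values should be close to 3, 0, 4, and 12, up to small rounding errors.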

The smallest value has slope=0

To find the smallest value we use derivatives

To find the value of \(β\) that minimizes \(\mathrm{SE}(β)\) we

  • Calculate the derivative of \(\mathrm{SE}(β)\), written as \[\frac{d\mathrm{SE}}{dβ}(β)\]

  • Find \(β\) such that the derivative is zero. That is, solve \[\frac{d\mathrm{SE}}{dβ}(β)=0\]
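
As a small worked example of these two steps, take the function \(f(β)=β^2-6β+10\), which is hypothetical and chosen only for illustration. Using the simple cases of derivatives, together with the fact that the derivative of a sum is the sum of the derivatives, we get \[\frac{df}{dβ}(β)=2β-6\] Solving \(2β-6=0\) gives \(β=3\), where \(f(3)=1\). Nearby values give larger results, for example \(f(2)=f(4)=2\)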

That is how we find the mean

We have \(\mathrm{SE}(β)=\sum_i (y_i-β)^2\). The derivative is \[\frac{d}{dβ} \mathrm{SE}(β)= -2\sum_i (y_i - β)= -2\sum_i y_i + 2nβ\] Then we need to find \(β\) such that this derivative is zero. Multiplying by \(-1\), that is the same as \[2\sum_i y_i - 2nβ = 0\]

Solving for the best \(β\)

The equation we want to solve is \[2\sum_i y_i - 2nβ = 0\]

The smallest squared error is obtained when \[β = \frac{1}{n} \sum_i y_i\]
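
The intermediate step is to move \(2nβ\) to the other side and divide by \(2n\) \[2\sum_i y_i = 2nβ \quad\Longrightarrow\quad β = \frac{1}{n}\sum_i y_i\] For instance, with the made-up values \(𝐲=(2, 3, 5, 7, 13)\) we have \(n=5\) and \(\sum_i y_i=30\), so the best value is \(β=30/5=6\), which is the mean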

In the cancer survival data

Solid line marks the mean

Using more information

Survival time versus tumor volume

We will use the tumor volume for a better description of survival time

Squared error of a straight line

\[SE(β_0, β_1) = \sum_i (y_i - β_0 - β_1 x_i)^2\] This time we need two derivatives, one for each parameter \[\begin{aligned} \frac{d}{dβ_0} SE(β_0, β_1) &= -2\sum_i (y_i - β_0 - β_1 x_i)\\ \frac{d}{dβ_1} SE(β_0, β_1) &= -2\sum_i (y_i - β_0 - β_1 x_i)⋅x_i \end{aligned}\] Each one must be equal to 0, so the constant factor in front of each sum does not affect the solution
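To make the two-parameter error function concrete, here is a minimal sketch that evaluates it for two candidate lines. The data are made up for illustration (they are not the survival data), and the function name `squared_error` is an assumption.

```python
# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def squared_error(beta0, beta1, xs, ys):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(xs, ys))

# A horizontal line at the mean of y (slope 0) versus a tilted line
se_flat = squared_error(4.0, 0.0, x, y)
se_tilted = squared_error(2.2, 0.6, x, y)

print(se_flat, se_tilted)
```

For these values the tilted line has the smaller squared error, so using \(x\) gives a better description of \(y\) than the mean alone.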

First equation

The first equation to solve is \(\frac{d}{dβ_0} SE(β_0, β_1) = 0\)

That is, we look for \(β_0\) such that \[\sum_i (y_i - β_0 - β_1 x_i) = 0\] where the constant factor in front of the sum has been dropped, because the other side is 0. Expanding the parentheses we have \[\sum_i y_i - \sum_i β_0 - \sum_i β_1 x_i = 0\]

First solution

If \(\sum_i y_i - \sum_i β_0 - \sum_i β_1 x_i = 0\) then \[\sum_i y_i = n\cdot β_0 + β_1\sum_i x_i\] Therefore, dividing by \(n\), we have \[\overline{𝐲} =β_0 + β_1 \overline{𝐱}\] In other words, we have \[β_0 = \overline{𝐲} - β_1 \overline{𝐱}\]

Second equation

We want to solve \(\frac{d}{dβ_1} SE(β_0, β_1) = 0\)

That is, we want to find \(β_1\) such that \[\sum_i (y_i - β_0 - β_1 x_i)⋅x_i = 0\] where the constant factor in front of the sum has been dropped, because the other side is 0. Expanding the parentheses we have \[\sum_i x_i y_i - \sum_i β_0 x_i - \sum_i β_1 x_i^2 = 0\]

Tidying up

We have \[\sum_i x_iy_i - β_0\sum_i x_i - β_1\sum_i x_i^2 = 0\] It is convenient to divide everything by \(n\) \[\begin{aligned} \frac 1 n \sum_i x_iy_i - β_0\frac 1 n \sum_i x_i - β_1\frac 1 n \sum_i x_i^2 &= 0\\ \frac 1 n \sum_i x_iy_i - β_0\overline{𝐱}- β_1 \frac 1 n \sum_i x_i^2 &=0\\ \end{aligned}\]

Replacing \(β_0\)

Since \(β_0 = \overline{𝐲} - β_1 \overline{𝐱}\) we have \[\begin{aligned} \frac 1 n \sum_i x_iy_i - (\overline{𝐲} - β_1 \overline{𝐱}) \overline{𝐱} - β_1\frac 1 n \sum_i x_i^2 &=0\\ \frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲} + β_1 \overline{𝐱}^2 - β_1\frac 1 n \sum_i x_i^2 &=0\\ \left(\frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲}\right) - β_1 \left( \frac 1 n \sum_i x_i^2 - \overline{𝐱}^2\right) &=0\\ \end{aligned}\]

We have seen this before

The best \(β_1\) is the solution of \[\left(\frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲}\right) - β_1 \left( \frac 1 n \sum_i x_i^2 - \overline{𝐱}^2\right) =0\] We saw these formulas last class: the first parenthesis is the covariance of \(𝐱\) and \(𝐲\), and the second is the variance of \(𝐱\) \[\text{cov}(𝐱, 𝐲) - β_1 \text{var}(𝐱) =0\]
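The match between the two parentheses and the covariance and variance can be checked numerically. This sketch assumes the versions that divide by \(n\), as in the derivation, and uses made-up data. Note that `statistics.variance` divides by \(n-1\), so the population version `statistics.pvariance` is used here.

```python
from statistics import mean, pvariance

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
x_bar, y_bar = mean(x), mean(y)

# Shortcut forms, as they appear in the derivation
cov_shortcut = sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar
var_shortcut = sum(xi ** 2 for xi in x) / n - x_bar ** 2

# Definitions as averages of products of deviations from the mean
cov_definition = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
var_definition = pvariance(x)  # population variance: divides by n

print(cov_shortcut, cov_definition)  # should agree up to rounding
print(var_shortcut, var_definition)  # should agree up to rounding
```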

Solution

If \[\text{cov}(𝐱, 𝐲) - β_1 \text{var}(𝐱) =0\] then, as long as \(\text{var}(𝐱)\) is not zero, the best \(β_1\) is \[β_1 = \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}\]

Summary

The best straight line is

\[y = β_0 + β_1 x\] where \[\begin{aligned} β_0 &= \overline{𝐲} - β_1 \overline{𝐱}\\ β_1 &= \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)} \end{aligned}\]
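The summary formulas translate almost directly into code. The following is a minimal sketch, assuming made-up data and a hypothetical helper named `fit_line`; it also checks that both derivative equations are satisfied, that is, that the residuals and the residuals multiplied by \(x\) each add up to zero.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (beta0, beta1) for the least-squares line y = beta0 + beta1*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cov_xy = sum(xi * yi for xi, yi in zip(xs, ys)) / n - x_bar * y_bar
    var_x = sum(xi ** 2 for xi in xs) / n - x_bar ** 2
    beta1 = cov_xy / var_x           # fails if all x values are equal
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0, beta1 = fit_line(x, y)

# Both derivative equations should hold at the solution
residuals = [yi - beta0 - beta1 * xi for xi, yi in zip(x, y)]
sum_residuals = sum(residuals)
sum_residuals_x = sum(r * xi for r, xi in zip(residuals, x))

print(beta0, beta1)
print(sum_residuals, sum_residuals_x)  # both should be close to 0
```

For these values the intercept is close to 2.2 and the slope is close to 0.6. The function divides by the variance of \(x\), so it cannot be used when all the \(x\) values are the same.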

Graphically