Measured = Real ⊕ Instrument ⊕ Noise ⊕ Diversity \[y_m(x,i,t) = f(x)⊕I(x,t)⊕v(i)⊕b(x,i)\]

Normalized = Real ⊕ Noise ⊕ Diversity \[y(x,i) = f(x)⊕v(i)⊕b(x,i)\]

\[w_N(h,s,i) = w(h,s)⊕v(i)⊕b(h,s,i)\]

\[\underbrace{y_i}_{w_N(h,s,i)} = \underbrace{β_0 + β_1 x_i + β_2 s_i}_{w(h,s)} + \underbrace{e_i}_{v(i)⊕b(h,s,i)}\]

(To make it more general, we use \(x_i\) to represent height)

This way the formulas will not be *only* about weight and
height

The \(e_i\) values will be important later

They are called “errors” or “residuals”

Since there is no systematic error, we must have \[\sum_i e_i=0\] (otherwise we can change \(\beta_0\) or \(\beta_1\))

sex | birth_date | height | weight | handedness | hand_span | |
---|---|---|---|---|---|---|

1 | Male | 1993-02-01 | 179 | 67 | Right | 15 |

2 | Female | 1998-05-21 | 168 | 55 | Right | 14 |

4 | Male | 1998-08-29 | 170 | 74 | Right | 25 |

5 | Female | 1998-05-03 | 162 | 68 | Right | 13 |

6 | Female | 1995-10-09 | 167 | 58 | Right | 18 |

7 | Female | 1997-09-19 | 174 | 72 | Right | 16 |

8 | Male | 1997-11-27 | 180 | 68 | Right | 19 |

9 | Female | 1999-01-02 | 162 | 58 | Right | 19 |

10 | Female | 1998-10-02 | 172 | 55 | Right | 20 |

11 | Male | 1997-05-18 | 181 | 81 | Right | 20 |

This is the simplest model for the set \(𝐲\)

\[y_i = β_0 + e_i\]

We look for a constant \(β_0\) representing all values

The error made by this approximation is \(e_i\)

The best \(β_0\) is found to be

\[\text{mean}(𝐲)=\overline{𝐲}=\frac{1}{n}\sum_i y_i\]

`[1] 66.36869`

This is the *mean square error* of the model \[MSE_0=\text{var}(𝐲) =\frac{1}{n}\sum_i (y_i -
\overline{𝐲})^2\]

`[1] 170.2504`

It is easy to see that taller people will be heavier

You need more bones to be taller, and more muscles to move

We will model using this expression \[y_i = β_0 + β_1 x_i + e_i\] where \[\begin{aligned} β_0 &= \overline{𝐲} - β_1 \overline{𝐱}\\ β_1 &= \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)} \end{aligned}\]

```
(Intercept) height
-84.007888 0.882157
```

This time instead of \[MSE_0=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2\] we have \[MSE_1=\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}\]

Thus, mean square error is better in model 1 than in model 0

The variance in the new model is better, but how much? \[\frac{MSE_0-MSE_1}{MSE_0}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)}\] This number represents the percentage of the original variance that is explained by the new model

The name of this number is \(R^2\)

The *Pearson correlation coefficient* between two variables is
\[r=\frac{\text{cov}(𝐱,𝐲)}{\text{sd}(𝐱)\text{sd}(𝐲)}\]
so we have **in this case** that \[R^2 = r^2\] This is valid for linear
models with a single independent variable. It will not be valid for
larger models

Our results include a *random* component (the \(e_i\))

Thus, the values of \(β_0\) and \(β_1\) are also random

But they are not far away from the *“real”* ones

More details in the next class

We can get a 95% or 99% confidence interval for the values

```
0.5 % 99.5 %
(Intercept) -132.7312228 -35.284553
height 0.5967649 1.167549
```

If the confidence interval does not contains 0, then the real value is not 0

“The evidence shows that the coefficient is not 0”

If the confidence interval contains 0,

then the real value may be 0

“We do not have evidence that the coefficient is ≠0”

```
Call:
lm(formula = weight ~ height, data = survey)
Residuals:
Min 1Q Median 3Q Max
-18.016 -7.225 0.216 7.396 35.630
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -84.0079 18.5438 -4.530 1.68e-05 ***
height 0.8822 0.1086 8.122 1.48e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.17 on 97 degrees of freedom
Multiple R-squared: 0.4048, Adjusted R-squared: 0.3986
F-statistic: 65.96 on 1 and 97 DF, p-value: 1.479e-12
```

The values \(β_0\) and \(β_1\) follow a Student’s t distribution

(We will show that later)

Thus, we can use a t-test to see if they are not zero

In this case we let the computer do it for us

Sex is a *factor*, so we get a *box plot*

We still have \(n\) individuals, but now we can split them in two groups

The groups are called “female” and “male”

There are \(n_0\) females and \(n_1\) males, so \(n=n_0+n_1.\)

The easiest way is to use a number \[s_i=\begin{cases}0\quad\text{if person }i\text{ is female}\\ 1\quad\text{if person }i\text{ is male}\end{cases}\]

It is easy to see that \(\sum_i s_i = n_1\)

We have now \[y_i = β_0 + β_1 s_i +
e_i\] So for all the *female* individuals, we have \[y_i = β_0 + e_i\] Taking averages we have
\[\frac{1}{n_0}\sum_{i\text{ female}}y_i =
β_0\] so \(β_0\) is the average
weight of *female* \[β_0=\text{mean}(𝐲\text{ female})\]

So for all the *male* individuals, we have \[y_i = β_0 + β_1 + e_i\] Taking averages we
have \[\frac{1}{n_1}\sum_{i\text{ male}}y_i =
β_0+ β_1\] In other words \[β_0 +
β_1=\text{mean}(𝐲\text{ male})\]

Wh have \[β_0=\text{mean}(𝐲\text{ female})\] and \[β_0 + β_1=\text{mean}(𝐲\text{ male})\] Replacing the value of \(β_0\) we have \[β_1=\text{mean}(𝐲\text{ male})-\text{mean}(𝐲\text{ female})\]

```
(Intercept) sexMale
59.57937 18.67063
0.5 % 99.5 %
(Intercept) 56.41406 62.74467
sexMale 13.42158 23.91969
```

`(Intercept)`

is the average female weight

`sexMale`

is the average extra weight for male

```
Call:
lm(formula = weight ~ sex, data = survey)
Residuals:
Min 1Q Median 3Q Max
-19.250 -5.579 -1.579 5.421 27.750
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.579 1.205 49.456 < 2e-16 ***
sexMale 18.671 1.998 9.346 3.47e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.562 on 97 degrees of freedom
Multiple R-squared: 0.4738, Adjusted R-squared: 0.4684
F-statistic: 87.34 on 1 and 97 DF, p-value: 3.47e-15
```

```
(Intercept) handednessRight
67.0000000 -0.7022472
0.5 % 99.5 %
(Intercept) 56.04894 77.95106
handednessRight -12.25216 10.84767
```

What are the coefficients

What can we conclude here?

```
Call:
lm(formula = weight ~ handedness, data = survey)
Residuals:
Min 1Q Median 3Q Max
-23.798 -10.649 -1.298 8.351 39.702
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.0000 4.1679 16.07 <2e-16 ***
handednessRight -0.7022 4.3958 -0.16 0.873
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.18 on 97 degrees of freedom
Multiple R-squared: 0.000263, Adjusted R-squared: -0.01004
F-statistic: 0.02552 on 1 and 97 DF, p-value: 0.8734
```