# Systems Biology

## Summary of last class

Measured = Real ⊕ Instrument ⊕ Noise ⊕ Diversity $y_m(x,i,t) = f(x)⊕I(x,t)⊕n(i)⊕b(x,i)$

Normalized = Real ⊕ Noise ⊕ Diversity $y(x,i) = f(x)⊕n(i)⊕b(x,i)$

## Weight

$w_N(h,s,i) = w(h,s)⊕n(i)⊕b(h,s,i)$

$\underbrace{y_i}_{w_N(h,s,i)} = \underbrace{β_0 + β_1 x_i + β_2 s_i}_{w(h,s)} + \underbrace{e_i}_{n(i)⊕b(h,s,i)}$

To make this class more general, we will use $$x_i$$ to represent height

This way the formulas will not be only about weight and height

## The error term

The $$e_i$$ values will be important later

They are called “errors” or “residuals”

Since there is no systematic error, we must have $\sum_i e_i=0$ (otherwise we could reduce the error by adjusting $$\beta_0$$ or $$\beta_1$$)
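This property is easy to check numerically. A minimal sketch, using made-up height/weight values in place of the survey data (the real data is not included here) and R's `lm()`:

```r
# Hypothetical height/weight values standing in for the survey data
x <- c(179, 168, 170, 162, 167, 174, 180, 162, 172, 181)
y <- c(67, 55, 74, 68, 58, 72, 68, 58, 55, 81)

# Fit the line with lm() and extract the residuals e_i
e <- resid(lm(y ~ x))

# The residuals of a least-squares fit sum to (numerically) zero
all.equal(sum(e), 0)  # TRUE
```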

## Data from CMB1

|    | sex    | birthdate  | height | weight | handedness | hand_span |
|----|--------|------------|--------|--------|------------|-----------|
| 1  | Male   | 1993-02-01 | 179    | 67     | Right      | 15        |
| 2  | Female | 1998-05-21 | 168    | 55     | Right      | 14        |
| 4  | Male   | 1998-08-29 | 170    | 74     | Right      | 25        |
| 5  | Female | 1998-05-03 | 162    | 68     | Right      | 13        |
| 6  | Female | 1995-10-09 | 167    | 58     | Right      | 18        |
| 7  | Female | 1997-09-19 | 174    | 72     | Right      | 16        |
| 8  | Male   | 1997-11-27 | 180    | 68     | Right      | 19        |
| 9  | Female | 1999-01-02 | 162    | 58     | Right      | 19        |
| 10 | Female | 1998-10-02 | 172    | 55     | Right      | 20        |
| 11 | Male   | 1997-05-18 | 181    | 81     | Right      | 20        |

## Basic descriptive statistics: Mean

This is the simplest model for 𝐲 $\text{mean}(𝐲)=\overline{𝐲}=\frac{1}{n}\sum_i y_i$

mean(survey$weight)
[1] 66.36869

## Variance: how bad is this model?

This is the mean square error of the model $MSE_0=\text{var}(𝐲)=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2$

mean(survey$weight^2) - mean(survey$weight)^2
[1] 170.2504
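These identities hold for any numeric vector. A quick sketch with made-up values (the survey data is not included here), checking that the $\frac{1}{n}$-style variance equals the shortcut $\text{mean}(𝐲^2)-\text{mean}(𝐲)^2$:

```r
# A small made-up weight vector (hypothetical values, not the survey data)
y <- c(67, 55, 74, 68, 58, 72, 68, 58, 55, 81)

# The mean is the simplest model for y
m <- mean(y)

# MSE_0 with the 1/n convention used on the slide
mse0 <- mean((y - m)^2)

# Same quantity via the shortcut mean(y^2) - mean(y)^2
mse0_shortcut <- mean(y^2) - mean(y)^2

all.equal(mse0, mse0_shortcut)  # TRUE
```

Note that R's built-in `var()` divides by $n-1$ instead of $n$, which is why the slide computes it by hand.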

## Modeling weight knowing height

Intuitively, taller people tend to be heavier

You need more bones to be taller, and more muscles to move

plot(weight ~ height, data=survey)

## Seems like a straight line

We will model using the expression $y_i = β_0 + β_1 x_i + e_i$ where $$\begin{aligned} β_0 &= \overline{𝐲} - β_1 \overline{𝐱}\\ β_1 &= \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)} \end{aligned}$$
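These closed-form estimates can be checked against `lm()` directly. A sketch with hypothetical data in place of the survey (note that R's $n-1$ convention in `cov()` and `var()` cancels in the ratio):

```r
# Made-up height/weight pairs (the survey data is not included here)
x <- c(179, 168, 170, 162, 167, 174, 180, 162, 172, 181)
y <- c(67, 55, 74, 68, 58, 72, 68, 58, 55, 81)

# Closed-form least-squares estimates from the slide
b1 <- cov(x, y) / var(x)     # denominator convention cancels in the ratio
b0 <- mean(y) - b1 * mean(x)

# Compare with what lm() computes
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))  # TRUE
```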

## In R

model_1 <- lm(weight ~ height, data=survey)
coef(model_1)
(Intercept)      height
-84.007888    0.882157 

## Prediction

plot(weight ~ height, data=survey)
points(predict(model_1) ~ height, data=survey, col="red", pch=16)

## How bad is this model?

This time instead of $MSE_0=\frac{1}{n}\sum_i (y_i - \overline{𝐲})^2$ we have $MSE_1=\text{var}(𝐲) - \frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)}$

Since the subtracted term is nonnegative, this already shows that the mean square error of model 1 is never larger than that of model 0
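The $MSE_1$ formula can be verified against the residuals of the fit. A sketch with hypothetical data, defining $\frac{1}{n}$-style ("population") variance and covariance to match the slides:

```r
# Population (1/n) versions of variance and covariance, as on the slides
varp <- function(v) mean((v - mean(v))^2)
covp <- function(u, v) mean((u - mean(u)) * (v - mean(v)))

# Made-up data standing in for the survey (hypothetical values)
x <- c(179, 168, 170, 162, 167, 174, 180, 162, 172, 181)
y <- c(67, 55, 74, 68, 58, 72, 68, 58, 55, 81)

fit <- lm(y ~ x)

# MSE_1 computed directly from the residuals...
mse1_direct  <- mean(resid(fit)^2)
# ...and via the formula var(y) - cov(x,y)^2 / var(x)
mse1_formula <- varp(y) - covp(x, y)^2 / varp(x)

all.equal(mse1_direct, mse1_formula)  # TRUE
```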

## Relative improvement

The variance in the new model is better, but how much? $\frac{MSE_0-MSE_1}{MSE_0}=\frac{\text{cov}^2(𝐱,𝐲)}{\text{var}(𝐱)\text{var}(𝐲)}$ This number represents the percentage of the original variance that is explained by the new model

The name of this number is $$R^2$$

## Correlation

The Pearson correlation coefficient between two variables is $r=\frac{\text{cov}(𝐱,𝐲)}{\text{sd}(𝐱)\text{sd}(𝐲)}$ so we have in this case that $R^2 = r^2$ This is valid for linear models with a single independent variable. It will not be valid for larger models
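For a single-predictor linear model, the $R^2$ reported by `lm()` is exactly the squared Pearson correlation. A quick check on hypothetical data:

```r
# Hypothetical data in place of the survey
x <- c(179, 168, 170, 162, 167, 174, 180, 162, 172, 181)
y <- c(67, 55, 74, 68, 58, 72, 68, 58, 55, 81)

fit <- lm(y ~ x)

# R^2 from summary() equals the squared Pearson correlation coefficient
all.equal(summary(fit)$r.squared, cor(x, y)^2)  # TRUE
```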

# The interesting part

## These are random values

Our results include a random component (the $$e_i$$)

Thus, the values of $$β_0$$ and $$β_1$$ are also random

But they are not far away from the “real” ones

More details in the next class

## Confidence intervals

We can get a 95% or 99% confidence interval for the values

confint(model_1, level=0.99)
                   0.5 %     99.5 %
(Intercept) -132.7312228 -35.284553
height         0.5967649   1.167549

If the confidence interval does not contain 0, then the real value is not 0

“The evidence shows that the coefficient is not 0”

## In the other case

If the confidence interval contains 0, then the real value may be 0

“We do not have evidence that the coefficient is not 0”

## p-values

summary(model_1)

Call:
lm(formula = weight ~ height, data = survey)

Residuals:
Min      1Q  Median      3Q     Max
-18.016  -7.225   0.216   7.396  35.630

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -84.0079    18.5438  -4.530 1.68e-05 ***
height        0.8822     0.1086   8.122 1.48e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.17 on 97 degrees of freedom
Multiple R-squared:  0.4048,    Adjusted R-squared:  0.3986
F-statistic: 65.96 on 1 and 97 DF,  p-value: 1.479e-12

## This is a Student’s t-test

We will see that the estimates of $$β_0$$ and $$β_1$$, divided by their standard errors, follow a Student’s t distribution

Therefore we can use a t-test to see if they are different from a fixed value

In this case we let the computer do it for us
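Even so, the t value and p-value printed by `summary()` can be reproduced by hand. A sketch with hypothetical data, testing the slope against $H_0: β_1 = 0$:

```r
# Hypothetical data standing in for the survey
x <- c(179, 168, 170, 162, 167, 174, 180, 162, 172, 181)
y <- c(67, 55, 74, 68, 58, 72, 68, 58, 55, 81)
fit <- lm(y ~ x)

# Pull the slope estimate and its standard error from the summary table
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]

# t statistic and two-sided p-value against H0: slope = 0
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df = fit$df.residual)

# Matches the Pr(>|t|) column printed by summary()
all.equal(p_val, coef(summary(fit))["x", "Pr(>|t|)"])  # TRUE
```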

## Confidence interval for the line

plot(weight ~ height, data=survey)
pred <- predict(model_1, interval = "confidence")
points(pred[,"fit"] ~ height, data=survey, pch=16, col="red")
points(pred[,"lwr"] ~ height, data=survey, pch=16, col="blue")
points(pred[,"upr"] ~ height, data=survey, pch=16, col="purple")

## Confidence interval for the prediction

plot(weight ~ height, data=survey)
pred <- predict(model_1, interval = "prediction", level=0.95)
points(pred[,"fit"] ~ height, data=survey, pch=16, col="red")
points(pred[,"lwr"] ~ height, data=survey, pch=16, col="blue")
points(pred[,"upr"] ~ height, data=survey, pch=16, col="purple")

# Sex

## Modeling weight knowing sex

plot(weight ~ sex, data=survey)

Sex is a factor, so we get a box plot

## Two groups

We still have $$n$$ individuals, but now we can split them in two groups

The groups are called “female” and “male”

There are $$n_0$$ females and $$n_1$$ males, so $$n=n_0+n_1.$$

## How to model a factor?

The easiest way is to use a number $s_i=\begin{cases}0\quad\text{if person }i\text{ is female}\\ 1\quad\text{if person }i\text{ is male}\end{cases}$

It is easy to see that $$\sum_i s_i = n_1$$

## How does it work

We now have $y_i = β_0 + β_1 s_i + e_i$ So for all the female individuals ($$s_i=0$$) we have $y_i = β_0 + e_i$ Taking averages (the $$e_i$$ average out to zero) we have $\frac{1}{n_0}\sum_{i\text{ female}}y_i = β_0$ so $$β_0$$ is the average weight of the females $β_0=\text{mean}(𝐲\text{ female})$

## Case of male

So for all the male individuals ($$s_i=1$$), we have $y_i = β_0 + β_1 + e_i$ Taking averages we have $\frac{1}{n_1}\sum_{i\text{ male}}y_i = β_0+ β_1$ In other words $β_0 + β_1=\text{mean}(𝐲\text{ male})$

## Interpretation of $$β$$

We have $β_0=\text{mean}(𝐲\text{ female})$ and $β_0 + β_1=\text{mean}(𝐲\text{ male})$ Replacing the value of $$β_0$$ we have $β_1=\text{mean}(𝐲\text{ male})-\text{mean}(𝐲\text{ female})$
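This interpretation of the coefficients can be checked numerically. A sketch with made-up weights and a sex factor (hypothetical values, not the survey data):

```r
# Hypothetical weights and a sex factor standing in for the survey
weight <- c(67, 55, 74, 68, 58, 72, 68, 58, 55, 81)
sex    <- factor(c("Male", "Female", "Male", "Female", "Female",
                   "Female", "Male", "Female", "Female", "Male"))

fit <- lm(weight ~ sex)
b <- coef(fit)

# (Intercept) is the female mean ("Female" is the baseline level)
all.equal(unname(b["(Intercept)"]),
          mean(weight[sex == "Female"]))  # TRUE

# sexMale is the male-minus-female difference of means
all.equal(unname(b["sexMale"]),
          mean(weight[sex == "Male"]) - mean(weight[sex == "Female"]))  # TRUE
```

R picks the baseline level alphabetically, which is why "Female" plays the role of $$s_i=0$$ here.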

## In practice

model_2 <- lm(weight ~ sex, data=survey)
coef(model_2)
(Intercept)     sexMale
59.57937    18.67063 
confint(model_2, level = 0.99)
               0.5 %   99.5 %
(Intercept) 56.41406 62.74467
sexMale     13.42158 23.91969

(Intercept) is the average female weight
sexMale is the average extra weight for males

## p-values

summary(model_2)

Call:
lm(formula = weight ~ sex, data = survey)

Residuals:
Min      1Q  Median      3Q     Max
-19.250  -5.579  -1.579   5.421  27.750

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   59.579      1.205  49.456  < 2e-16 ***
sexMale       18.671      1.998   9.346 3.47e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.562 on 97 degrees of freedom
Multiple R-squared:  0.4738,    Adjusted R-squared:  0.4684
F-statistic: 87.34 on 1 and 97 DF,  p-value: 3.47e-15

## Modeling weight knowing handedness

model_3 <- lm(weight ~ handedness, data=survey)
coef(model_3)
    (Intercept) handednessRight
67.0000000      -0.7022472 
confint(model_3, level = 0.99)
                    0.5 %   99.5 %
(Intercept)      56.04894 77.95106
handednessRight -12.25216 10.84767

What are the coefficients?

What can we conclude here?

## p-values

summary(model_3)

Call:
lm(formula = weight ~ handedness, data = survey)

Residuals:
Min      1Q  Median      3Q     Max
-23.798 -10.649  -1.298   8.351  39.702

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)      67.0000     4.1679   16.07   <2e-16 ***
handednessRight  -0.7022     4.3958   -0.16    0.873
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.18 on 97 degrees of freedom
Multiple R-squared:  0.000263,  Adjusted R-squared:  -0.01004
F-statistic: 0.02552 on 1 and 97 DF,  p-value: 0.8734