# Systems Biology

## lm versus lmFit

We have seen two ways to use linear models

1. To model a single variable (CT of one gene, number of cells, methylation level, etc.) we use just R
fitted.model <- lm(formula, data)
1. To model many genes in an array or RNAseq, we use the limma package
fitted.model <- lmFit(design.matrix, data)

The math is the same, the code is different

## lm versus lmFit

• Data for lm is a data frame with one colum for each variable
• for example: diet/stress, tissue, age
• Data for lmFit is a matrix with genes/miRNA in the rows, experiments in the columns, and the expression level in each cell
• We need an extra table describing the conditions of each experiment

## From formula to design.matrix

The main difference between lm and lmFit is how to describe the model

• formula
• matrix

If we have formula, we can get the design matrix using

design.matrix <- model.matrix(formula)

It is not hard to make your own design matrix if you know the math

## Math to matrix

$y_i = β_0 + β_1 x_i + e_i$

$\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1\\ ⋮ & ⋮ \\ 1 & x_i\\ ⋮ & ⋮ \\ 1 & x_n\\ \end{pmatrix} \begin{pmatrix}β_0\\β_1\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}$

The fitted model has the smallest value for $$∑_i e_i^2$$

## Two independent variables

$y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + e_i$

$\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & x_{1,2}\\ ⋮ & ⋮ & ⋮ \\ 1 & x_{i,1} & x_{i,2}\\ ⋮ & ⋮ & ⋮ \\ 1 & x_{n,1} & x_{n,2}\\ \end{pmatrix} \begin{pmatrix}β_0\\β_1\\β_2\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}$ $$β_0$$ (intercept) is the value of $$y_i$$ when all $$x_{i,j}=0$$

## General case

$y_i = β_0 + ∑_{j=1}^k β_j x_{i,j} + e_i$

$\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} &\cdots& x_{1,j} &\cdots& x_{1,k}\\ ⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ 1 & x_{i,1} &\cdots& x_{i,j} &\cdots& x_{i,k}\\ ⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ 1 & x_{n,1} &\cdots& x_{n,j} &\cdots& x_{n,k}\\ \end{pmatrix} \begin{pmatrix}β_0\\β_1\\⋮\\β_j\\⋮\\β_k\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}$ $$β_1,…,β_k$$ are the effects of each independent variables

## Model without intercept

$y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + e_i$

$\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2}\\ ⋮ & ⋮ \\ x_{i,1} & x_{i,2}\\ ⋮ & ⋮ \\ x_{n,1} & x_{n,2}\\ \end{pmatrix} \begin{pmatrix}β_1\\β_2\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}$ This is useful when $$x_{i,j}$$ represent a factor, but may not be realistic when $$x_{i,j}$$ is a number

## General case without intercept

$y_i = ∑_{j=1}^k β_j x_{i,j} + e_i$

$\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,j} & x_{1,k}\\ ⋮ & ⋮ & ⋮ \\ x_{i,1} & x_{i,j} & x_{i,k}\\ ⋮ & ⋮ & ⋮ \\ x_{n,1} & x_{n,j} & x_{n,k}\\ \end{pmatrix} \begin{pmatrix}β_1\\⋮\\β_j\\⋮\\β_k\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}$

## Factors

Let’s say that $$f$$ is a factor with $$k$$ levels $$(f_i ∈\{l_1,…,l_k\})$$

We represent it with $$k$$ columns

$x_{i,j} = \begin{cases}1\text{ if }f_i\text{ is equal to }l_j\\ 0\text{ if }f_i\text{ is not equal to }l_j\end{cases}$

## Example: sex

$y_i = β_1 x_{i,1} + β_2 x_{i,2} + e_i$

$\begin{pmatrix} y_1\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2}\\ ⋮ & ⋮ \\ x_{n,1} & x_{n,2}\\ \end{pmatrix} \begin{pmatrix}β_1\\β_2\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_n \end{pmatrix}$ If, for each female we have $$x_{i,1}=1$$ and $$x_{i,2}=0,$$
then $$β_1$$ will be the mean of $$y_i$$ for all female

## Formula to math

Let’s say that x is a numeric variable

• y ~ x is $$y_i=β_0 + β_1 x_i + e_i$$

• y ~ x+0 is $$y_i=β_1 x_i + e_i$$

• y ~ x1 + x2 + x3 is $$y_i=β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + e_i$$

• y ~ x1 + x2 + x3 + 0 is $$y_i=β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + e_i$$

## Formulas with factors

Let’s say that f is a factor with $$k$$ levels $$(f_i ∈\{l_1,…,l_k\})$$

y ~ f+0 is $$y_i=\sum_{j=1}^k β_j x_{i,j} +e_i,$$ where $x_{i,j} = \begin{cases}1\text{ if }f_i\text{ is equal to }l_j\\ 0\text{ if }f_i\text{ is not equal to }l_j\end{cases}$

f <- factor(c("A","B","C"))
model.matrix(~f+0) |> as.data.frame()
  fA fB fC
1  1  0  0
2  0  1  0
3  0  0  1

## Formulas with factors and intercept

Now the first level is not encoded

y ~ x is $$y_i=β_0 + \sum_{j=2}^k β_j x_{i,j} +e_i$$

f <- factor(c("A","B","C"))
model.matrix(~f) |> as.data.frame()
  (Intercept) fB fC
1           1  0  0
2           1  1  0
3           1  0  1

It is critical to choose well the first level
It is the baseline

## Interpretation

$y_i=β_0 + \sum_{j=2}^k β_j x_{i,j} +e_i$

• $$β_0$$ is the mean of $$y_i$$ for all cases $$f_i$$ is $$l_1$$ $β_0=\text{mean}(y_i \vert f_i=l_1)$
• $$β_j$$ is the difference of means between $$l_j$$ and $$l_1$$ $β_j=\text{mean}(y_i | f_i=l_j)-\text{mean}(y_i | f_i=l_1)$

# Contrasts

## An easiest way to handle factors

What if we are not interested in C-A?
What if we want C-B and B-A?
(maybe each level is a different time, and we want to see the daily change)

A model with intercept will not work for us

Instead we use a model without intercept, and contrasts

## Contrast describe what we want

In limma there is a function makeContrasts(…, levels).
We describe what we want, and what are all the levels
It encodes them in a matrix

makeContrasts(C-B, B-A, levels=f)
      Contrasts
Levels C - B B - A
A     0    -1
B    -1     1
C     1     0

## Using contrasts in Limma

design.matrix <- model.matrix(~ f + 0)
fit <- lmFit(data, design.matrix)

contrasts <- makeContrasts(C-B, B-A, levels=design.matrix)
fit.contrasts <- contrasts.fit(fit, contrasts)
eb <- eBayes(fit.contrasts)

## Using contrast in plain lm

The function lm can take a contrast= option, but it is not easy to use

• It only works with intercept
• If there are several factors, they are handled independently

(Maybe there is an easier way that I don’t know)

Instead, we can just do the math

## Making contrasts manually

Start by declaring what you want to get

• $$β_1 = C - B$$
• $$β_2 = B - A$$

We need the same number of equations and variables

• $$β_3 = A$$

then we need to solve for $$A,B,C$$

## As a matrix

\begin{aligned} β_1 &= C - B\\ β_2 &= B - A\\ β_3 &= A\end{aligned}

$\begin{pmatrix} β_1\\ β_2\\ β_3\end{pmatrix}= \begin{pmatrix} 0 & -1 & 1\\ -1 & 1 & 0\\ 1 & 0 & 0\end{pmatrix} \begin{pmatrix}A\\B\\C\end{pmatrix}$

## Solving for $$A,B,C$$

$\begin{pmatrix}A\\B\\C\end{pmatrix} = \begin{pmatrix} 0 & -1 & 1\\ -1 & 1 & 0\\ 1 & 0 & 0\end{pmatrix}^{-1} \begin{pmatrix}β_1\\β_2\\β_3\end{pmatrix}$

## Doing math in R

Make a square contrasts matrix
(makeContrasts may help, but it is not enough)

contrasts
       A  B C
C - B  0 -1 1
B - A -1  1 0
A      1  0 0

One row for each equation
(warning: this is different in makeContrasts)

## Solving in R

Now we can solve the equation system

solve(contrasts)
  C - B B - A A
A     0     0 1
B     0     1 1
C     1     1 1

This tell us how to change the design matrix to get what we want

## Putting it all together

First, make the design matrix

design.matrix <- model.matrix(~ f + 0)

Since there are no intercepts, each column is a level.
Now, update the design multiplying by the contrasts

design.with.contrasts <- design.matrix %*% solve(contrasts)

Here %*% is matrix multiplication (i.e. rows by columns)

## Before and after

design matrix

  A B C
1 1 0 0
2 1 0 0
3 1 0 0
4 0 1 0
5 0 1 0
6 0 1 0
7 0 0 1
8 0 0 1
9 0 0 1

design with contrasts

  C - B B - A A
1     0     0 1
2     0     0 1
3     0     0 1
4     0     1 1
5     0     1 1
6     0     1 1
7     1     1 1
8     1     1 1
9     1     1 1

## Fitting the linear model

Finally we fit our model to the data

lm(y ~ . + 0, data=design.with.contrasts)

Here the formula contains ., which means all variables in the data

We also have +0 meaning no intercept

## Confidence intervals and $$p$$-values

All this will be useless if we cannot get confidence intervals

The key part is to evaluate the standard errors, which depend on the variance

Remember that $var(λ X_j+ ρ X_k) = λ^2var(X_j) + ρ^2var(X_k) + 2 λ ρ cov(X_j,X_k)$ and $cov(X_j,X_j) = var(X_j)$

## Covariance matrix

The matrix $$C = cov(X_j,X_k)$$ has variances in the diagonal

We find that $var(λ X_j+ ρ X_k) = (\lambda\quad\rho)⋅ C⋅ \begin{pmatrix}\lambda\\\rho\end{pmatrix}$

In other words, the contrast matrix can also be used to update the covariance matrix

# Other details

## Log ratios

y1 <- sample(1:100, size=1000, replace=TRUE)
y2 <- sample(1:100, size=1000, replace=TRUE)
hist(y2/y1, nclass=30)
abline(v=1, col="red", lwd=3)

## Log ratios

hist(log(y2) - log(y1), nclass=30)
abline(v=0, col="red", lwd=3)

# Power

## Can we see the change

Here we have the differential expression of some genes

Replica 1 Replica 2 Replica 3
-0.6356720 0.5445543 0.5056405
0.9198619 -0.6887110 -0.2273942
1.1870043 1.0710029 1.3180957
0.1376069 1.7086511 1.1611300
0.8551033 -1.0060231 0.4222059

There are three biological replicas for each gene

## Case 1

The values of first gene are

[1] -0.6356720  0.5445543  0.5056405

The mean is

[1] 0.1381743

The standard deviation is

[1] 0.6704529

## Interval for Case 1

We have 𝑛=3 values, and we are estimating 1 value (the mean)

Thus, we have 3-1=2 degrees of freedom

The t distribution for 95% and 2 degrees of freedom is

[1] 4.302653

Thus, the 95%-confidence interval for the expression is

[1] -2.746552  3.022900

The interval contains 0, so it seems that the gene is not differentially expressed

## Case 2

The values of first gene are

[1] 1.187004 1.071003 1.318096

The mean is

[1] 1.192034

The standard deviation is

[1] 0.1236232

## Interval for Case 2

The t distribution for 95% and 2 degrees of freedom is

[1] 4.302653

Thus, the 95%-confidence interval for the expression is

[1] 0.6601268 1.7239418

The interval does not contain 0, so it seems that the gene is differentially expressed

## Redoing at 99% confidence

The t distribution for 99% and 2 degrees of freedom is

[1] 9.924843

Thus, the 99%-confidence interval for the expression is

[1] -0.03490616  2.41897477

Now the interval contains 0, so it seems that the gene is not differentially expressed