`lm` versus `lmFit`

We have seen two ways to use *linear models*

- To model a single variable (CT of one gene, number of cells, methylation level, etc.), we use plain R (`lm`)

- To model many genes at once (microarray or RNA-seq), we use the *limma* package (`lmFit`)

The math is the same, the code is different

`lm` versus `lmFit`

- Data for `lm` is a *data frame* with one column for each variable, for example: diet/stress, tissue, age

- Data for `lmFit` is a *matrix* with genes/miRNA in the rows, *experiments* in the columns, and the expression level in each cell
- We need an extra table describing the conditions of each experiment
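A minimal sketch of the two data layouts (all names and values below are invented for illustration):

```
library(limma)

# lm: a data frame with one row per sample, one column per variable
samples <- data.frame(
  expression = c(5.1, 4.8, 7.2, 6.9),                  # e.g. CT of one gene
  diet       = factor(c("ctrl", "ctrl", "fat", "fat"))
)
fit_lm <- lm(expression ~ diet, data = samples)

# lmFit: a genes x experiments matrix, plus a design matrix
# built from the extra table of conditions
expr <- matrix(rnorm(100 * 4), nrow = 100,
               dimnames = list(paste0("gene", 1:100),
                               paste0("sample", 1:4)))
design <- model.matrix(~ diet, data = samples)
fit_limma <- lmFit(expr, design)
```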

From `formula` to design matrix

The main difference between `lm` and `lmFit` is how to describe the model

- `lm` takes a *formula*
- `lmFit` takes a design *matrix*

If we have a `formula`, we can get the design matrix using `model.matrix`
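For example, a quick sketch with a three-level factor:

```
f <- factor(c("A", "A", "B", "B", "C", "C"))
model.matrix(~ f)      # with intercept: (Intercept), fB, fC
model.matrix(~ f + 0)  # without intercept: fA, fB, fC
```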

It is not hard to make your own design matrix if you know the math

\[y_i = β_0 + β_1 x_i + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1\\ ⋮ & ⋮ \\ 1 & x_i\\ ⋮ & ⋮ \\ 1 & x_n\\ \end{pmatrix} \begin{pmatrix}β_0\\β_1\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\]

The fitted model has the smallest value for \(∑_i e_i^2\)

\[y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + e_i\]

\[\begin{pmatrix}
y_1\\
⋮ \\
y_i\\
⋮ \\
y_n
\end{pmatrix} = \begin{pmatrix}
1 & x_{1,1} & x_{1,2}\\
⋮ & ⋮ & ⋮ \\
1 & x_{i,1} & x_{i,2}\\
⋮ & ⋮ & ⋮ \\
1 & x_{n,1} & x_{n,2}\\
\end{pmatrix}
\begin{pmatrix}β_0\\β_1\\β_2\end{pmatrix} +
\begin{pmatrix}
e_1\\
⋮ \\
e_i\\
⋮ \\
e_n
\end{pmatrix}\]

\(β_0\) (*intercept*) is the value of \(y_i\) when all \(x_{i,j}=0\)

\[y_i = β_0 + ∑_{j=1}^k β_j x_{i,j} + e_i\]

\[\begin{pmatrix}
y_1\\
⋮ \\
y_i\\
⋮ \\
y_n
\end{pmatrix} = \begin{pmatrix}
1 & x_{1,1} &\cdots& x_{1,j} &\cdots& x_{1,k}\\
⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\
1 & x_{i,1} &\cdots& x_{i,j} &\cdots& x_{i,k}\\
⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\
1 & x_{n,1} &\cdots& x_{n,j} &\cdots& x_{n,k}\\
\end{pmatrix}
\begin{pmatrix}β_0\\β_1\\⋮\\β_j\\⋮\\β_k\end{pmatrix} +
\begin{pmatrix}
e_1\\
⋮ \\
e_i\\
⋮ \\
e_n
\end{pmatrix}\]

\(β_1,…,β_k\) are the *effects* of each independent variable

\[y_i = β_1 x_{i,1} + β_2 x_{i,2} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2}\\ ⋮ & ⋮ \\ x_{i,1} & x_{i,2}\\ ⋮ & ⋮ \\ x_{n,1} & x_{n,2}\\ \end{pmatrix} \begin{pmatrix}β_1\\β_2\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\]

This is useful when \(x_{i,j}\) represents a factor, but may not be realistic when \(x_{i,j}\) is a number

\[y_i = ∑_{j=1}^k β_j x_{i,j} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} &\cdots& x_{1,j} &\cdots& x_{1,k}\\ ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ x_{i,1} &\cdots& x_{i,j} &\cdots& x_{i,k}\\ ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ x_{n,1} &\cdots& x_{n,j} &\cdots& x_{n,k}\\ \end{pmatrix} \begin{pmatrix}β_1\\⋮\\β_j\\⋮\\β_k\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\]

Let’s say that \(f\) is a factor with \(k\) levels \((f_i ∈\{l_1,…,l_k\})\)

We represent it with \(k\) columns

\[x_{i,j} = \begin{cases}1\text{ if }f_i\text{ is equal to }l_j\\ 0\text{ if }f_i\text{ is not equal to }l_j\end{cases}\]

\[y_i = β_1 x_{i,1} + β_2 x_{i,2} + e_i\]

\[\begin{pmatrix}
y_1\\
⋮ \\
y_n
\end{pmatrix} = \begin{pmatrix}
x_{1,1} & x_{1,2}\\
⋮ & ⋮ \\
x_{n,1} & x_{n,2}\\
\end{pmatrix}
\begin{pmatrix}β_1\\β_2\end{pmatrix} +
\begin{pmatrix}
e_1\\
⋮ \\
e_n
\end{pmatrix}\]

If, for each *female*, we have \(x_{i,1}=1\) and \(x_{i,2}=0,\) then \(β_1\) will be the *mean* of \(y_i\) for all *females*
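A small sketch checking this claim (the data are invented):

```
sex <- factor(c("female", "female", "male", "male"))
y   <- c(2.0, 2.4, 3.1, 2.9)
coef(lm(y ~ sex + 0))  # sexfemale = 2.2, sexmale = 3.0
tapply(y, sex, mean)   # the same group means
```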

Let’s say that `x` is a numeric variable

- `y ~ x` is \(y_i=β_0 + β_1 x_i + e_i\)
- `y ~ x + 0` is \(y_i=β_1 x_i + e_i\)
- `y ~ x1 + x2 + x3` is \(y_i=β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + e_i\)
- `y ~ x1 + x2 + x3 + 0` is \(y_i=β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + e_i\)

Let’s say that `f` is a factor with \(k\) levels \((f_i ∈\{l_1,…,l_k\})\)

`y ~ f + 0` is \(y_i=\sum_{j=1}^k β_j x_{i,j} +e_i,\) where \[x_{i,j} = \begin{cases}1\text{ if }f_i\text{ is equal to }l_j\\
0\text{ if }f_i\text{ is not equal to }l_j\end{cases}\]

```
  fA fB fC
1  1  0  0
2  0  1  0
3  0  0  1
```

With an intercept, the *first* level is not encoded

`y ~ f` is \(y_i=β_0 + \sum_{j=2}^k β_j x_{i,j} +e_i\)

```
  (Intercept) fB fC
1           1  0  0
2           1  1  0
3           1  0  1
```

It is critical to choose the *first* level well: it is the *baseline*

\[y_i=β_0 + \sum_{j=2}^k β_j x_{i,j} +e_i\]

- \(β_0\) is the *mean* of \(y_i\) for all cases where \(f_i\) is \(l_1\) \[β_0=\text{mean}(y_i \mid f_i=l_1)\]
- \(β_j\) is the *difference of means* between \(l_j\) and \(l_1\) \[β_j=\text{mean}(y_i \mid f_i=l_j)-\text{mean}(y_i \mid f_i=l_1)\]
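A quick sketch verifying both interpretations (the data are invented):

```
f <- factor(rep(c("A", "B", "C"), each = 3))
y <- c(1.1, 0.9, 1.0, 2.2, 2.0, 2.1, 3.9, 4.1, 4.0)
coef(lm(y ~ f))                        # (Intercept) = 1.0, fB = 1.1, fC = 3.0
mean(y[f == "A"])                      # β0: the baseline mean
mean(y[f == "B"]) - mean(y[f == "A"])  # βB: difference from the baseline
```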

What if we are not interested in `C-A`?

What if we want `C-B` and `B-A`?

(maybe each level is a different time, and we want to see the daily change)

A model with intercept will not work for us

Instead we use a model *without* intercept, and *contrasts*

In *limma* there is a function `makeContrasts(…, levels)`

We describe what we want and what all the levels are

It encodes them in a matrix

```
      Contrasts
Levels C - B B - A
     A     0    -1
     B    -1     1
     C     1     0
```
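A sketch of the call that produces the matrix above:

```
library(limma)
contr <- makeContrasts(C - B, B - A, levels = c("A", "B", "C"))
contr  # one column per contrast, one row per level
```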

Contrasts in `lm`

The function `lm` can take a `contrasts=` option, but it is not easy to use

- It only works *with* an intercept
- If there are several *factors*, they are handled independently

(Maybe there is an easier way that I don’t know)

Instead, we can just do the math

Start by declaring what you want to get

- \(β_1 = C - B\)
- \(β_2 = B - A\)

We need the same number of equations and variables

- \(β_3 = A\)

Then we need to solve for \(A,B,C\)

\[\begin{aligned} β_1 &= C - B\\ β_2 &= B - A\\ β_3 &= A\end{aligned}\]

\[\begin{pmatrix} β_1\\ β_2\\ β_3\end{pmatrix}= \begin{pmatrix} 0 & -1 & 1\\ -1 & 1 & 0\\ 1 & 0 & 0\end{pmatrix} \begin{pmatrix}A\\B\\C\end{pmatrix}\]

\[\begin{pmatrix}A\\B\\C\end{pmatrix} = \begin{pmatrix} 0 & -1 & 1\\ -1 & 1 & 0\\ 1 & 0 & 0\end{pmatrix}^{-1} \begin{pmatrix}β_1\\β_2\\β_3\end{pmatrix}\]

Make a square `contrasts` matrix

(`makeContrasts` may help, but it is not enough)

```
       A  B  C
C - B  0 -1  1
B - A -1  1  0
A      1  0  0
```

One row for each equation

(warning: `makeContrasts` puts each contrast in a *column*, not a row)

Now we can solve the equation system
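In base R this is one call to `solve`; a sketch, with the same row and column names as above:

```
M <- matrix(c( 0, -1,  1,
              -1,  1,  0,
               1,  0,  0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("C - B", "B - A", "A"),  # one row per equation
                            c("A", "B", "C")))
solve(M)  # the inverse, shown below
```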

```
  C - B B - A A
A     0     0 1
B     0     1 1
C     1     1 1
```

This tells us how to change the design matrix to get what we want

First, make the design matrix

Since there is no intercept, each column is a level
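A sketch for nine samples, three per level (the factor is invented):

```
f <- factor(rep(c("A", "B", "C"), each = 3))
design <- model.matrix(~ f + 0)
colnames(design) <- levels(f)  # rename "fA", "fB", "fC" to "A", "B", "C"
```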

Now, update the design by multiplying it by the contrasts

Here `%*%` is *matrix multiplication* (i.e. rows by columns)
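Continuing the sketch, reusing `design` and the square matrix `M` from above:

```
design2 <- design %*% solve(M)  # 9 x 3, shown below
```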

design matrix

```
  A B C
1 1 0 0
2 1 0 0
3 1 0 0
4 0 1 0
5 0 1 0
6 0 1 0
7 0 0 1
8 0 0 1
9 0 0 1
```

design with contrasts

```
  C - B B - A A
1     0     0 1
2     0     0 1
3     0     0 1
4     0     1 1
5     0     1 1
6     0     1 1
7     1     1 1
8     1     1 1
9     1     1 1
```

Finally we fit our model to the data

Here the formula contains `.`, which means *all variables in the data*

We also have `+0`, meaning *no intercept*
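A sketch of that final fit; the response `y` is invented, `design2` comes from the sketch above, and `data.frame` sanitizes the contrast names:

```
y <- rnorm(9) + rep(c(1, 2, 4), each = 3)  # made-up expression values
d <- data.frame(y, design2)                # "C - B" becomes "C...B", etc.
fit <- lm(y ~ . + 0, data = d)             # `.` = all variables, +0 = no intercept
coef(fit)                                  # estimates of C-B, B-A, and A
```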

All this will be useless if we cannot get *confidence intervals*

The key part is to evaluate the *standard errors*, which depend on the *variance*

Remember that \[var(λ X_j+ ρ X_k) = λ^2var(X_j) + ρ^2var(X_k) + 2 λ ρ cov(X_j,X_k)\] and \[cov(X_j,X_j) = var(X_j)\]

The matrix \(C\) with entries \(cov(X_j,X_k)\) has the variances on its diagonal

We find that \[var(λ X_j+ ρ X_k) = (\lambda\quad\rho)⋅ C⋅ \begin{pmatrix}\lambda\\\rho\end{pmatrix}\]

In other words, the contrast matrix can also be used to update the covariance matrix
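In matrix form, if each row of \(M\) holds the weights (the \(λ\) and \(ρ\)) of one contrast, then the standard identity

\[cov(M β) = M ⋅ C ⋅ M^{\top}\]

gives the covariance matrix of the contrasts (in *limma*, the function `contrasts.fit` applies this update for you)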

Here we have the differential expression of some genes

| Replicate 1 | Replicate 2 | Replicate 3 |
|---|---|---|
| -0.6356720 | 0.5445543 | 0.5056405 |
| 0.9198619 | -0.6887110 | -0.2273942 |
| 1.1870043 | 1.0710029 | 1.3180957 |
| 0.1376069 | 1.7086511 | 1.1611300 |
| 0.8551033 | -1.0060231 | 0.4222059 |

There are three *biological* replicates for each gene

The values of the first gene are

`[1] -0.6356720 0.5445543 0.5056405`

The mean is

`[1] 0.1381743`

The standard deviation is

`[1] 0.6704529`

We have \(n=3\) values, and we are estimating 1 parameter (the mean)

Thus, we have \(3-1=2\) degrees of freedom

The critical value of the t distribution for 95% confidence and 2 degrees of freedom is

`[1] 4.302653`

Thus, the 95%-confidence interval for the mean expression, \(\bar{y} ± t ⋅ s/\sqrt{n},\) is

`[1] -1.527323 1.803672`

The interval contains 0, so it seems that the gene **is not** differentially expressed
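The same computation as a short R sketch:

```
x  <- c(-0.6356720, 0.5445543, 0.5056405)
n  <- length(x)
tc <- qt(0.975, df = n - 1)                # 4.302653
mean(x) + c(-1, 1) * tc * sd(x) / sqrt(n)  # -1.527323  1.803672
```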

The values of the third gene are

`[1] 1.187004 1.071003 1.318096`

The mean is

`[1] 1.192034`

The standard deviation is

`[1] 0.1236232`

The critical value of the t distribution for 95% confidence and 2 degrees of freedom is

`[1] 4.302653`

Thus, the 95%-confidence interval for the mean expression is

`[1] 0.8849373 1.4991313`

The interval **does not** contain 0, so it seems that the gene **is** differentially expressed

The critical value of the t distribution for 99% confidence and 2 degrees of freedom is

`[1] 9.924843`

Thus, the 99%-confidence interval for the mean expression is

`[1] 0.4836598 1.9004088`

The interval is wider than before, but it still **does not** contain 0, so even at 99% confidence the gene seems to be differentially expressed