Class 9: Linear models… again

Systems Biology

Andrés Aravena, PhD

October 26, 2021

`lm` versus `lmFit`

We have seen two ways to use linear models

To model a single variable (CT of one gene, number of cells, methylation level, etc.) we use just R

fitted.model <- lm(formula, data)

To model many genes in an array or RNAseq, we use the limma package

fitted.model <- lmFit(design.matrix, data)

The math is the same, the code is different

`lm` versus `lmFit`

Data for lm is a data frame with one colum for each variable
- for example: diet/stress, tissue, age
Data for lmFit is a matrix with genes/miRNA in the rows, experiments in the columns, and the expression level in each cell
- We need an extra table describing the conditions of each experiment

From `formula` to `design.matrix`

The main difference between lm and lmFit is how to describe the model

formula
matrix

If we have formula, we can get the design matrix using

design.matrix <- model.matrix(formula)

It is not hard to make your own design matrix if you know the math

Math to matrix

\[y_i = β_0 + β_1 x_i + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1\\ ⋮ & ⋮ \\ 1 & x_i\\ ⋮ & ⋮ \\ 1 & x_n\\ \end{pmatrix} \begin{pmatrix}β_0\\β_1\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\]

The fitted model has the smallest value for \(∑_i e_i^2\)

Two independent variables

\[y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & x_{1,2}\\ ⋮ & ⋮ & ⋮ \\ 1 & x_{i,1} & x_{i,2}\\ ⋮ & ⋮ & ⋮ \\ 1 & x_{n,1} & x_{n,2}\\ \end{pmatrix} \begin{pmatrix}β_0\\β_1\\β_2\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\] \(β_0\) (intercept) is the value of \(y_i\) when all \(x_{i,j}=0\)

General case

\[y_i = β_0 + ∑_{j=1}^k β_j x_{i,j} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} &\cdots& x_{1,j} &\cdots& x_{1,k}\\ ⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ 1 & x_{i,1} &\cdots& x_{i,j} &\cdots& x_{i,k}\\ ⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ 1 & x_{n,1} &\cdots& x_{n,j} &\cdots& x_{n,k}\\ \end{pmatrix} \begin{pmatrix}β_0\\β_1\\⋮\\β_j\\⋮\\β_k\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\] \(β_1,…,β_k\) are the effects of each independent variables

Model without intercept

\[y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2}\\ ⋮ & ⋮ \\ x_{i,1} & x_{i,2}\\ ⋮ & ⋮ \\ x_{n,1} & x_{n,2}\\ \end{pmatrix} \begin{pmatrix}β_1\\β_2\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\] This is useful when \(x_{i,j}\) represent a factor, but may not be realistic when \(x_{i,j}\) is a number

General case without intercept

\[y_i = ∑_{j=1}^k β_j x_{i,j} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,j} & x_{1,k}\\ ⋮ & ⋮ & ⋮ \\ x_{i,1} & x_{i,j} & x_{i,k}\\ ⋮ & ⋮ & ⋮ \\ x_{n,1} & x_{n,j} & x_{n,k}\\ \end{pmatrix} \begin{pmatrix}β_1\\⋮\\β_j\\⋮\\β_k\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\]

Factors

Let’s say that \(f\) is a factor with \(k\) levels \((f_i ∈\{l_1,…,l_k\})\)

We represent it with \(k\) columns

\[x_{i,j} = \begin{cases}1\text{ if }f_i\text{ is equal to }l_j\\ 0\text{ if }f_i\text{ is not equal to }l_j\end{cases}\]

Example: sex

\[y_i = β_1 x_{i,1} + β_2 x_{i,2} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2}\\ ⋮ & ⋮ \\ x_{n,1} & x_{n,2}\\ \end{pmatrix} \begin{pmatrix}β_1\\β_2\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_n \end{pmatrix}\] If, for each female we have \(x_{i,1}=1\) and \(x_{i,2}=0,\)
then \(β_1\) will be the mean of \(y_i\) for all female

Formula to math

Let’s say that x is a numeric variable

y ~ x is \(y_i=β_0 + β_1 x_i + e_i\)
y ~ x+0 is \(y_i=β_1 x_i + e_i\)
y ~ x1 + x2 + x3 is \(y_i=β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + e_i\)
y ~ x1 + x2 + x3 + 0 is \(y_i=β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + e_i\)

Formulas with factors

Let’s say that f is a factor with \(k\) levels \((f_i ∈\{l_1,…,l_k\})\)

y ~ f+0 is \(y_i=\sum_{j=1}^k β_j x_{i,j} +e_i,\) where \[x_{i,j} = \begin{cases}1\text{ if }f_i\text{ is equal to }l_j\\ 0\text{ if }f_i\text{ is not equal to }l_j\end{cases}\]

f <- factor(c("A","B","C"))
model.matrix(~f+0) |> as.data.frame()

Formulas with factors and intercept

Now the first level is not encoded

y ~ x is \(y_i=β_0 + \sum_{j=2}^k β_j x_{i,j} +e_i\)

f <- factor(c("A","B","C"))
model.matrix(~f) |> as.data.frame()

  (Intercept) fB fC
1           1  0  0
2           1  1  0
3           1  0  1

It is critical to choose well the first level
It is the baseline

Interpretation

\[y_i=β_0 + \sum_{j=2}^k β_j x_{i,j} +e_i\]

\(β_0\) is the mean of \(y_i\) for all cases \(f_i\) is \(l_1\) \[β_0=\text{mean}(y_i \vert f_i=l_1)\]
\(β_j\) is the difference of means between \(l_j\) and \(l_1\) \[β_j=\text{mean}(y_i | f_i=l_j)-\text{mean}(y_i | f_i=l_1)\]

Contrasts

An easiest way to handle factors

What if we are not interested in C-A?
What if we want C-B and B-A?
(maybe each level is a different time, and we want to see the daily change)

A model with intercept will not work for us

Instead we use a model without intercept, and contrasts

Contrast describe what we want

In limma there is a function makeContrasts(…, levels).
We describe what we want, and what are all the levels
It encodes them in a matrix

makeContrasts(C-B, B-A, levels=f)

      Contrasts
Levels C - B B - A
     A     0    -1
     B    -1     1
     C     1     0

Using contrasts in Limma

design.matrix <- model.matrix(~ f + 0)
fit <- lmFit(data, design.matrix)

contrasts <- makeContrasts(C-B, B-A, levels=design.matrix)
fit.contrasts <- contrasts.fit(fit, contrasts)
eb <- eBayes(fit.contrasts)

Using contrast in plain `lm`

The function lm can take a contrast= option, but it is not easy to use

It only works with intercept
If there are several factors, they are handled independently

(Maybe there is an easier way that I don’t know)

Instead, we can just do the math

Making contrasts manually

Start by declaring what you want to get

\(β_1 = C - B\)
\(β_2 = B - A\)

We need the same number of equations and variables

\(β_3 = A\)

then we need to solve for \(A,B,C\)

As a matrix

\[\begin{aligned} β_1 &= C - B\\ β_2 &= B - A\\ β_3 &= A\end{aligned}\]

\[\begin{pmatrix} β_1\\ β_2\\ β_3\end{pmatrix}= \begin{pmatrix} 0 & -1 & 1\\ -1 & 1 & 0\\ 1 & 0 & 0\end{pmatrix} \begin{pmatrix}A\\B\\C\end{pmatrix}\]

Solving for \(A,B,C\)

\[\begin{pmatrix}A\\B\\C\end{pmatrix} = \begin{pmatrix} 0 & -1 & 1\\ -1 & 1 & 0\\ 1 & 0 & 0\end{pmatrix}^{-1} \begin{pmatrix}β_1\\β_2\\β_3\end{pmatrix}\]

Doing math in R

Make a square contrasts matrix
(makeContrasts may help, but it is not enough)

contrasts

       A  B C
C - B  0 -1 1
B - A -1  1 0
A      1  0 0

One row for each equation
(warning: this is different in makeContrasts)

Solving in R

Now we can solve the equation system

solve(contrasts)

  C - B B - A A
A     0     0 1
B     0     1 1
C     1     1 1

This tell us how to change the design matrix to get what we want

Putting it all together

First, make the design matrix

design.matrix <- model.matrix(~ f + 0)

Since there are no intercepts, each column is a level.
Now, update the design multiplying by the contrasts

design.with.contrasts <- design.matrix %*% solve(contrasts)

Here %*% is matrix multiplication (i.e. rows by columns)

Before and after

design matrix

design with contrasts

  C - B B - A A
1     0     0 1
2     0     0 1
3     0     0 1
4     0     1 1
5     0     1 1
6     0     1 1
7     1     1 1
8     1     1 1
9     1     1 1

Fitting the linear model

Finally we fit our model to the data

lm(y ~ . + 0, data=design.with.contrasts)

Here the formula contains ., which means all variables in the data

We also have +0 meaning no intercept

Confidence intervals and \(p\)-values

All this will be useless if we cannot get confidence intervals

The key part is to evaluate the standard errors, which depend on the variance

Remember that \[var(λ X_j+ ρ X_k) = λ^2var(X_j) + ρ^2var(X_k) + 2 λ ρ cov(X_j,X_k)\] and \[cov(X_j,X_j) = var(X_j)\]

Covariance matrix

The matrix \(C = cov(X_j,X_k)\) has variances in the diagonal

We find that \[var(λ X_j+ ρ X_k) = (\lambda\quad\rho)⋅ C⋅ \begin{pmatrix}\lambda\\\rho\end{pmatrix}\]

In other words, the contrast matrix can also be used to update the covariance matrix

Other details

Log ratios

y1 <- sample(1:100, size=1000, replace=TRUE)
y2 <- sample(1:100, size=1000, replace=TRUE)
hist(y2/y1, nclass=30)
abline(v=1, col="red", lwd=3)

Log ratios

hist(log(y2) - log(y1), nclass=30)
abline(v=0, col="red", lwd=3)

Power

Can we see the change

Here we have the differential expression of some genes

Replica 1	Replica 2	Replica 3
-0.6356720	0.5445543	0.5056405
0.9198619	-0.6887110	-0.2273942
1.1870043	1.0710029	1.3180957
0.1376069	1.7086511	1.1611300
0.8551033	-1.0060231	0.4222059

There are three biological replicas for each gene

Case 1

The values of first gene are

[1] -0.6356720  0.5445543  0.5056405

The mean is

[1] 0.1381743

The standard deviation is

[1] 0.6704529

Interval for Case 1

We have 𝑛=3 values, and we are estimating 1 value (the mean)

Thus, we have 3-1=2 degrees of freedom

The t distribution for 95% and 2 degrees of freedom is

[1] 4.302653

Thus, the 95%-confidence interval for the expression is

[1] -2.746552  3.022900

The interval contains 0, so it seems that the gene is not differentially expressed

Case 2

The values of first gene are

[1] 1.187004 1.071003 1.318096

The mean is

[1] 1.192034

The standard deviation is

[1] 0.1236232

Interval for Case 2

The t distribution for 95% and 2 degrees of freedom is

[1] 4.302653

Thus, the 95%-confidence interval for the expression is

[1] 0.6601268 1.7239418

The interval does not contain 0, so it seems that the gene is differentially expressed

Redoing at 99% confidence

The t distribution for 99% and 2 degrees of freedom is

[1] 9.924843

Thus, the 99%-confidence interval for the expression is

[1] -0.03490616  2.41897477

Now the interval contains 0, so it seems that the gene is not differentially expressed

Class 9: Linear models… again

Systems Biology

Andrés Aravena, PhD

October 26, 2021

lm versus lmFit

lm versus lmFit

From formula to design.matrix

Math to matrix

Two independent variables

General case

Model without intercept

General case without intercept

Factors

Example: sex

Formula to math

Formulas with factors

Formulas with factors and intercept

Interpretation

Contrasts

An easiest way to handle factors

Contrast describe what we want

Using contrasts in Limma

Using contrast in plain lm

Making contrasts manually

As a matrix

Solving for \(A,B,C\)

Doing math in R

Solving in R

Putting it all together

Before and after

Fitting the linear model

Confidence intervals and \(p\)-values

Covariance matrix

Other details

Log ratios

Log ratios

Power

Can we see the change

Case 1

Interval for Case 1

Case 2

Interval for Case 2

Redoing at 99% confidence

`lm` versus `lmFit`

`lm` versus `lmFit`

From `formula` to `design.matrix`

Using contrast in plain `lm`