Class 9: Linear models… again

Systems Biology

Andrés Aravena, PhD

October 26, 2021

lm versus lmFit

We have seen two ways to use linear models

  1. To model a single variable (CT of one gene, number of cells, methylation level, etc.) we use plain R
fitted.model <- lm(formula, data)
  2. To model many genes in a microarray or RNA-seq experiment, we use the limma package
fitted.model <- lmFit(data, design.matrix)

The math is the same, the code is different

lm versus lmFit

  • Data for lm is a data frame with one column for each variable
    • for example: diet/stress, tissue, age
  • Data for lmFit is a matrix with genes/miRNA in the rows, experiments in the columns, and the expression level in each cell
    • We need an extra table describing the conditions of each experiment
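
For example, the two layouts might look like this (a minimal sketch; all object names here are hypothetical):

pheno <- data.frame(ct   = c(21.3, 24.1, 22.8, 25.0),
                    diet = factor(c("normal", "stress", "normal", "stress")))
fitted.model <- lm(ct ~ diet, data=pheno)    # one row per sample

library(limma)
expr <- matrix(rnorm(8), nrow=2,
               dimnames=list(c("gene1", "gene2"),
                             c("exp1", "exp2", "exp3", "exp4")))
targets <- data.frame(diet=factor(c("normal", "stress", "normal", "stress")))
design.matrix <- model.matrix(~diet, data=targets)  # one row per experiment
fitted.model <- lmFit(expr, design.matrix)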

From formula to design.matrix

The main difference between lm and lmFit is how we describe the model

  • lm uses a formula
  • lmFit uses a design matrix

If we have a formula, we can get the design matrix using

design.matrix <- model.matrix(formula)

It is not hard to make your own design matrix if you know the math
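
For example, for the simple model on the next slide, the design matrix is just a column of ones next to the values of \(x\) (a minimal sketch):

x <- c(1.5, 2.0, 3.5)
design.matrix <- cbind("(Intercept)"=1, x=x)   # same result as model.matrix(~x)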

Math to matrix

\[y_i = β_0 + β_1 x_i + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1\\ ⋮ & ⋮ \\ 1 & x_i\\ ⋮ & ⋮ \\ 1 & x_n\\ \end{pmatrix} \begin{pmatrix}β_0\\β_1\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\]

The fitted model has the smallest value for \(∑_i e_i^2\)
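
We can check in R that lm finds exactly the least-squares solution \(\hat β=(X^TX)^{-1}X^Ty\) (a minimal sketch with simulated data):

set.seed(1)
x <- 1:10
y <- 2 + 3*x + rnorm(10)
X <- cbind(1, x)                 # design matrix made by hand
solve(t(X) %*% X, t(X) %*% y)    # least-squares coefficients
coef(lm(y ~ x))                  # same values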

Two independent variables

\[y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & x_{1,2}\\ ⋮ & ⋮ & ⋮ \\ 1 & x_{i,1} & x_{i,2}\\ ⋮ & ⋮ & ⋮ \\ 1 & x_{n,1} & x_{n,2}\\ \end{pmatrix} \begin{pmatrix}β_0\\β_1\\β_2\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\] \(β_0\) (intercept) is the value of \(y_i\) when all \(x_{i,j}=0\)

General case

\[y_i = β_0 + ∑_{j=1}^k β_j x_{i,j} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} &\cdots& x_{1,j} &\cdots& x_{1,k}\\ ⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ 1 & x_{i,1} &\cdots& x_{i,j} &\cdots& x_{i,k}\\ ⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ 1 & x_{n,1} &\cdots& x_{n,j} &\cdots& x_{n,k}\\ \end{pmatrix} \begin{pmatrix}β_0\\β_1\\⋮\\β_j\\⋮\\β_k\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\] \(β_1,…,β_k\) are the effects of each independent variable

Model without intercept

\[y_i = β_1 x_{i,1} + β_2 x_{i,2} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2}\\ ⋮ & ⋮ \\ x_{i,1} & x_{i,2}\\ ⋮ & ⋮ \\ x_{n,1} & x_{n,2}\\ \end{pmatrix} \begin{pmatrix}β_1\\β_2\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\] This is useful when the \(x_{i,j}\) represent a factor, but may not be realistic when \(x_{i,j}\) is a number

General case without intercept

\[y_i = ∑_{j=1}^k β_j x_{i,j} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} &\cdots& x_{1,j} &\cdots& x_{1,k}\\ ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ x_{i,1} &\cdots& x_{i,j} &\cdots& x_{i,k}\\ ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ x_{n,1} &\cdots& x_{n,j} &\cdots& x_{n,k}\\ \end{pmatrix} \begin{pmatrix}β_1\\⋮\\β_j\\⋮\\β_k\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\]

Factors

Let’s say that \(f\) is a factor with \(k\) levels \((f_i ∈\{l_1,…,l_k\})\)

We represent it with \(k\) columns

\[x_{i,j} = \begin{cases}1\text{ if }f_i\text{ is equal to }l_j\\ 0\text{ if }f_i\text{ is not equal to }l_j\end{cases}\]

Example: sex

\[y_i = β_1 x_{i,1} + β_2 x_{i,2} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2}\\ ⋮ & ⋮ \\ x_{n,1} & x_{n,2}\\ \end{pmatrix} \begin{pmatrix}β_1\\β_2\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_n \end{pmatrix}\] If, for each female, we have \(x_{i,1}=1\) and \(x_{i,2}=0,\)
then \(β_1\) will be the mean of \(y_i\) over all females, and \(β_2\) the mean over all males
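
A quick check in R, with hypothetical data:

sex <- factor(c("female", "female", "male", "male"))
y <- c(10, 12, 20, 22)
coef(lm(y ~ sex + 0))
# sexfemale = 11 (mean of the females), sexmale = 21 (mean of the males)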

Formula to math

Let’s say that x is a numeric variable

  • y ~ x is \(y_i=β_0 + β_1 x_i + e_i\)

  • y ~ x+0 is \(y_i=β_1 x_i + e_i\)

  • y ~ x1 + x2 + x3 is \(y_i=β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + e_i\)

  • y ~ x1 + x2 + x3 + 0 is \(y_i=β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + e_i\)
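
We can verify the effect of +0 with model.matrix (a minimal sketch):

x <- c(1.5, 2.0, 3.5)
model.matrix(~ x)       # has an "(Intercept)" column and an x column
model.matrix(~ x + 0)   # only the x column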

Formulas with factors

Let’s say that f is a factor with \(k\) levels \((f_i ∈\{l_1,…,l_k\})\)

y ~ f+0 is \(y_i=\sum_{j=1}^k β_j x_{i,j} +e_i,\) where \[x_{i,j} = \begin{cases}1\text{ if }f_i\text{ is equal to }l_j\\ 0\text{ if }f_i\text{ is not equal to }l_j\end{cases}\]

f <- factor(c("A","B","C"))
model.matrix(~f+0) |> as.data.frame()
  fA fB fC
1  1  0  0
2  0  1  0
3  0  0  1

Formulas with factors and intercept

Now the first level is not encoded

y ~ f is \(y_i=β_0 + \sum_{j=2}^k β_j x_{i,j} +e_i\)

f <- factor(c("A","B","C"))
model.matrix(~f) |> as.data.frame()
  (Intercept) fB fC
1           1  0  0
2           1  1  0
3           1  0  1

It is critical to choose the first level well
It is the baseline

Interpretation

\[y_i=β_0 + \sum_{j=2}^k β_j x_{i,j} +e_i\]

  • \(β_0\) is the mean of \(y_i\) for all cases where \(f_i\) is \(l_1\) \[β_0=\text{mean}(y_i \vert f_i=l_1)\]
  • \(β_j\) is the difference of means between \(l_j\) and \(l_1\) \[β_j=\text{mean}(y_i | f_i=l_j)-\text{mean}(y_i | f_i=l_1)\]
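
We can verify this interpretation in R (hypothetical data):

f <- factor(c("A", "A", "B", "B", "C", "C"))
y <- c(1, 3, 6, 8, 10, 12)
tapply(y, f, mean)    # group means: A=2, B=7, C=11
coef(lm(y ~ f))       # (Intercept)=2, fB=7-2=5, fC=11-2=9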

Contrasts

An easier way to handle factors

What if we are not interested in C-A?
What if we want C-B and B-A?
(maybe each level is a different time, and we want to see the daily change)

A model with intercept will not work for us

Instead we use a model without intercept, and contrasts

Contrasts describe what we want

In limma there is a function makeContrasts(…, levels).
We describe the contrasts we want and list all the levels,
and it encodes them in a matrix

makeContrasts(C-B, B-A, levels=f)
      Contrasts
Levels C - B B - A
     A     0    -1
     B    -1     1
     C     1     0

Using contrasts in limma

design.matrix <- model.matrix(~ f + 0)
colnames(design.matrix) <- levels(f)  # rename fA, fB, fC to A, B, C
fit <- lmFit(data, design.matrix)

contrasts <- makeContrasts(C-B, B-A, levels=design.matrix)
fit.contrasts <- contrasts.fit(fit, contrasts)
eb <- eBayes(fit.contrasts)
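
After eBayes we can extract the moderated tests for each contrast, for example with topTable (a minimal sketch, assuming the fit above):

topTable(eb, coef="C - B")   # top genes for the C - B contrast
topTable(eb, coef="B - A")   # top genes for the B - A contrast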

Using contrasts in plain lm

The function lm can take a contrasts= argument, but it is not easy to use

  • It only works with intercept
  • If there are several factors, they are handled independently

(Maybe there is an easier way that I don’t know)

Instead, we can just do the math

Making contrasts manually

Start by declaring what you want to get

  • \(β_1 = C - B\)
  • \(β_2 = B - A\)

We need the same number of equations as variables, so we add one more

  • \(β_3 = A\)

Then we need to solve for \(A,B,C\)

As a matrix

\[\begin{aligned} β_1 &= C - B\\ β_2 &= B - A\\ β_3 &= A\end{aligned}\]

\[\begin{pmatrix} β_1\\ β_2\\ β_3\end{pmatrix}= \begin{pmatrix} 0 & -1 & 1\\ -1 & 1 & 0\\ 1 & 0 & 0\end{pmatrix} \begin{pmatrix}A\\B\\C\end{pmatrix}\]

Solving for \(A,B,C\)

\[\begin{pmatrix}A\\B\\C\end{pmatrix} = \begin{pmatrix} 0 & -1 & 1\\ -1 & 1 & 0\\ 1 & 0 & 0\end{pmatrix}^{-1} \begin{pmatrix}β_1\\β_2\\β_3\end{pmatrix}\]

Doing math in R

Make a square contrasts matrix
(makeContrasts may help, but it is not enough)

contrasts
       A  B C
C - B  0 -1 1
B - A -1  1 0
A      1  0 0

One row for each equation
(warning: makeContrasts puts each contrast in a column; here each contrast is a row)
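
A minimal sketch of building this matrix by hand:

contrasts <- rbind("C - B" = c( 0, -1, 1),
                   "B - A" = c(-1,  1, 0),
                   "A"     = c( 1,  0, 0))
colnames(contrasts) <- c("A", "B", "C")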

Solving in R

Now we can solve the equation system

solve(contrasts)
  C - B B - A A
A     0     0 1
B     0     1 1
C     1     1 1

This tells us how to change the design matrix to get what we want

Putting it all together

First, make the design matrix

design.matrix <- model.matrix(~ f + 0)
colnames(design.matrix) <- levels(f)  # rename the columns to A, B, C

Since there is no intercept, each column is a level.
Now, update the design by multiplying by the contrasts

design.with.contrasts <- as.data.frame(design.matrix %*% solve(contrasts))

Here %*% is matrix multiplication (i.e. rows by columns); as.data.frame is needed because lm does not accept a matrix as data

Before and after

design matrix

  A B C
1 1 0 0
2 1 0 0
3 1 0 0
4 0 1 0
5 0 1 0
6 0 1 0
7 0 0 1
8 0 0 1
9 0 0 1

design with contrasts

  C - B B - A A
1     0     0 1
2     0     0 1
3     0     0 1
4     0     1 1
5     0     1 1
6     0     1 1
7     1     1 1
8     1     1 1
9     1     1 1

Fitting the linear model

Finally we fit our model to the data

lm(y ~ . + 0, data=design.with.contrasts)

Here the formula contains ., which means all variables in the data

We also have +0 meaning no intercept
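
Here is the whole recipe on simulated data (a minimal sketch; the factor f, the response y, and the group means are hypothetical):

f <- factor(rep(c("A", "B", "C"), each=3))
set.seed(1)
y <- c(1, 2, 3)[f] + rnorm(9, sd=0.1)    # true means: A=1, B=2, C=3

contrasts <- rbind("C - B" = c( 0, -1, 1),
                   "B - A" = c(-1,  1, 0),
                   "A"     = c( 1,  0, 0))
colnames(contrasts) <- levels(f)

design.matrix <- model.matrix(~ f + 0)
colnames(design.matrix) <- levels(f)
design.with.contrasts <- as.data.frame(design.matrix %*% solve(contrasts))
coef(lm(y ~ . + 0, data=design.with.contrasts))
# `C - B` and `B - A` should both be close to 1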

Confidence intervals and \(p\)-values

All this will be useless if we cannot get confidence intervals

The key part is to evaluate the standard errors, which depend on the variance

Remember that \[var(λ X_j+ ρ X_k) = λ^2var(X_j) + ρ^2var(X_k) + 2 λ ρ cov(X_j,X_k)\] and \[cov(X_j,X_j) = var(X_j)\]

Covariance matrix

The matrix \(C\) with entries \(C_{j,k} = cov(X_j,X_k)\) has the variances on its diagonal

We find that \[var(λ X_j+ ρ X_k) = (\lambda\quad\rho)⋅ C⋅ \begin{pmatrix}\lambda\\\rho\end{pmatrix}\]

In other words, the contrast matrix can also be used to update the covariance matrix
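
Concretely, if the vector of new coefficients is \(β' = Λβ\) for a contrast matrix \(Λ\), applying the rule above to every pair of entries gives

\[cov(β') = Λ\,C\,Λ^T\]

so the standard errors of the contrasts come from the diagonal of \(Λ\,C\,Λ^T\)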

Other details

Log ratios

y1 <- sample(1:100, size=1000, replace=TRUE)
y2 <- sample(1:100, size=1000, replace=TRUE)
hist(y2/y1, nclass=30)
abline(v=1, col="red", lwd=3)

Log ratios

hist(log(y2) - log(y1), nclass=30)
abline(v=0, col="red", lwd=3)

The ratio y2/y1 is skewed: it is bounded below by 0 but unbounded above. The log ratio is symmetric around 0, so up- and down-regulation look the same

Power

Can we see the change?

Here we have the differential expression of some genes

 Replicate 1  Replicate 2  Replicate 3
  -0.6356720    0.5445543    0.5056405
   0.9198619   -0.6887110   -0.2273942
   1.1870043    1.0710029    1.3180957
   0.1376069    1.7086511    1.1611300
   0.8551033   -1.0060231    0.4222059

There are three biological replicates for each gene

Case 1

The values of the first gene are

[1] -0.6356720  0.5445543  0.5056405

The mean is

[1] 0.1381743

The standard deviation is

[1] 0.6704529

Interval for Case 1

We have \(n=3\) values, and we are estimating one parameter (the mean)

Thus, we have 3-1=2 degrees of freedom

The critical t value for 95% confidence and 2 degrees of freedom is

[1] 4.302653

Thus, the 95%-confidence interval for the expression is

[1] -2.746552  3.022900

The interval contains 0, so it seems that the gene is not differentially expressed
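
These numbers can be reproduced with a few lines of R; note that the interval above is mean ± t·s (an interval for the mean itself would divide \(s\) by \(\sqrt{n}\)):

x <- c(-0.6356720, 0.5445543, 0.5056405)
m <- mean(x)              # 0.1381743
s <- sd(x)                # 0.6704529
t95 <- qt(0.975, df=2)    # 4.302653
c(m - t95*s, m + t95*s)   # -2.746552  3.022900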

Case 2

The values of the third gene are

[1] 1.187004 1.071003 1.318096

The mean is

[1] 1.192034

The standard deviation is

[1] 0.1236232

Interval for Case 2

The critical t value for 95% confidence and 2 degrees of freedom is

[1] 4.302653

Thus, the 95%-confidence interval for the expression is

[1] 0.6601268 1.7239418

The interval does not contain 0, so it seems that the gene is differentially expressed

Redoing at 99% confidence

The critical t value for 99% confidence and 2 degrees of freedom is

[1] 9.924843

Thus, the 99%-confidence interval for the expression is

[1] -0.03490616  2.41897477

Now the interval contains 0, so at 99% confidence the gene does not seem to be differentially expressed