`lm` versus `lmFit`

We have seen two ways to use *linear models*

- To model a single variable (CT of one gene, number of cells, methylation level, etc.), we use plain R (`lm`)

- To model many genes at once (microarray or RNA-seq), we use the *limma* package (`lmFit`)

The math is the same, the code is different

`lm` versus `lmFit`

- Data for `lm` is a *data frame* with one column for each variable, for example: diet/stress, tissue, age

- Data for `lmFit` is a *matrix* with genes/miRNA in the rows, *experiments* in the columns, and the expression level in each cell
- We need an extra table describing the conditions of each experiment
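A minimal sketch of the two data layouts (all names and values below are invented for illustration):

```
library(limma)

# lm: a data frame with one row per sample, one column per variable
samples <- data.frame(
  expression = c(5.1, 4.8, 7.2, 6.9),                  # e.g. CT of one gene
  diet       = factor(c("ctrl", "ctrl", "fat", "fat"))
)
fit_lm <- lm(expression ~ diet, data = samples)

# lmFit: a genes x experiments matrix, plus a design matrix
# built from the extra table of conditions
expr <- matrix(rnorm(100 * 4), nrow = 100,
               dimnames = list(paste0("gene", 1:100),
                               paste0("sample", 1:4)))
design <- model.matrix(~ diet, data = samples)
fit_limma <- lmFit(expr, design)
```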

From `formula` to design matrix

The main difference between `lm` and `lmFit` is how to describe the model

- `lm` takes a *formula*
- `lmFit` takes a design *matrix*

If we have a `formula`, we can get the design matrix using `model.matrix`
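For example, a quick sketch with a three-level factor:

```
f <- factor(c("A", "A", "B", "B", "C", "C"))
model.matrix(~ f)      # with intercept: (Intercept), fB, fC
model.matrix(~ f + 0)  # without intercept: fA, fB, fC
```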

It is not hard to make your own design matrix if you know the math

\[y_i = β_0 + β_1 x_i + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1\\ ⋮ & ⋮ \\ 1 & x_i\\ ⋮ & ⋮ \\ 1 & x_n\\ \end{pmatrix} \begin{pmatrix}β_0\\β_1\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\]

The fitted model has the smallest value for \(∑_i e_i^2\)

\[y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + e_i\]

\[\begin{pmatrix}
y_1\\
⋮ \\
y_i\\
⋮ \\
y_n
\end{pmatrix} = \begin{pmatrix}
1 & x_{1,1} & x_{1,2}\\
⋮ & ⋮ & ⋮ \\
1 & x_{i,1} & x_{i,2}\\
⋮ & ⋮ & ⋮ \\
1 & x_{n,1} & x_{n,2}\\
\end{pmatrix}
\begin{pmatrix}β_0\\β_1\\β_2\end{pmatrix} +
\begin{pmatrix}
e_1\\
⋮ \\
e_i\\
⋮ \\
e_n
\end{pmatrix}\]

\(β_0\) (*intercept*) is the value of \(y_i\) when all \(x_{i,j}=0\)

\[y_i = β_0 + ∑_{j=1}^k β_j x_{i,j} + e_i\]

\[\begin{pmatrix}
y_1\\
⋮ \\
y_i\\
⋮ \\
y_n
\end{pmatrix} = \begin{pmatrix}
1 & x_{1,1} &\cdots& x_{1,j} &\cdots& x_{1,k}\\
⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\
1 & x_{i,1} &\cdots& x_{i,j} &\cdots& x_{i,k}\\
⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\
1 & x_{n,1} &\cdots& x_{n,j} &\cdots& x_{n,k}\\
\end{pmatrix}
\begin{pmatrix}β_0\\β_1\\⋮\\β_j\\⋮\\β_k\end{pmatrix} +
\begin{pmatrix}
e_1\\
⋮ \\
e_i\\
⋮ \\
e_n
\end{pmatrix}\]

\(β_1,…,β_k\) are the *effects* of each independent variable

\[y_i = β_1 x_{i,1} + β_2 x_{i,2} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2}\\ ⋮ & ⋮ \\ x_{i,1} & x_{i,2}\\ ⋮ & ⋮ \\ x_{n,1} & x_{n,2}\\ \end{pmatrix} \begin{pmatrix}β_1\\β_2\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\]

This is useful when \(x_{i,j}\) represents a factor, but may not be realistic when \(x_{i,j}\) is a number

\[y_i = ∑_{j=1}^k β_j x_{i,j} + e_i\]

\[\begin{pmatrix} y_1\\ ⋮ \\ y_i\\ ⋮ \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,1} &\cdots& x_{1,j} &\cdots& x_{1,k}\\ ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ x_{i,1} &\cdots& x_{i,j} &\cdots& x_{i,k}\\ ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ x_{n,1} &\cdots& x_{n,j} &\cdots& x_{n,k}\\ \end{pmatrix} \begin{pmatrix}β_1\\⋮\\β_j\\⋮\\β_k\end{pmatrix} + \begin{pmatrix} e_1\\ ⋮ \\ e_i\\ ⋮ \\ e_n \end{pmatrix}\]

Let’s say that \(f\) is a factor with \(k\) levels \((f_i ∈\{l_1,…,l_k\})\)

We represent it with \(k\) columns

\[x_{i,j} = \begin{cases}1\text{ if }f_i\text{ is equal to }l_j\\ 0\text{ if }f_i\text{ is not equal to }l_j\end{cases}\]

\[y_i = β_1 x_{i,1} + β_2 x_{i,2} + e_i\]

\[\begin{pmatrix}
y_1\\
⋮ \\
y_n
\end{pmatrix} = \begin{pmatrix}
x_{1,1} & x_{1,2}\\
⋮ & ⋮ \\
x_{n,1} & x_{n,2}\\
\end{pmatrix}
\begin{pmatrix}β_1\\β_2\end{pmatrix} +
\begin{pmatrix}
e_1\\
⋮ \\
e_n
\end{pmatrix}\]

If, for each *female*, we have \(x_{i,1}=1\) and \(x_{i,2}=0,\) then \(β_1\) will be the *mean* of \(y_i\) for all *females*
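A small sketch checking this claim (the data are invented):

```
sex <- factor(c("female", "female", "male", "male"))
y   <- c(2.0, 2.4, 3.1, 2.9)
coef(lm(y ~ sex + 0))  # sexfemale = 2.2, sexmale = 3.0
tapply(y, sex, mean)   # the same group means
```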

Let’s say that `x` is a numeric variable

- `y ~ x` is \(y_i=β_0 + β_1 x_i + e_i\)
- `y ~ x + 0` is \(y_i=β_1 x_i + e_i\)
- `y ~ x1 + x2 + x3` is \(y_i=β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + e_i\)
- `y ~ x1 + x2 + x3 + 0` is \(y_i=β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + e_i\)

Let’s say that `f` is a factor with \(k\) levels \((f_i ∈\{l_1,…,l_k\})\)

`y ~ f + 0` is \(y_i=\sum_{j=1}^k β_j x_{i,j} +e_i,\) where \[x_{i,j} = \begin{cases}1\text{ if }f_i\text{ is equal to }l_j\\
0\text{ if }f_i\text{ is not equal to }l_j\end{cases}\]

```
  fA fB fC
1  1  0  0
2  0  1  0
3  0  0  1
```

With an intercept, the *first* level is not encoded

`y ~ f` is \(y_i=β_0 + \sum_{j=2}^k β_j x_{i,j} +e_i\)

```
  (Intercept) fB fC
1           1  0  0
2           1  1  0
3           1  0  1
```

It is critical to choose the *first* level well: it is the *baseline*

\[y_i=β_0 + \sum_{j=2}^k β_j x_{i,j} +e_i\]

- \(β_0\) is the *mean* of \(y_i\) for all cases where \(f_i\) is \(l_1\) \[β_0=\text{mean}(y_i \mid f_i=l_1)\]
- \(β_j\) is the *difference of means* between \(l_j\) and \(l_1\) \[β_j=\text{mean}(y_i \mid f_i=l_j)-\text{mean}(y_i \mid f_i=l_1)\]
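A quick sketch verifying both interpretations (the data are invented):

```
f <- factor(rep(c("A", "B", "C"), each = 3))
y <- c(1.1, 0.9, 1.0, 2.2, 2.0, 2.1, 3.9, 4.1, 4.0)
coef(lm(y ~ f))                        # (Intercept) = 1.0, fB = 1.1, fC = 3.0
mean(y[f == "A"])                      # β0: the baseline mean
mean(y[f == "B"]) - mean(y[f == "A"])  # βB: difference from the baseline
```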

What if we are not interested in `C-A`?

What if we want `C-B` and `B-A`?

(maybe each level is a different time, and we want to see the daily change)

A model with intercept will not work for us

Instead we use a model *without* intercept, and *contrasts*

In *limma* there is a function `makeContrasts(…, levels)`

We describe what we want and what all the levels are

It encodes them in a matrix

```
      Contrasts
Levels C - B B - A
     A     0    -1
     B    -1     1
     C     1     0
```
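A sketch of the call that produces the matrix above:

```
library(limma)
contr <- makeContrasts(C - B, B - A, levels = c("A", "B", "C"))
contr  # one column per contrast, one row per level
```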

Contrasts in `lm`

The function `lm` can take a `contrasts=` option, but it is not easy to use

- It only works *with* an intercept
- If there are several *factors*, they are handled independently

(Maybe there is an easier way that I don’t know)

Instead, we can just do the math

Start by declaring what you want to get

- \(β_1 = C - B\)
- \(β_2 = B - A\)

We need the same number of equations and variables

- \(β_3 = A\)

Then we need to solve for \(A,B,C\)

\[\begin{aligned} β_1 &= C - B\\ β_2 &= B - A\\ β_3 &= A\end{aligned}\]

\[\begin{pmatrix} β_1\\ β_2\\ β_3\end{pmatrix}= \begin{pmatrix} 0 & -1 & 1\\ -1 & 1 & 0\\ 1 & 0 & 0\end{pmatrix} \begin{pmatrix}A\\B\\C\end{pmatrix}\]

\[\begin{pmatrix}A\\B\\C\end{pmatrix} = \begin{pmatrix} 0 & -1 & 1\\ -1 & 1 & 0\\ 1 & 0 & 0\end{pmatrix}^{-1} \begin{pmatrix}β_1\\β_2\\β_3\end{pmatrix}\]

Make a square `contrasts` matrix

(`makeContrasts` may help, but it is not enough)

```
       A  B  C
C - B  0 -1  1
B - A -1  1  0
A      1  0  0
```

One row for each equation

(warning: `makeContrasts` puts each contrast in a *column*, not a row)

Now we can solve the equation system
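In base R this is one call to `solve`; a sketch, with the same row and column names as above:

```
M <- matrix(c( 0, -1,  1,
              -1,  1,  0,
               1,  0,  0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("C - B", "B - A", "A"),  # one row per equation
                            c("A", "B", "C")))
solve(M)  # the inverse, shown below
```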

```
  C - B B - A A
A     0     0 1
B     0     1 1
C     1     1 1
```

This tells us how to change the design matrix to get what we want

First, make the design matrix

Since there is no intercept, each column is a level
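A sketch for nine samples, three per level (the factor is invented):

```
f <- factor(rep(c("A", "B", "C"), each = 3))
design <- model.matrix(~ f + 0)
colnames(design) <- levels(f)  # rename "fA", "fB", "fC" to "A", "B", "C"
```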

Now, update the design by multiplying it by the contrasts

Here `%*%` is *matrix multiplication* (i.e. rows by columns)
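Continuing the sketch, reusing `design` and the square matrix `M` from above:

```
design2 <- design %*% solve(M)  # 9 x 3, shown below
```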

design matrix

```
  A B C
1 1 0 0
2 1 0 0
3 1 0 0
4 0 1 0
5 0 1 0
6 0 1 0
7 0 0 1
8 0 0 1
9 0 0 1
```

design with contrasts

```
  C - B B - A A
1     0     0 1
2     0     0 1
3     0     0 1
4     0     1 1
5     0     1 1
6     0     1 1
7     1     1 1
8     1     1 1
9     1     1 1
```

Finally we fit our model to the data

Here the formula contains `.`, which means *all variables in the data*

We also have `+0`, meaning *no intercept*
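A sketch of that final fit; the response `y` is invented, `design2` comes from the sketch above, and `data.frame` sanitizes the contrast names:

```
y <- rnorm(9) + rep(c(1, 2, 4), each = 3)  # made-up expression values
d <- data.frame(y, design2)                # "C - B" becomes "C...B", etc.
fit <- lm(y ~ . + 0, data = d)             # `.` = all variables, +0 = no intercept
coef(fit)                                  # estimates of C-B, B-A, and A
```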

All this will be useless if we cannot get *confidence intervals*

The key part is to evaluate the *standard errors*, which depend on the *variance*

Remember that \[var(λ X_j+ ρ X_k) = λ^2var(X_j) + ρ^2var(X_k) + 2 λ ρ cov(X_j,X_k)\] and \[cov(X_j,X_j) = var(X_j)\]

The matrix \(C\) with entries \(cov(X_j,X_k)\) has the variances on its diagonal

We find that \[var(λ X_j+ ρ X_k) = (\lambda\quad\rho)⋅ C⋅ \begin{pmatrix}\lambda\\\rho\end{pmatrix}\]

In other words, the contrast matrix can also be used to update the covariance matrix
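In matrix form, if each row of \(M\) holds the weights (the \(λ\) and \(ρ\)) of one contrast, then the standard identity

\[cov(M β) = M ⋅ C ⋅ M^{\top}\]

gives the covariance matrix of the contrasts (in *limma*, the function `contrasts.fit` applies this update for you)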

Here we have the differential expression of some genes

| Replicate 1 | Replicate 2 | Replicate 3 |
|---|---|---|
| -0.6356720 | 0.5445543 | 0.5056405 |
| 0.9198619 | -0.6887110 | -0.2273942 |
| 1.1870043 | 1.0710029 | 1.3180957 |
| 0.1376069 | 1.7086511 | 1.1611300 |
| 0.8551033 | -1.0060231 | 0.4222059 |

There are three *biological* replicates for each gene

The values of the first gene are

`[1] -0.6356720 0.5445543 0.5056405`

The mean is

`[1] 0.1381743`

The standard deviation is

`[1] 0.6704529`

We have \(n=3\) values, and we are estimating 1 parameter (the mean)

Thus, we have \(3-1=2\) degrees of freedom

The critical value of the t distribution for 95% confidence and 2 degrees of freedom is

`[1] 4.302653`

Thus, the 95%-confidence interval for the mean expression, \(\bar{y} ± t ⋅ s/\sqrt{n},\) is

`[1] -1.527323 1.803672`

The interval contains 0, so it seems that the gene **is not** differentially expressed
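The same computation as a short R sketch:

```
x  <- c(-0.6356720, 0.5445543, 0.5056405)
n  <- length(x)
tc <- qt(0.975, df = n - 1)                # 4.302653
mean(x) + c(-1, 1) * tc * sd(x) / sqrt(n)  # -1.527323  1.803672
```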

The values of the third gene are

`[1] 1.187004 1.071003 1.318096`

The mean is

`[1] 1.192034`

The standard deviation is

`[1] 0.1236232`

The critical value of the t distribution for 95% confidence and 2 degrees of freedom is

`[1] 4.302653`

Thus, the 95%-confidence interval for the mean expression is

`[1] 0.8849373 1.4991313`

The interval **does not** contain 0, so it seems that the gene **is** differentially expressed

The critical value of the t distribution for 99% confidence and 2 degrees of freedom is

`[1] 9.924843`

Thus, the 99%-confidence interval for the mean expression is

`[1] 0.4836598 1.9004088`

The interval is wider than before, but it still **does not** contain 0, so even at 99% confidence the gene seems to be differentially expressed