We would like to know how genes/proteins/molecules interact
These interactions can be described with a network
So we would like to find the network that represents these interactions
At first, we want to know who interacts with whom
Metabolic networks are directed bipartite graphs, with metabolites and reactions
If we know the stoichiometry of all reactions, we automatically know the metabolites
If we know the enzymes that the genome can encode, then we know the reactions
This is the approach taken by KEGG and BioCyc
Transcriptional regulation has several parts: some genes encode transcription factors, which can bind to DNA
We simplify to a directed bipartite network: genes and binding sites
We can predict if a gene encodes a transcription factor
We can predict if a site is a binding site
But we need experiments to see which transcription factor binds to which sites
Metabolic and transcriptional networks are very important, and we need several classes to study them
Today we begin with a simpler problem: gene interaction networks
These will be undirected non-bipartite networks
They may be later extended to a directed network with more attributes
First idea: Correlation
The correlation matrix is square, so we can see it as an adjacency matrix
Thus, we can draw a network
Depending on how we define “significant correlation”, we have more or less edges
If nothing is connected to anything, we do not learn anything
The same happens if everything is connected to everything
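The idea of turning a correlation matrix into a network can be sketched in a few lines of numpy; the expression matrix here is synthetic and the 0.7 cutoff is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))          # 50 samples, 5 genes (synthetic data)

C = np.corrcoef(X, rowvar=False)      # 5x5 sample correlation matrix
A = (np.abs(C) > 0.7).astype(int)     # adjacency: edge if |correlation| above cutoff
np.fill_diagonal(A, 0)                # ignore self-correlations

print(A)  # symmetric 0/1 matrix: an undirected network
```

Raising the cutoff removes edges; lowering it adds them, which is exactly the trade-off discussed above.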
Thresholding
Pruning
Regularization
Remember that we observe the sample correlation, which is usually different from the real correlation
Sample correlation may be non-zero even if the real correlation is zero
We can use a statistical test to decide which correlations are significant
This test defines a threshold, which depends on the number of samples and on the number of gene pairs we test
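As a sketch of how such a threshold arises: under the null hypothesis of zero correlation, \(t = r\sqrt{(n-2)/(1-r^2)}\) follows a t-distribution with \(n-2\) degrees of freedom, and a Bonferroni correction over all gene pairs sharpens the cutoff (the sample and gene counts below are illustrative):

```python
import numpy as np
from scipy import stats

n_samples, n_genes = 50, 1000
n_pairs = n_genes * (n_genes - 1) // 2
alpha = 0.05 / n_pairs                  # Bonferroni correction over all gene pairs

# Invert the t-statistic of the correlation test to get the critical |r|
t_crit = stats.t.ppf(1 - alpha / 2, df=n_samples - 2)
r_crit = t_crit / np.sqrt(n_samples - 2 + t_crit**2)

print(round(r_crit, 3))  # correlations weaker than this are not significant
```

Note how testing many pairs pushes the critical correlation up, so more nodes (hence more pairs) means a stricter threshold.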
Even after thresholding, we often have spurious links
If A is correlated with B, and B is correlated with C, then there will be a small (but significant) correlation between A and C
One pruning strategy is to drop the “weakest link” of every triangle
Another strategy is to discard the “weakest link” in every cycle
This results in a network with the shape of a tree
More specifically, we get a maximum spanning tree: the tree that keeps the strongest links
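One way to sketch this: negate the correlation strengths (strong correlation becomes small distance) and take a minimum spanning tree, which is then a maximum spanning tree of the correlations. The 4-gene correlation matrix below is a made-up toy example:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Toy |correlation| matrix for 4 genes (hypothetical values)
C = np.array([[1.0, 0.9, 0.8, 0.1],
              [0.9, 1.0, 0.7, 0.2],
              [0.8, 0.7, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]])

W = 1.0 - np.abs(C)        # strong correlation -> small distance
np.fill_diagonal(W, 0.0)

# MST of the distances = maximum spanning tree of the correlations
T = minimum_spanning_tree(W).toarray()
edges = np.argwhere(T > 0)  # n-1 = 3 edges, no cycles
print(edges)
```

In every cycle of the original graph, the weakest correlation (largest distance) is exactly the edge the spanning tree discards.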
Expression has a random component
Genes may not be independent, so we model the expression levels as jointly Gaussian random variables
This joint distribution is called a multinormal (multivariate normal) distribution
For one variable, the Normal distribution is \[\frac{1}{\sqrt{2πσ^2}}\exp(-(x-μ)^2/2σ^2)\] where \(\sigma^2\) is the variance. The multinormal distribution is \[\frac{1}{\sqrt{(2π)^n\det(Σ)}}\exp(-(𝐱-𝐮)^T Σ^{-1} (𝐱-𝐮)/2)\] where \(𝐱\) is a vector and \(Σ\) is the covariance matrix
Ignoring the constants, replacing \(𝐲 = 𝐱-𝐮,\) and taking logarithms, we have that the probability depends on \[𝐲^T Σ^{-1} 𝐲\] so the relationship between the genes is given by \[K=Σ^{-1}\]
This \(K\) is called the precision matrix
(remember that the vector \(𝐲\) components are the gene expressions)
\[\begin{aligned} K&=\begin{pmatrix} 1.00 & -0.25 & 0.0\\ -0.25 & 1.00 & 0.3\\ 0.00 & 0.30 & 1.0\\ \end{pmatrix}\\ \Sigma&=\begin{pmatrix} 1.0737 & 0.2949 & -0.0884\\ 0.2949 & 1.1799 & -0.3539\\ -0.0884 & -0.3539 & 1.1061\\ \end{pmatrix}\end{aligned} \]
Note that \(Σ_{13}\neq 0\) even though \(K_{13}=0\): the covariance does not reflect the direct relationships
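The matrices above can be checked numerically; inverting \(K\) recovers \(Σ\) up to rounding:

```python
import numpy as np

K = np.array([[ 1.00, -0.25, 0.00],
              [-0.25,  1.00, 0.30],
              [ 0.00,  0.30, 1.00]])

Sigma = np.linalg.inv(K)   # covariance = inverse of the precision matrix
print(np.round(Sigma, 4))
```

Genes 1 and 3 have zero entry in \(K\) (no direct interaction), yet their covariance is non-zero because both interact with gene 2.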
As usual, the problem is that we do not know the real covariance, only the sample covariance
Moreover, if we have fewer samples than genes, the sample covariance matrix is singular and cannot be inverted
Instead, we use methods to calculate pseudo-inverses
We will see one of these methods
One condition we can require from our solution is that each gene interacts only with a small number of genes
In general we do not expect that each of the thousands of genes interacts with most of the other thousands of genes
In other words, we expect that the adjacency matrix has mostly zeros
Matrices where the majority of the values are zero are called sparse
Lasso is a method used in linear models to force sparse coefficients
Instead of minimizing \(\sum_i \left(y_i - \sum_j β_j x_{ij}\right)^2,\) the lasso method minimizes \[\sum_i \left(y_i - \sum_j β_j x_{ij}\right)^2 + λ \sum_j\vert \beta_j\vert\]
The parameter \(λ\) has to be chosen by us.
Larger \(λ\) means sparser \(β_j,\) but bigger error
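A minimal sketch of this trade-off with scikit-learn's `Lasso` (the data are synthetic, with only the first two features truly relevant; note that sklearn calls the penalty `alpha` rather than \(λ\)):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
# Only features 0 and 1 truly matter (synthetic ground truth)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

for lam in (0.01, 0.5):
    beta = Lasso(alpha=lam).fit(X, y).coef_
    print(lam, int(np.sum(beta != 0)))  # larger lambda -> fewer non-zero coefficients
```

With the larger penalty only the two truly relevant coefficients survive, at the cost of shrinking them away from their true values.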
Using the same philosophy, one way to find a sparse pseudo-inverse of \(\Sigma\) is to find the \(K\) that maximizes \[\log\det(K)-\text{tr}(Σ K) -λ\Vert K\Vert_1\]
Of course, this is done by the computer for us
We only need to pay attention to \(λ\)
Depending on the value of \(\lambda,\) the result of the graphical lasso will be more or less sparse
That means that the networks will have more or less edges
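A sketch with scikit-learn's `GraphicalLasso`, sampling data from the 3-gene precision matrix used earlier (the sample size and the two penalty values are illustrative; sklearn again calls the penalty `alpha`):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
K = np.array([[ 1.00, -0.25, 0.00],
              [-0.25,  1.00, 0.30],
              [ 0.00,  0.30, 1.00]])
Sigma = np.linalg.inv(K)
X = rng.multivariate_normal(np.zeros(3), Sigma, size=500)

counts = {}
for lam in (0.01, 0.2):
    K_hat = GraphicalLasso(alpha=lam).fit(X).precision_
    # count edges: non-zero off-diagonal entries of the estimated precision matrix
    counts[lam] = int(np.sum(np.abs(K_hat[np.triu_indices(3, k=1)]) > 1e-6))
print(counts)  # larger lambda -> sparser network, fewer edges
```

The estimated precision matrix is the adjacency structure of the inferred network: a non-zero off-diagonal entry is an edge between two genes.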