Modeling regulations explaining co-expressions parsimoniously

November 4, 2015

My name is Andrés Aravena

Türkçe bilmiyorum (I don't speak Turkish) 😟
I am

Assistant Professor at IU Mol. Biology and Genomics Dpt.
Mathematical Engineer, U. of Chile
PhD Informatics, U Rennes 1, France
PhD Mathematical Modeling, U. of Chile
not a Biologist
but an Applied Mathematician who can speak “biologist language”

I’ve worked on

Big and small computers
Telecommunication Networks
Between 2003 and 2014 I was the chief research engineer
- in the main bioinformatics group in my country (MATHomics)
- in the top research center (Center for Mathematical Modeling)
- in the top university of my country (University of Chile)

I come from Chile

world

Chile

chile

Nearly 17 million people

University rankings similar to Turkish ones

Spanish colony 500 years ago (so language is Spanish)

Independent Republic 200 years ago

First Latin American country to recognize the Turkish Republic

Everyday life very similar to Turkey

Chilean Economy: Exports

exports

1st world producer of copper

2nd world producer of salmon

Fruits: peaches, grapes, apples, avocado

Wine: exported worldwide

Biotechnology can improve all these industries

Official data for 2014. Banco Central de Chile

Some projects I collaborated on

Grapes:
- development related to seed and grape size
Wine:
- quality control on exported wine,
- avoid secondary fermentation
Salmon:
- effect of diet on metabolism,
Mining:
- copper extraction using bacteria

Copper is heated and melted

to separate it from other compounds

This is
very expensive …

… and polluting

(this smoke is sulphuric acid)

Solution: Bioleaching

The use of bacteria to extract elements from ore

Bioleaching is much better than melting copper

Reduced contamination
Cheaper

The goal is to understand and improve the involved bacteria so this technology can be used extensively

Enables building new mines

It is like discovering petrol reserves for the country

Bioleaching bacteria

We had a research contract with the main mining company

State owned, big enough to pay for long term research

We focused mainly on 2 questions:

Monitoring the microbial community in the mine
Understanding how these bacteria do “mining”

Monitoring Environmental Community

We developed models and tools to design

qPCR primers
oligos for microarrays
statistical models
practical software tools

that enable quick and precise detection and quantification of the complex metagenome

Most of the results are still industrial secrets

Patented in: USA, South Africa, Australia, Mexico, Peru, China, Chile and Argentina

N. Ehrenfeld, A. Aravena, A. Reyes-Jara, N. Barreto, R. Assar, A. Maass, P. Parada, Design and use of oligonucleotide microarrays for identification of Biomining microorganisms. Advanced Materials Research 71-73 (2009) 155-158.

Understanding Bioleaching Bacteria

We sequenced the genome of the three most abundant species:

Acidithiobacillus ferrooxidans
Acidithiobacillus thiooxidans
Leptospirillum ferrooxidans

For the first one we also

hybridized nearly 100 microarrays
analyzed metabolome under several conditions

We focused on several questions.

One of the key ones was:
understanding transcriptional regulation
Limitations:
- Cell modification is not feasible
- Knock out is not feasible

Our Approach

Modeling regulation by integrating genomic and transcriptomic data.

Microarray results for several stress conditions
- identification of co-expressed genes
Annotated genomic sequence
- Identify putative Transcription Factors and Binding Sites.

Using E.coli for model evaluation

Since A. ferrooxidans regulation data is scarce, we use E.coli as a test platform.

Genomic sequence available
- 4523 genes
Differential expression data available:
- 907 arrays in M^3D
Several experimentally validated regulations described in the literature
- RegulonDB 8.1 describes 2650 E.coli operons.
- Describes 1652 regulations between operons.

Co-expression

Identification of sets of genes sharing similar behaviors through different environmental conditions, by

Linear Correlation, or
Mutual Information (many methods)

Result: Big influence graphs where 2 genes are connected when they are “similar”. Millions of edges

Problem: Confusion between direct and indirect relationships
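As a minimal sketch of the linear-correlation variant, assuming toy expression profiles and an arbitrary threshold of 0.9 (both hypothetical, not the values used in this work), an influence graph can be built like this:

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson linear correlation between two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def influence_graph(profiles, threshold=0.9):
    """Connect two genes when |correlation| of their profiles reaches the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges

# Toy data (hypothetical): one profile per gene, one value per condition.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],    # follows geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],    # mirrors geneA
    "geneD": [1.0, -1.0, 1.0, -1.0],  # unrelated to the others
}
edges = influence_graph(profiles)
```

Note that geneB and geneC end up connected only because both follow geneA, which is the direct/indirect confusion named in the problem statement.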

Significant co-expression

Several approaches exist to separate direct and indirect relationships:

Relevance Networks, ARACNe, C3NET, MRNET

Network size is reduced 10 to 20 times
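One of these ideas, the data processing inequality used by ARACNe, can be sketched as follows. This is a simplified illustration: the scores are hypothetical, and the real method also estimates mutual information and applies a tolerance, which are left out here.

```python
from itertools import combinations

def dpi_prune(weights):
    """Remove the weakest edge of every fully connected gene triple.

    weights: dict mapping frozenset({g1, g2}) -> similarity score
    (for example mutual information). Returns the surviving edges.
    """
    genes = sorted({g for edge in weights for g in edge})
    to_remove = set()
    for a, b, c in combinations(genes, 3):
        triangle = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
        if all(e in weights for e in triangle):
            # The weakest edge is taken to be an indirect relationship.
            to_remove.add(min(triangle, key=lambda e: weights[e]))
    return {e: w for e, w in weights.items() if e not in to_remove}

# Toy scores (hypothetical): A-C is the weakest edge of the triangle A, B, C.
scores = {
    frozenset({"A", "B"}): 0.9,
    frozenset({"B", "C"}): 0.8,
    frozenset({"A", "C"}): 0.3,
    frozenset({"C", "D"}): 0.5,
}
pruned = dpi_prune(scores)
```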

Influence graphs

Abstraction describing empirical co-expression between genes

Several noise sources can affect the result
Unmeasurable changes can affect regulation
They often do not represent physical interactions
They cannot be interpreted causally

Even so, they convey information about the underlying transcriptional mechanisms.

Pairs of co-expressed operons

We assume that all genes in each operon are co-expressed. This simplifies the analysis

I used the Maximal Relevance/Minimal Redundancy criterion (MRNET) to determine co-expressed operons (edges of the influence network)

Result: Influence network with 61,506 edges.
6 of them are validated regulations

How to explain the other co-expressions?

Physical Interaction networks

A transcriptional regulatory network (TRN) is a physical model of the interactions

from regulators: genes coding for Transcription Factors
to target genes: those having a Binding Site for the TF in the promoter region
Modulate the global expression of genes through regulatory cascades.

Model: Explaining co-expression

Co-expression is explained by the existence of a common regulator acting on them directly or indirectly through a regulatory cascade. Either:

There is a directed path from one gene to the other. The first is regulating the last by a regulatory cascade.
None of the genes is regulating the other but both are co-regulated by a third gene.
- This case is represented in the network by a v-shape: two paths from a common regulator to each co-regulated gene.

v-shapes

For a given pair of co-regulated genes A and B, we want to find the possible explanations for their co-regulation.
Thus, we call an explanation of A and B any path from A to B or from B to A, or any set of arcs forming a v-shape between them.
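The definition above can be checked with a plain graph search. The sketch below assumes a TRN stored as a dictionary from regulator to targets; the toy network and the function names are hypothetical.

```python
def reachable(graph, source):
    """Set of vertices reachable from source (including source) by directed paths."""
    seen, stack = {source}, [source]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def is_explained(graph, a, b):
    """True when a path a->b, a path b->a, or a common regulator of a and b exists."""
    vertices = set(graph) | {t for targets in graph.values() for t in targets}
    # A common regulator r reaches both genes; r == a or r == b covers the path cases.
    return any({a, b} <= reachable(graph, r) for r in vertices)

# Toy TRN (hypothetical): regulator -> list of targets.
trn = {"R": ["X", "A"], "X": ["B"], "A": ["C"], "Z": ["W"]}

explained_by_v_shape = is_explained(trn, "A", "B")  # R regulates both
explained_by_path = is_explained(trn, "A", "C")     # A regulates C
unexplained = is_explained(trn, "C", "W")           # no common regulator
```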

Experimental regulations explain few co-expressions

The network of experimentally validated regulations described in RegulonDB only explains 3,990 (6.5%) of the 61,506 observed co-expressions.

Only a few co-expressions were explained by a single validated arc
The rest could only be explained through regulatory cascades.

Predicted TRN can explain most co-expressions

A putative TRN was built using the E.coli genomic sequence and patterns from the Prodoric database of transcription factors and binding sites.

We found that this putative TRN explained 91.1% of the pairs of co-expressed operons.

Building a putative TRN

Putative TRNs are usually huge

Putative TRNs are usually huge, due to the low specificity of methods based on the sequence.

Putative TRN has 25,604 regulations
Predicted regulations may not be real
But contains regulations that explain 91.1% of co-expressions
A realistic subnetwork can be chosen in a biologically meaningful way

This is the main motivation of our model

Lombarde

our model

Graphical Illustration

Overview of LOMBARDE

The LOMBARDE method requires for the studied organism the following input:

a putative TRN represented by a weighted directed graph \(\mathcal G\), with vertices corresponding to genes and arcs connecting regulator genes to regulated ones.
- An arc connects two genes if the first gene codes for a transcription factor that presumably binds in the promoter region of the second gene.
- The \(p\)-value \(p_i\) associated with this arc reflects the confidence level of this prediction.
a set \(\mathcal O\) of pairs of co-expressed genes.

Overview of LOMBARDE

In a first stage LOMBARDE assigns to each arc a discrete cost \(w_i\) in a way such that the more confident arcs have lower cost. \[w_i = F(p_i)\]
LOMBARDE discretizes the \(p\)-values into \(k\) categories.
This allows us to define the function \(Cost(S)\) for any subgraph \(S\) as the sum of the costs of its arcs. \[Cost(S)=\sum_{i\in S}w_i\]

Costs of arcs

To avoid “shortcuts”
we use costs that grow
exponentially

Better ten “good” steps
at cost 1
than one “weak” step
at cost 10
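A minimal sketch of such a cost function follows. The thresholds, the base 10 and the number of categories are illustrative assumptions; the slides do not give the actual \(F\).

```python
from bisect import bisect_left

def arc_cost(p_value, thresholds=(1e-6, 1e-4, 1e-2), base=10):
    """Map a p-value to a discrete cost that grows exponentially with the category.

    thresholds and base are illustrative assumptions, giving
    k = len(thresholds) + 1 categories: p <= 1e-6 costs 1,
    p <= 1e-4 costs 10, and so on.
    """
    category = bisect_left(thresholds, p_value)
    return base ** category

def subgraph_cost(arcs, p_values):
    """Cost(S): sum of the costs of the arcs in subgraph S."""
    return sum(arc_cost(p_values[arc]) for arc in arcs)

# Toy p-values (hypothetical) for three predicted arcs.
p_values = {("R", "A"): 1e-8, ("R", "B"): 1e-5, ("W", "A"): 0.5}
strong_path = subgraph_cost([("R", "A"), ("R", "B")], p_values)
weak_arc = subgraph_cost([("W", "A")], p_values)
```

With these assumed values, two confident arcs together are still far cheaper than a single weak one, which is the effect the exponential growth is meant to produce.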

Overview of LOMBARDE

In a second stage LOMBARDE deciphers the co-expression of the pair \((gene_{1}, gene_{2})\in \mathcal O\) by identifying a common regulator \(gene_{3}\) which is connected to both \(gene_{1}\) and \(gene_{2}\) via regulatory cascades of high confidence.
In graph terms, a subgraph \(S\) is a v-shape for the pair \((gene_{1}, gene_{2})\) if \(S\) is the union of two independent paths from \(gene_{3}\) (the common regulator) to \(gene_{1}\) and to \(gene_{2}\).

Confident explanations

A v-shape for \((gene_{1}, gene_{2})\) is said to be confident if it is of minimum cost among all the explanations for the pair.
Our model transforms a parsimony criterion into a graph minimization problem.
The output of LOMBARDE is a subgraph \(\mathcal L\) of \(\mathcal G\) built as the union of all confident explanations for each co-expressed pair in \(\mathcal O\).
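The minimization can be approximated with shortest-path searches. The sketch below is a simplification, not the LOMBARDE implementation: it scores each candidate regulator by the sum of its two cheapest path costs, does not enforce that the two paths are independent, and returns only one minimum-cost explanation instead of the union of all of them. The toy network and function names are hypothetical.

```python
import heapq

def dijkstra(graph, source):
    """Cheapest path cost from source to every reachable vertex.

    graph: dict mapping regulator -> {target: arc cost}.
    """
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph.get(u, {}).items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(queue, (d + w, v))
    return dist

def cheapest_explanation(graph, a, b):
    """Return (cost, regulator) of a cheapest explanation of the pair, or None."""
    vertices = set(graph) | {t for targets in graph.values() for t in targets}
    best = None
    for r in sorted(vertices):
        dist = dijkstra(graph, r)
        if a in dist and b in dist:
            # r == a or r == b gives the plain path case (one distance is 0).
            cost = dist[a] + dist[b]
            if best is None or cost < best[0]:
                best = (cost, r)
    return best

# Toy weighted TRN (hypothetical): R is a confident regulator, W a weak one.
toy = {"R": {"A": 1, "B": 1}, "W": {"A": 10, "B": 1}, "A": {"C": 1}, "Z": {"Y": 1}}
v_shape = cheapest_explanation(toy, "A", "B")
cascade = cheapest_explanation(toy, "A", "C")
nothing = cheapest_explanation(toy, "A", "Y")
```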

Results

LOMBARDE results are biased towards validated regulatory interactions

The putative TRN for E.coli contains 25,604 arcs, of which 444 are experimentally validated.
After applying LOMBARDE most of its arcs are discarded, keeping only 4,922 (19.2%).
However, among the validated arcs, LOMBARDE is less aggressive, keeping 295 (66.4%) of them.

LOMBARDE results are biased towards validated regulatory interactions

This shows that the output of LOMBARDE is biased towards experimentally validated regulations.
- A hypergeometric test confirms this bias, with an enrichment \(p\)-value under \(10^{-105}\).
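A sketch of how such an enrichment test can be computed with exact integer arithmetic, using the arc counts quoted on these slides. The one-sided upper tail is an assumption; the slides do not say which tail or software was used.

```python
from math import comb

def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` items without replacement."""
    upper = min(successes, draws)
    numerator = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    # Exact integers divided once at the end, so no intermediate underflow.
    return numerator / comb(population, draws)

# 25,604 putative arcs, 444 validated; LOMBARDE keeps 4,922 arcs, 295 validated.
p_value = hypergeom_tail(population=25_604, successes=444, draws=4_922, observed=295)
```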

LOMBARDE can complete partially known TRN

We also considered an extended TRN combining all E.coli validated regulations and all arcs in the putative TRN
Nearly 30% of non-validated arcs are replaced by a set of similar size where almost all arcs are validated.
There is a core of regulations preserved in LOMBARDE output

Even without experimental data results are good

Venn Diagram

\[\newcommand{\VVecoli}{E.coli}\newcommand{\GV}{G_V}\newcommand{\GAF}{\mathcal G}\newcommand{\GAV}{\mathcal G_e}\]

Summary of Results

Network	Explained co-expressions	Num. Vertices	Num. Arcs	Num. Arcs in \(\VVecoli\)
TRN built from E.coli	3,990 (6.5%)	823	1,652	1,652
E.coli ab initio \(\GAF\)	56,044 (91.1%)	2,390	25,604	444
Lombarde output \(\mathcal L\)	56,044 (91.1%)	2,336	4,922	295
E.coli extended \(\GAV\)	56,789 (92.3%)	2,434	26,812	1,652
Lombarde output \(\mathcal L_e\)	56,789 (92.3%)	2,370	4,374	1,520

LOMBARDE produces a topologically realistic TRN

Average degree (number of interactions per operon) of the putative TRN was 10.7
The value suggested in the literature is in the range 1.5 to 2.0.
Average degree of LOMBARDE output was 2.1.
This is also close to the average degree in the network of validated regulations for E.coli, 2.0.
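These averages can be recomputed from the vertex and arc counts in the summary table, assuming that average degree here means arcs per vertex (the ratio that matches the quoted figures):

```python
# (vertices, arcs) taken from the summary table of results
networks = {
    "validated": (823, 1_652),
    "putative": (2_390, 25_604),
    "lombarde": (2_336, 4_922),
}

def average_degree(vertices, arcs):
    """Arcs per vertex, the ratio that matches the figures quoted above."""
    return arcs / vertices

degrees = {name: round(average_degree(v, a), 1) for name, (v, a) in networks.items()}
```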

Degree distribution

The degree distribution (proportion of operons for each degree) in LOMBARDE output is similar to the network of validated regulations, meaning that they share some structural properties.

Global relevance of regulators can be evaluated using centrality indices

The network produced by LOMBARDE also contains most of the global regulators described for E.coli

Using the radiality index, we could rank the regulators on LOMBARDE output. Among the most relevant regulators in this network we recovered 10 of the known global regulators.

When LOMBARDE was applied to the extended input, the result recovers 18 of the known global regulators, 14 of them among the most relevant ones.
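For illustration, a radiality score can be sketched as below. This follows the usual definition (diameter plus one minus the distance, averaged over the other vertices); counting unreachable vertices as zero and following arcs from regulator to target are assumptions, since the slides do not say how these cases were handled. The toy network is hypothetical.

```python
from collections import deque

def bfs_distances(graph, source):
    """Number of arcs on the shortest directed path from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def radiality(graph):
    """Radiality of every vertex; unreachable vertices contribute 0 (assumption)."""
    vertices = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    all_dist = {v: bfs_distances(graph, v) for v in vertices}
    diameter = max(d for dist in all_dist.values() for d in dist.values())
    n = len(vertices)
    return {
        v: sum(diameter + 1 - d for w, d in all_dist[v].items() if w != v) / (n - 1)
        for v in vertices
    }

# Toy TRN (hypothetical): G acts like a global regulator, L like a local one.
toy_trn = {"G": ["A", "B"], "A": ["C"], "L": ["C"]}
scores = radiality(toy_trn)
ranking = sorted(scores, key=scores.get, reverse=True)
```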

Core of predicted E.coli regulators

Ranking of predicted E.coli regulators

Gene name	Ranking in literature	Ranking for radiality index in Lombarde output for \(\GAF\)	Ranking for radiality index in Lombarde output for \(\GAV\)
crp	1	25	1
ihfA	2	14	4
ihfB	3	16	5
fnr	4	1	6
fis	5	63	2
arcA	6	13	7
lrp	7	34	87
hns	8	—	14
narL	9	121	126
ompR	10	143	96
fur	11	7	8
phoB	12	9	25
cpxR	13	80	22
soxR	14	69	49
soxS	15	109	18
mtfA	16	—	—
cspA	17	—	42
rob	18	30	95
purR	19	39	47

Results for A.ferrooxidans

64 regulators identified
19 of them have no known function
Enrichment of nitrogen-related regulators
- Nitrogen fixation has been identified as a relevant factor in bioleaching (Levican et al., 2008)

Conclusion

LOMBARDE produces networks with realistic degree distributions, recovering and giving a central role to most of the global regulators described in literature.

In other words, LOMBARDE shapes the resulting network towards the structural characteristics of a true regulatory network.