November 4, 2015

My name is Andrés Aravena

Türkçe bilmiyorum (I don't speak Turkish) 😟
I am

  • Assistant Professor at IU Mol. Biology and Genomics Dpt.
  • Mathematical Engineer, U. of Chile
  • PhD Informatics, U Rennes 1, France
  • PhD Mathematical Modeling, U. of Chile
  • not a Biologist
  • but an Applied Mathematician who can speak “biologist language”

I’ve worked on

  • Big and small computers
  • Telecommunication Networks
  • Between 2003 and 2014 I was the chief research engineer
    • in the main bioinformatics group in my country (MATHomics)
    • in the top research center (Center for Mathematical Modeling)
    • in the top university of my country (University of Chile)

I come from Chile

Nearly 17 million people

University rankings are similar to Turkish ones

Spanish colony 500 years ago (so language is Spanish)

Independent Republic 200 years ago

First Latin American country to recognize the Turkish Republic

Everyday life very similar to Turkey

Chilean Economy: Exports

1st world producer of copper

2nd world producer of salmon

Fruits: peaches, grapes, apples, avocado

Wine: exported worldwide

Biotechnology can improve all these industries

Official data for 2014. Banco Central de Chile

Some projects I collaborated on

  • Grapes:
    • development related to seed and grape size
  • Wine:
    • quality control on exported wine,
    • avoid secondary fermentation
  • Salmon:
    • effect of diet on metabolism,
  • Mining:
    • copper extraction using bacteria

Copper is heated and melted

to separate it from other compounds

This is very expensive

… and polluting

(this smoke is sulphuric acid)

Solution: Bioleaching

The use of bacteria to extract elements from ore

Bioleaching is much better than melting copper

  • Reduced contamination
  • Cheaper

The goal is to understand and improve the involved bacteria so this technology can be used extensively

Enables building new mines

It is like discovering oil reserves for the country

Bioleaching bacteria

We had a research contract with the main mining company

State-owned, big enough to pay for long-term research

We focused mainly on 2 questions:

  • Monitoring the microbial community in the mine
  • Understanding how these bacteria do “mining”

Monitoring Environmental Community

We developed models and tools to design

  • qPCR primers
  • oligos for microarrays
  • statistical models
  • practical software tools

that enable quick and precise detection and quantification of the complex metagenome

Most of the results are still industrial secrets

Patented in: USA, South Africa, Australia, Mexico, Peru, China, Chile and Argentina

N. Ehrenfeld, A. Aravena, A. Reyes-Jara, N. Barreto, R. Assar, A. Maass, P. Parada, Design and use of oligonucleotide microarrays for identification of Biomining microorganisms. Advanced Materials Research 71-73 (2009) 155-158.

Understanding Bioleaching Bacteria

We sequenced the genome of the three most abundant species:

  • Acidithiobacillus ferrooxidans
  • Acidithiobacillus thiooxidans
  • Leptospirillum ferrooxidans

For the first one we also

  • hybridized nearly 100 microarrays
  • analyzed metabolome under several conditions

We focused on several questions.

  • One of the key ones was:
    understanding transcriptional regulation

  • Limitations:

    • Cell modification is not feasible
    • Knock out is not feasible

Our Approach

Modeling regulation by integrating genomic and transcriptomic data.

  • Microarray results for several stress conditions
    • identification of co-expressed genes
  • Annotated genomic sequence
    • Identify putative Transcription Factors and Binding Sites.

Using E.coli for model evaluation

Since A. ferrooxidans regulation data is scarce, we use E.coli as a test platform.

  • Genomic sequence available
    • 4523 genes
  • Differential expression data available:
    • 907 arrays in M3D
  • Several experimentally validated regulations described in the literature
    • RegulonDB 8.1 describes 2650 E.coli operons.
    • It describes 1652 regulations between operons.

Co-expression

Identification of sets of genes sharing similar behaviors through different environmental conditions, by

  • Linear Correlation, or
  • Mutual Information (many methods)

Result: Big influence graphs where 2 genes are connected when they are “similar”. Millions of edges

Problem: Confusion between direct and indirect relationships
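The linear-correlation variant can be sketched in a few lines. This is only an illustration, not the pipeline used in the project: the gene names, expression values and the 0.9 threshold are hypothetical, and real analyses use hundreds of arrays.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def influence_graph(expression, threshold=0.9):
    """Connect two genes when |correlation| reaches the (assumed) threshold."""
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[g1], expression[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


# Hypothetical expression values: one list per gene, one entry per condition.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 2.0],
}
edges = influence_graph(expression)
print(edges)
```

An edge here only says that two profiles are similar. It does not say whether one gene regulates the other, which is the confusion mentioned above.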

Significant co-expression

Several approaches exist to separate direct and indirect relationships:

Relevance Networks, ARACNe, C3NET, MRNET

Network size is reduced 10 to 20 times

Influence graphs

Abstraction describing empirical co-expression between genes

  • Several noise sources can affect the result
  • Unmeasurable changes can affect regulation
  • They often do not represent physical interactions
  • They cannot be interpreted causally

Even so, they convey information about the underlying transcriptional mechanisms.

Pairs of co-expressed operons

We assume that all genes in each operon are co-expressed. This simplifies the analysis

I used the Maximum Relevance/Minimum Redundancy criterion (MRNET) to determine co-expressed operons (the edges of the influence network)

Result: Influence network with 61,506 edges.
Only 6 of them are validated regulations

How to explain the other co-expressions?

Physical Interaction networks

A transcriptional regulatory network (TRN) is a physical model of the interactions

  • from regulators: genes coding for Transcription Factors

  • to target genes: those having a Binding Site for the TF in the promoter region

  • TRNs modulate the global expression of genes through regulatory cascades.

Model: Explaining co-expression

Co-expression is explained by the existence of a common regulator acting on them directly or indirectly through a regulatory cascade. Either:

  • There is a directed path from one gene to the other. The first is regulating the last by a regulatory cascade.
  • None of the genes is regulating the other but both are co-regulated by a third gene.
    • This case is represented in the network by a v-shape: two paths from a common regulator to each co-regulated gene.

v-shapes

  • For a given pair of co-regulated genes A and B, we want to find the possible explanations for their co-regulation.
  • Thus, we call an explanation of A and B any path from A to B or from B to A, or any set of arcs forming a v-shape between them.
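Both kinds of explanation can be checked with one reachability test: a pair is explained when some gene reaches both members, where a gene is considered to reach itself (this covers the directed-path case). The sketch below assumes the TRN is stored as a dictionary from regulator to targets; the gene names are hypothetical.

```python
def reachable_from(trn, start):
    """Genes reachable from `start` by following arcs, including `start`."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for target in trn.get(node, ()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def common_regulators(trn, gene_a, gene_b):
    """Genes from which both gene_a and gene_b can be reached.

    If gene_a itself appears in the result, there is a directed path
    from gene_a to gene_b; otherwise each result is the top of a v-shape.
    """
    genes = set(trn) | {t for targets in trn.values() for t in targets}
    return {r for r in genes if {gene_a, gene_b} <= reachable_from(trn, r)}


# Hypothetical toy network: regulator -> list of target genes.
trn = {
    "tf1": ["tf2", "geneB"],
    "tf2": ["geneA"],
    "geneA": ["geneC"],
}
print(common_regulators(trn, "geneA", "geneB"))  # v-shape through tf1
print(common_regulators(trn, "geneA", "geneC"))  # includes geneA: a direct path
```

A pair with an empty result has no explanation in the given network.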

Experimental regulations explain few co-expressions

The network of experimentally validated regulations described in RegulonDB only explains 3,990 (6.5%) of the 61,506 observed co-expressions.

  • Only a few co-expressions were explained by a single validated arc
  • The rest could only be explained through regulatory cascades.

Predicted TRN can explain most co-expressions

Building a putative TRN

Putative TRNs are usually huge

Putative TRNs are usually huge, due to the low specificity of methods based on the sequence.

  • Putative TRN has 25,604 regulations
  • Predicted regulations may not be real
  • But contains regulations that explain 91.1% of co-expressions
  • A realistic subnetwork can be chosen in a biologically meaningful way

This is the main motivation of our model

Lombarde

our model

Graphical Illustration

Overview of LOMBARDE

The LOMBARDE method requires for the studied organism the following input:

  • a putative TRN represented by a weighted directed graph \(\mathcal G\), with vertices corresponding to genes and arcs connecting regulator genes to regulated ones.
    • An arc connects two genes if the first gene codes for a transcription factor that presumably binds in the promoter region of the second gene.
    • The \(p\)-value \(p_i\) associated with this arc reflects the confidence level of this prediction.
  • a set \(\mathcal O\) of pairs of co-expressed genes.

Overview of LOMBARDE

  • In a first stage LOMBARDE assigns to each arc a discrete cost \(w_i\) such that more confident arcs have lower cost. \[w_i = F(p_i)\]
  • To do so, LOMBARDE discretizes the \(p\)-values into \(k\) categories.
  • This allows us to define the function \(Cost(S)\) for any subgraph \(S\) as the sum of the costs of its arcs. \[Cost(S)=\sum_{i\in S}w_i\]

Costs of arcs

To avoid “shortcuts” we use costs that grow exponentially

Better ten “good” steps at cost 1 than one “weak” step at cost 10
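A minimal sketch of such a cost function follows. The talk does not give the actual \(F\), so the thresholds, the number of categories (\(k = 4\) here) and the base 10 are assumptions chosen only to show the idea of exponentially growing costs.

```python
from bisect import bisect_left

# Hypothetical category boundaries: k = len(THRESHOLDS) + 1 = 4 categories.
THRESHOLDS = (1e-8, 1e-6, 1e-4)


def arc_cost(p_value, thresholds=THRESHOLDS, base=10):
    """Discretize a p-value into a category, then return base ** category.

    Category 0 holds the most confident arcs (smallest p-values),
    so their cost is 1; each weaker category costs `base` times more.
    """
    category = bisect_left(thresholds, p_value)
    return base ** category


def subgraph_cost(p_values):
    """Cost(S): the sum of the costs of the arcs in the subgraph."""
    return sum(arc_cost(p) for p in p_values)


cascade = [1e-9, 1e-9, 1e-9]   # three very confident arcs
shortcut = [1e-7]              # one arc from the next category
print(subgraph_cost(cascade), subgraph_cost(shortcut))
```

With these assumed values the three-step cascade is cheaper than the single weaker arc, so a minimum-cost search prefers it.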

Overview of LOMBARDE

  • In a second stage LOMBARDE deciphers the co-expression of the pair \((gene_{1}, gene_{2})\in \mathcal O\) by identifying a common regulator \(gene_{3}\) which is connected to both \(gene_{1}\) and \(gene_{2}\) via regulatory cascades of high confidence.

  • In graph terms, a subgraph \(S\) is a v-shape for the pair \((gene_{1}, gene_{2})\) if \(S\) is the union of two independent paths from \(gene_{3}\) (the common regulator) to \(gene_{1}\) and to \(gene_{2}\).

Confident explanations

  • A v-shape for \((gene_{1}, gene_{2})\) is said to be confident if it is of minimum cost among all the explanations for the pair.

  • Our model transforms a parsimony criterion into a graph minimization problem.

  • The output of LOMBARDE is a subgraph \(\mathcal L\) of \(\mathcal G\) built as the union of all confident explanations for each co-expressed pair in \(\mathcal O\).
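One way to picture the minimization is with shortest paths. This is a simplified sketch and not the LOMBARDE implementation: it runs Dijkstra's algorithm on the reversed arcs from each gene of the pair, then picks the regulator with the smallest summed cost. It returns a single regulator, whereas LOMBARDE keeps the union of all minimum-cost explanations. Gene names and costs are hypothetical.

```python
import heapq


def upstream_costs(arcs, target):
    """Cheapest cost from every gene that can reach `target` (Dijkstra on reversed arcs)."""
    reverse = {}
    for (source, dest), cost in arcs.items():
        reverse.setdefault(dest, []).append((source, cost))
    best = {target: 0}
    queue = [(0, target)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best[node]:
            continue  # stale queue entry
        for previous, weight in reverse.get(node, ()):
            candidate = cost + weight
            if candidate < best.get(previous, float("inf")):
                best[previous] = candidate
                heapq.heappush(queue, (candidate, previous))
    return best


def cheapest_explanation(arcs, gene1, gene2):
    """Return (regulator, cost) of one minimum-cost explanation, or None."""
    to1 = upstream_costs(arcs, gene1)
    to2 = upstream_costs(arcs, gene2)
    shared = set(to1) & set(to2)
    if not shared:
        return None
    regulator = min(shared, key=lambda r: (to1[r] + to2[r], r))
    return regulator, to1[regulator] + to2[regulator]


# Hypothetical putative TRN: (regulator, target) -> discrete cost.
arcs = {
    ("tfX", "geneA"): 1,
    ("tfX", "geneB"): 1,
    ("tfY", "geneA"): 10,
    ("tfY", "geneB"): 1,
    ("geneA", "geneB"): 100,
}
print(cheapest_explanation(arcs, "geneA", "geneB"))
```

Because a gene reaches itself at cost 0, the direct-path explanation (here geneA to geneB at cost 100) competes with the v-shapes in the same minimization.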

Results

LOMBARDE results are biased towards validated regulatory interactions

  • The putative TRN for E.coli contains 25,604 arcs, 444 of which are experimentally validated.

  • After applying LOMBARDE most of its arcs are discarded, keeping only 4,922 (19.2%).

  • However, among the validated arcs, LOMBARDE is less aggressive, keeping 295 (66.4%) of them.

LOMBARDE results are biased towards validated regulatory interactions

  • This shows that the output of LOMBARDE is biased towards experimentally validated regulations.
    • A hypergeometric test confirms this bias, with an enrichment \(p\)-value under \(10^{-105}\).
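The order of magnitude can be checked from the counts above. The sketch below assumes the test is the upper tail of a hypergeometric distribution: out of 25,604 arcs, 444 are validated, 4,922 are kept, and we ask how likely it is to keep at least 295 validated arcs by chance. The exact setup of the original test is not given in the talk, so this is an illustration only. Sums are done in log space to avoid underflow.

```python
from math import exp, lgamma, log


def log_comb(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log10_enrichment_p(total, validated, kept, kept_validated):
    """log10 of P(X >= kept_validated) for a hypergeometric X."""
    upper = min(validated, kept)
    log_terms = [
        log_comb(validated, k)
        + log_comb(total - validated, kept - k)
        - log_comb(total, kept)
        for k in range(kept_validated, upper + 1)
    ]
    peak = max(log_terms)
    log_p = peak + log(sum(exp(t - peak) for t in log_terms))
    return log_p / log(10)


log10_p = log10_enrichment_p(total=25604, validated=444, kept=4922, kept_validated=295)
print(f"log10(p-value) = {log10_p:.1f}")
```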

LOMBARDE can complete partially known TRN

  • We also considered an extended TRN combining all E.coli validated regulations and all arcs in the putative TRN
  • Nearly 30% of non-validated arcs are replaced by a set of similar size where almost all arcs are validated.
  • There is a core of regulations preserved in LOMBARDE output

Even without experimental data results are good

Venn Diagram

\[\newcommand{\VVecoli}{E.coli}\newcommand{\GV}{G_V}\newcommand{\GAF}{\mathcal G}\newcommand{\GAV}{\mathcal G_e}\]

Summary of Results

| Network | Explained co-expressions | Num. Vertices | Num. Arcs | Num. Arcs in \(\VVecoli\) |
|---|---|---|---|---|
| TRN built from E.coli | 3,990 (6.5%) | 823 | 1,652 | 1,652 |
| E.coli ab initio \(\GAF\) | 56,044 (91.1%) | 2,390 | 25,604 | 444 |
| Lombarde output \(\mathcal L\) | 56,044 (91.1%) | 2,336 | 4,922 | 295 |
| E.coli extended \(\GAV\) | 56,789 (92.3%) | 2,434 | 26,812 | 1,652 |
| Lombarde output \(\mathcal L_e\) | 56,789 (92.3%) | 2,370 | 4,374 | 1,520 |

LOMBARDE produces a topologically realistic TRN

  • Average degree (number of interactions per operon) of the putative TRN was 10.7
  • The value suggested in literature is in the range 1.5 to 2.0.
  • Average degree of LOMBARDE output was 2.1.
  • This is also close to the average degree in the network of validated regulations for E.coli, 2.0.
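These averages are consistent with the vertex and arc counts in the Summary of Results, assuming that the average degree is computed as the number of arcs divided by the number of vertices (the talk does not state the formula):

```python
# (vertices, arcs) taken from the Summary of Results table.
networks = {
    "putative TRN": (2390, 25604),
    "LOMBARDE output": (2336, 4922),
    "validated regulations": (823, 1652),
}

# Assumed definition: average degree = arcs / vertices.
average_degree = {
    name: arcs / vertices for name, (vertices, arcs) in networks.items()
}

for name, degree in average_degree.items():
    print(f"{name}: {degree:.1f}")
```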

Degree distribution

  • The degree distribution (proportion of operons for each degree) in LOMBARDE output is similar to the network of validated regulations, meaning that they share some structural properties.

Global relevance of regulators can be evaluated using centrality indices

The network produced by LOMBARDE also contains most of the global regulators described for E.coli

Using the radiality index, we could rank the regulators on LOMBARDE output. Among the most relevant regulators in this network we recovered 10 of the known global regulators.

When LOMBARDE was applied to the extended input, the result recovers 18 of the known global regulators, 14 of them among the most relevant ones.
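For reference, a common definition of radiality scores a node \(v\) as the average of \(D + 1 - d(v, w)\) over all other nodes \(w\), where \(D\) is the graph diameter and \(d\) the shortest-path distance. The talk does not say which variant was used, so the sketch below is an assumption: it treats the network as connected and undirected, and the toy graph is hypothetical.

```python
from collections import deque


def distances_from(adjacency, source):
    """Breadth-first shortest-path distances from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def radiality(adjacency):
    """Radiality of each node in a connected, undirected graph."""
    all_dist = {v: distances_from(adjacency, v) for v in adjacency}
    diameter = max(max(d.values()) for d in all_dist.values())
    n = len(adjacency)
    return {
        v: sum(diameter + 1 - d for w, d in all_dist[v].items() if w != v) / (n - 1)
        for v in adjacency
    }


# Hypothetical star network: one hub connected to three other genes.
adjacency = {
    "hub": ["gene1", "gene2", "gene3"],
    "gene1": ["hub"],
    "gene2": ["hub"],
    "gene3": ["hub"],
}
scores = radiality(adjacency)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0], scores)
```

Nodes that are close to many others get a high score, which is why global regulators are expected near the top of the ranking.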

Core of predicted E.coli regulators

Ranking of predicted E.coli regulators

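| Gene name | Ranking in literature | Ranking for radiality index in Lombarde output for \(\GAF\) | Ranking for radiality index in Lombarde output for \(\GAV\) |
|---|---|---|---|
| crp | 1 | 25 | 1 |
| ihfA | 2 | 14 | 4 |
| ihfB | 3 | 16 | 5 |
| fnr | 4 | 1 | 6 |
| fis | 5 | 63 | 2 |
| arcA | 6 | 13 | 7 |
| lrp | 7 | 34 | 87 |
| hns | 8 | | 14 |
| narL | 9 | 121 | 126 |
| ompR | 10 | 143 | 96 |
| fur | 11 | 7 | 8 |
| phoB | 12 | 9 | 25 |
| cpxR | 13 | 80 | 22 |
| soxR | 14 | 69 | 49 |
| soxS | 15 | 109 | 18 |
| mtfA | 16 | | |
| cspA | 17 | | 42 |
| rob | 18 | 30 | 95 |
| purR | 19 | 39 | 47 |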

Results for A.ferrooxidans

  • 64 regulators identified
  • 19 of them have no known function
  • Enrichment of nitrogen-related regulators
    • Nitrogen fixation has been identified as a relevant factor in bioleaching (Levican et al., 2008)

Conclusion

LOMBARDE produces networks with realistic degree distributions, recovering and giving a central role to most of the global regulators described in literature.

In other words, LOMBARDE shapes the resulting network towards the structural characteristics of a true regulatory network.