Data Science in Science

20 April 2016

My name is Andrés Aravena

Türkçe bilmiyorum (I don’t speak Turkish) 😟

I am

Assistant Professor at Istanbul University
Mathematical Engineer, U. of Chile
PhD Informatics, U Rennes 1, France
PhD Mathematical Modeling, U. of Chile
not a Biologist
but an Applied Mathematician who can speak “biologist language”

I’ve worked on

Very big and very small computers
Server and network management
Between 2003 and 2014 I was the chief research engineer
- of the main bioinformatics group in my country (MATHomics)
- in the top research center (Center for Mathematical Modeling)
- in the top university of my country (University of Chile)

I come from Chile

the last corner of the world

Chile

Nearly 17 million people

University rankings similar to Turkish ones

Spanish colony 500 years ago (so language is Spanish)

Independent Republic 200 years ago

First Latin American country to recognize the Turkish Republic

OECD member, same as Turkey

Everyday life very similar to Turkey

Life in Chile is very similar to life in Turkey

We even watch the same soap operas

The question is

How can you make your country better with data science?

Chilean Economy: Exports

1st world producer of copper

2nd world producer of salmon

Fruits: peaches, grapes, apples, avocado

Wine: exported worldwide

Bioinformatics can improve all these industries

Official data for 2014. Banco Central de Chile

Science for understanding biological processes

Grape (Sultaniye):
- control seed and grape size without hormones
Wine:
- quality control on exported wine,
- avoid secondary fermentation

Salmon genomic projects

effect of diet on metabolism,
selection of stress tolerant families.
Whole genome sequencing
- 10M dollars project,
- harder than human genome
- Chile, Canada and Norway

Mining Industry

Copper is heated and melted

to separate it from other compounds

This is very expensive …

… and polluting

(this smoke is sulphuric acid)

Solution: Bioleaching

The use of bacteria to extract elements from ore

Bioleaching is much better than melting copper

Reduced contamination
Cheaper

The goal is to understand and improve the involved bacteria so this technology can be used extensively

Enables building new mines

It is like discovering petrol reserves for the country

Understanding the mining bacteria

my previous life

For 10 years I worked at MATHomics, a bioinformatics lab

In the main Chilean university
For the biggest copper company (state owned)

Developing tools (hardware and software) to

detect, identify and quantify these bacteria
understand how they do their job
try to plan how to improve them

This was a national strategic project

so the president visited us

Current project

What I am doing in Turkey

Bio-identification

One of the key tools we learned was how to identify microorganisms in complex samples

We used it to

monitor copper mining bacteria
control wine quality
detect parasites of salmon

We got 5 patents granted in 12 countries

Metagenomics

genomics of the ecosystem

Microorganisms live in the most diverse environments. They are the key to:

develop new biotechnology
manage our natural resources
improve our health
understand our past

But only 5% of them can be grown in the lab

Diversity of microorganisms

Microorganisms can live in extreme environments

To survive there they produce proteins that can have industrial applications
We want to identify these proteins

Microorganisms are essential to human life

90% of the cells in your body are not “human”
Most of them are essential to our health
Our digestion depends on them

They are the foundation of all ecosystems

Big data in biology

Since we cannot isolate them,

we read all DNA from the environment
we cluster similar sequences together
and then we analyze them
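
The clustering step can be sketched in a few lines. This is only a toy, assuming that two sequences are "similar" when they share most of their short sub-words (k-mers); the function names, the value of k and the threshold are illustrative choices, not the method used in the projects mentioned here.

```python
def kmers(seq, k=3):
    """Return the set of all sub-words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(reads, k=3, threshold=0.5):
    """Put each read in the first cluster whose first member is similar enough."""
    clusters = []  # list of (k-mer set of the first member, list of reads)
    for read in reads:
        profile = kmers(read, k)
        for representative, members in clusters:
            if jaccard(profile, representative) >= threshold:
                members.append(read)
                break
        else:
            # no existing cluster is close enough: start a new one
            clusters.append((profile, [read]))
    return [members for _, members in clusters]


# Hypothetical reads, invented for the example
reads = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCTTGG", "TTGGCCTTGA"]
clusters = greedy_cluster(reads)
print(clusters)
```

With real data the reads number in the millions, so production tools rely on indexing and sketching instead of comparing every read against every cluster.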

Currently we use this approach to explore archeological data

what did our ancestors eat?
which diseases did they have?
what was the climate then?

Everybody can do it

There is already a lot of data available, including

Oceanic samples worldwide (the latest paper yielded 7.2 terabytes)
Human gut microbiota
Extreme environments: hot, cold and acid

Most scientific journals require depositing the raw data in public repositories

NCBI: http://www.ncbi.nlm.nih.gov/
EBI: http://www.ebi.ac.uk/

Molecular Biology 101

What are we talking about

Molecular Biology

Focus on things that cannot be observed under the microscope

Small molecules (metabolites)
- ~100 atoms
Large molecules (proteins)
- ~10,000 to 100,000 atoms
Nucleic acids (DNA, RNA)
- ~10¹⁰ atoms

Proteins are complex molecules

Proteins are biology’s workhorses: its “nanomachines.”
Proteins help your body break down food into energy, regulate your moods, and fight disease.
To carry out these important functions, they assemble themselves, or “fold.”

https://folding.stanford.edu/

Proteins are complex molecules

made of simple pieces

Each protein is a chain of amino-acids (LEGO pieces)
There are 20 types of amino-acids
We can abstract the chemical nature of these molecules and look at them as sequences of symbols
Each protein corresponds to a word over an alphabet of 20 symbols
Length between 20 and 1000 letters
In the cell the protein will fold and adopt a specific shape
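
As a small illustration of the "word over an alphabet" view, the sketch below checks whether a string uses only the 20 standard one-letter amino-acid codes and counts how many different words of a given length exist. The function names and the example peptide are made up for this example.

```python
# One-letter codes of the 20 standard amino acids
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def is_protein(word):
    """True if word is non-empty and uses only the 20 standard letters."""
    return len(word) > 0 and all(letter in AMINO_ACIDS for letter in word)


def count_possible(length):
    """Number of different words of the given length: 20 ** length."""
    return len(AMINO_ACIDS) ** length


print(is_protein("MKTAYIAKQR"))   # hypothetical peptide, only valid letters
print(is_protein("MKTB"))         # "B" is not one of the 20 standard codes
print(count_possible(20))         # number of words for the shortest proteins
```

The count grows so quickly with the length that trying every word is impossible, which is one reason why the existing proteins have to be read from nature.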

Approach 1: Computational Biology

Given the sequence of a protein

what is the shape of the molecule?
how does it interact with other molecules?

Optimization problem: Find the 3D position of each atom that minimizes the energy
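
To give a flavour of this optimization problem, here is a deliberately tiny sketch: four "atoms" on a line, an invented energy that is zero when consecutive atoms are exactly one unit apart, and plain gradient descent. Real folding uses 3D coordinates and far richer energy functions; the energy, step size and starting positions below are all assumptions made for the example.

```python
def energy(x, rest=1.0):
    """Toy energy: squared deviation of each bond length from its rest length."""
    return sum((x[i + 1] - x[i] - rest) ** 2 for i in range(len(x) - 1))


def gradient(x, rest=1.0):
    """Partial derivative of the toy energy with respect to each position."""
    g = [0.0] * len(x)
    for i in range(len(x) - 1):
        stretch = x[i + 1] - x[i] - rest
        g[i] -= 2 * stretch
        g[i + 1] += 2 * stretch
    return g


def minimize(x, step=0.1, iterations=500):
    """Gradient descent: repeatedly move every atom against the gradient."""
    x = list(x)
    for _ in range(iterations):
        g = gradient(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


start = [0.0, 0.2, 3.0, 3.1]   # arbitrary starting positions
final = minimize(start)
print(energy(start), energy(final))
print([round(b - a, 3) for a, b in zip(final, final[1:])])  # bond lengths
```

Even this toy needs hundreds of iterations for four atoms; a protein has thousands of atoms in three dimensions and an energy with many local minima.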

It requires a huge amount of computing power

You can do it at home

Stanford University has a distributed computing initiative at

https://folding.stanford.edu/

(Much like SETI@home)

There is also a game to do it as a puzzle

https://fold.it/portal/

But how do you get the protein sequence? Where do they come from?

The Central Dogma of Molecular Biology

Some events can trigger production of RNA and proteins.
It is usually assumed that protein concentration is proportional to RNA concentration.

DNA is a “program” for making proteins

Each chromosome is a double-chain of nucleotides
There are 4 types of nucleotides: A, C, G, T
Again we handle it as a chain of symbols
The length is between 10⁵ and 10⁸ letters
There is an A on one strand if and only if there is a T on the other. Same with C and G.
Some sub-words of DNA encode the “recipe” to make proteins
- these sub-words are called genes
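
The pairing rule means that one strand determines the other. A minimal sketch, with function names chosen for this example: since the two strands are read in opposite directions, the second strand is usually written as the reverse of the complement.

```python
# Pairing rule: A with T, C with G
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Letter-by-letter partner of each nucleotide, in the same order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """The opposite strand, written in its own reading direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Applying `reverse_complement` twice returns the original strand, so storing one strand is enough.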

Sequencing genomes is cheap

Technical problems managed with data science

Genome assembly: a graph traversal problem. Each “piece” is a vertex of a graph. There is an edge when two “pieces” overlap. How do we traverse the graph to get the “text”?
Finding genes in the DNA: find the “words” in the text. There are no “spaces”. This is modelled by a hidden Markov model.
What does each gene do: once we have found all the “words”, what do they mean? Besides protein folding we have other tools.
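
The assembly idea can be sketched as follows. This is a simplified greedy traversal, assuming error-free pieces that are all different and that overlap by at least three letters; real assemblers must handle sequencing errors, repeats and both strands. The function names and the three pieces are invented for the example.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that is a prefix of b (0 if too short)."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_graph(pieces, min_length=3):
    """Vertices are pieces; an edge a -> b is labelled with the overlap length."""
    graph = {p: {} for p in pieces}
    for a in pieces:
        for b in pieces:
            if a != b:
                n = overlap(a, b, min_length)
                if n > 0:
                    graph[a][b] = n
    return graph


def greedy_traverse(graph):
    """Start at a piece nobody overlaps into, then follow the largest overlaps."""
    targets = {b for edges in graph.values() for b in edges}
    starts = [p for p in graph if p not in targets]
    current = starts[0] if starts else next(iter(graph))
    text, visited = current, {current}
    while True:
        candidates = {b: n for b, n in graph[current].items() if b not in visited}
        if not candidates:
            return text
        nxt = max(candidates, key=candidates.get)
        text += nxt[candidates[nxt]:]  # append only the part that does not overlap
        visited.add(nxt)
        current = nxt


pieces = ["GCGTGCA", "TGCAATC", "ATGGCGT"]   # given in scrambled order
assembled = greedy_traverse(build_graph(pieces))
print(assembled)  # ATGGCGTGCAATC
```

The greedy choice is only a heuristic: when the text contains repeats, several traversals are possible and choosing among them is the hard part of the problem.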

Homology

Most genes are similar (homologous) to genes in other species. This homology is determined by an edit distance.

We can “transform” a gene into another by

substitution: ACGT → ACTT
insertion: ACGT → ACTGT
deletion: ACGT → ACT

Each edit has a cost. The distance is the minimal total cost.
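
The minimal total cost can be computed with dynamic programming. The sketch below assumes, for illustration, that every substitution, insertion and deletion costs 1; real tools use costs that depend on which letters are involved.

```python
def edit_distance(a, b, sub_cost=1, indel_cost=1):
    """Minimal total cost of the edits that transform a into b."""
    # previous[j] = distance between the current prefix of a and b[:j]
    previous = [j * indel_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * indel_cost]
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            substitute = previous[j - 1] + (0 if same else sub_cost)
            delete = previous[j] + indel_cost      # drop a[i-1]
            insert = current[j - 1] + indel_cost   # add b[j-1]
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


print(edit_distance("ACGT", "ACTT"))   # one substitution -> 1
print(edit_distance("ACGT", "ACTGT"))  # one insertion    -> 1
print(edit_distance("ACGT", "ACT"))    # one deletion     -> 1
```

The running time is proportional to the product of the two lengths, which is why searching large databases needs faster heuristics.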

The significance of the homology is the probability of observing a distance at least this small under a “null hypothesis”
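
One simple way to estimate such a probability is by simulation. The sketch below assumes a particular null hypothesis, chosen only for illustration: the second sequence is a random shuffle of its own letters. It then counts how often a shuffled sequence is at least as close as the real one. The function names, the number of shuffles and the example sequence are all assumptions.

```python
import random


def edit_distance(a, b):
    """Edit distance with unit costs (dynamic programming, one row at a time)."""
    previous = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        current = [i]
        for j in range(1, len(b) + 1):
            substitute = previous[j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            current.append(min(substitute, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1]


def empirical_p_value(a, b, shuffles=200, seed=0):
    """Fraction of shuffled versions of b that are at least as close to a as b is."""
    rng = random.Random(seed)          # fixed seed so the result is reproducible
    observed = edit_distance(a, b)
    letters = list(b)
    as_small = 0
    for _ in range(shuffles):
        rng.shuffle(letters)
        if edit_distance(a, "".join(letters)) <= observed:
            as_small += 1
    # the +1 counts the observed sequence itself, so the estimate is never 0
    return (as_small + 1) / (shuffles + 1)


gene = "ATGGCGTGCAATCGGATTACAGC"      # hypothetical sequence
p_value = empirical_p_value(gene, gene)
print(p_value)
```

A small value means that a distance this small is unlikely to appear by chance, so the similarity is probably due to common ancestry.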

Modeling Metabolism

with Flux Balance Analysis

Some of the proteins are part of chemical reactions. They are called enzymes

Each enzyme catalyzes a reaction, with a known stoichiometry

Every reaction gives an equation

Add boundary conditions and you have a model to predict:

how the cell reacts to environmental changes
which genes have to change to increase production
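
A toy version of such a model, under invented numbers: three reactions in a row (uptake of A, conversion of A into B, export of B), each contributing one column of a stoichiometric matrix, with an upper bound on each flux as boundary condition. The sketch only checks the steady-state balance and reads off the best flux for this linear pathway; a real Flux Balance Analysis solves a linear program over thousands of reactions.

```python
# Rows are metabolites (A, B); columns are reactions:
#   R1: -> A      R2: A -> B      R3: B ->
S = [
    [1, -1, 0],   # A is produced by R1 and consumed by R2
    [0, 1, -1],   # B is produced by R2 and consumed by R3
]
upper = [10.0, 8.0, 1000.0]   # hypothetical maximum flux of each reaction


def net_production(S, v):
    """Net production rate of each metabolite for the flux vector v (S times v)."""
    return [sum(s * f for s, f in zip(row, v)) for row in S]


def is_steady_state(S, v, tol=1e-9):
    """True when every metabolite is produced exactly as fast as it is consumed."""
    return all(abs(x) <= tol for x in net_production(S, v))


# In a linear pathway the balance forces all fluxes to be equal,
# so the largest feasible export is set by the tightest bound.
best = min(upper)
v = [best, best, best]
print(is_steady_state(S, v), best)
print(net_production(S, [10.0, 8.0, 8.0]))  # A accumulates: not a steady state
```

Changing a bound mimics an environmental change or a modified gene, and the new optimum shows how production would respond.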

What’s next?

There are many other areas where data science plays a role

clustering genes by their expression
deciphering the regulation of genes
finding which metabolites are produced and how
understanding the interaction between different species
simulation to design cheaper experiments

We have an opportunity

Technological advance has changed the way science is done

But not everybody realizes it!

We can use data science to have a real impact

We can resolve mysteries of Nature

We can improve the quality of life

It is not rocket science

It is not heart surgery

Thank you

andres.aravena@gmail.com