20 April 2016

## My name is Andrés Aravena

Türkçe bilmiyorum (I don't speak Turkish) 😟

I am

• Assistant Professor at Istanbul University
• Mathematical Engineer, U. of Chile
• PhD Informatics, U Rennes 1, France
• PhD Mathematical Modeling, U. of Chile
• not a Biologist
• but an Applied Mathematician who can speak “biologist language”

## I’ve worked on

• Very big and very small computers
• Server and network management
• Between 2003 and 2014 I was the chief research engineer
  • of the main bioinformatics group in my country (MATHomics)
  • at the top research center (Center for Mathematical Modeling)
  • at the top university in my country (University of Chile)

## Chile

Nearly 17 million people

University rankings similar to Turkish ones

Spanish colony 500 years ago (so the language is Spanish)

Independent Republic 200 years ago

First Latin American country to recognize the Turkish Republic

OECD member, same as Turkey

Everyday life very similar to Turkey

## How can you make your country better

### The question is

with data science?

## Chilean Economy: Exports

[Figure: Chilean exports]

World's largest producer of copper

World's second largest producer of salmon

Fruits: peaches, grapes, apples, avocados

Wine: exported worldwide

Bioinformatics can improve all these industries

Official data for 2014. Banco Central de Chile

## Science for understanding biological processes

• Grape (Sultaniye):
  • control seed and grape size without hormones
• Wine:
  • quality control on exported wine,
  • avoid secondary fermentation

## Salmon genomic projects

• effect of diet on metabolism,
• selection of stress-tolerant families.
• Whole genome sequencing
  • 10 million dollar project,
  • harder than the human genome
  • Chile, Canada and Norway

## Copper is heated and melted

### to separate it from other compounds

This is very expensive

## Solution: Bioleaching

### The use of bacteria to extract elements from ore

Bioleaching is much better than melting copper

• Reduced contamination
• Cheaper

The goal is to understand and improve the bacteria involved so that this technology can be used extensively

Enables building new mines

It is like discovering oil reserves for the country

## Understanding the mining bacteria

### my previous life

For 10 years I worked at MATHomics, a bioinformatics lab

• In the main Chilean university
• For the biggest copper company (state-owned)

Developing tools (hardware and software) to

• detect, identify and quantify these bacteria
• understand how they do their job
• try to plan how to improve them

## Bio-identification

One of the key skills we developed was identifying microorganisms in complex samples

We used it to

• monitor copper mining bacteria
• control wine quality
• detect salmon parasites

We were granted 5 patents in 12 countries

## Metagenomics

### genomics of the ecosystem

Microorganisms live in the most diverse environments. They are the key to:

• develop new biotechnology
• manage our natural resources
• improve our health
• understand our past

But only 5% of them can be grown in the lab

## Diversity of microorganisms

Microorganisms can live in extreme environments

• To survive there they produce proteins that can have industrial applications
• We want to identify these proteins

Microorganisms are essential to human life

• By a widely quoted estimate, 90% of the cells in your body are not “human” (more recent estimates put it closer to half)
• Most of them are essential to our health
• Our digestion depends on them

They are the foundation of all ecosystems

## Big data on biology

Since we cannot isolate them,

• we read all DNA from the environment
• we cluster similar sequences together
• and then we analyze them
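The "cluster similar sequences" step can be illustrated with a minimal sketch. It assumes that similarity is measured by shared k-mers (short substrings of length k) and uses a simple greedy rule. The function names, the k-mer length and the threshold are all hypothetical choices made for illustration, not the method used in the projects described here.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_reads(reads, k=3, threshold=0.5):
    """Greedy clustering (an assumed, simplified scheme).

    Each read joins the first cluster whose representative shares enough
    k-mers with it; otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative k-mer set, member reads)
    for read in reads:
        profile = kmers(read, k)
        for rep_profile, members in clusters:
            if jaccard(profile, rep_profile) >= threshold:
                members.append(read)
                break
        else:
            # No existing cluster was similar enough
            clusters.append((profile, [read]))
    return [members for _, members in clusters]


# Toy "environmental sample" with reads from two different organisms
reads = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCTTGG", "TTGGCCTTGC"]
groups = cluster_reads(reads)
print(groups)
# [['ACGTACGTAC', 'ACGTACGTAA'], ['TTGGCCTTGG', 'TTGGCCTTGC']]
```

Real samples contain millions of reads, so practical tools need indexing and sampling tricks on top of this idea; the sketch only shows the principle.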

Currently we use this approach to explore archeological data

• what did our ancestors eat?
• which diseases did they have?
• what was the climate then?

## Everybody can do it

There is already a lot of data available, including

• Oceanic samples worldwide (the latest paper yielded 7.2 terabytes)
• Human gut microbiota
• Extreme environments: hot, cold and acid

Most scientific journals require depositing the raw data in public repositories

## Molecular Biology

Focus on things that cannot be observed under the microscope

• small molecules (Metabolites)
  • ~100 atoms
• large molecules (Proteins)
  • ~10,000 to 100,000 atoms
• Nucleic acids (DNA, RNA)
  • ~10^10 atoms

## Proteins are complex molecules

• Proteins are biology’s workhorses, its “nanomachines.”
• Proteins help your body break down food into energy, regulate your moods, and fight disease.
• To carry out these important functions, they assemble themselves, or “fold.”

Source: https://folding.stanford.edu/

## Proteins are complex molecules

### made of simple pieces

• Each protein is a chain of amino acids (LEGO pieces)
• There are 20 types of amino acids
• We can abstract the chemical nature of these molecules and look at them as sequences of symbols
• Each protein corresponds to a word in an alphabet of 20 symbols
• Length between 20 and 1000 letters
• In the cell the protein will fold and adopt a specific shape

## Approach 1: Computational Biology

Given the sequence of a protein

• what is the shape of the molecule?
• how does it interact with other molecules?

Optimization problem: Find the 3D position of each atom that minimizes the energy

It requires a huge amount of computing power

## The Central Dogma of Molecular Biology

• Some events can trigger production of RNA and proteins.
• It is usually assumed that protein concentration is proportional to RNA concentration.

## DNA is a “program” for making proteins

• Each chromosome is a double-chain of nucleotides
• There are 4 types of nucleotides: A, C, G, T
• Again we handle it as a chain of symbols
• The length is between 10^5 and 10^8 letters
• There is an A on one strand if and only if there is a T on the other. The same holds for C and G.
• Some sub-words of DNA encode the “recipe” to make proteins
• these sub-words are called genes
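The pairing rule (A with T, C with G) means that one strand fully determines the other. A minimal sketch, treating DNA as a plain string of symbols; the function names are hypothetical, and the reversal step relies on the fact, not stated on the slide, that the two strands are read in opposite directions.

```python
# Pairing rule: A pairs with T, C pairs with G
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base paired with each position of the strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Return the opposite strand, written in its own reading direction.

    Assumption: the two strands run in opposite directions, so the
    complement is reversed before it is read.
    """
    return complement(strand)[::-1]


print(complement("AACGT"))          # TTGCA
print(reverse_complement("AACGT"))  # ACGTT
```

Because a gene can sit on either strand, tools that search for genes usually scan both the sequence and its reverse complement.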

## Technical problems managed with data science

• Genome Assembly: a graph traversal problem. Each “piece” is a vertex of a graph. There is an edge when the two “pieces” overlap. How do we traverse the graph to get the “text”?

• Finding genes in the DNA: Find the “words” in the text. There are no “spaces”. Modeled with a hidden Markov model

• What does each gene do: If we have found all “words”, what is the meaning? Besides protein folding we have other tools
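The genome assembly idea above (pieces as vertices, overlaps as edges) can be sketched in a few lines. This is a toy illustration under stated assumptions: the pieces are error-free, and the graph is "traversed" with a greedy heuristic that always merges the pair with the largest overlap. Real assemblers use more careful traversals; all names and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def build_overlap_graph(pieces, min_len=3):
    """Map each edge (i, j) to its overlap length.

    Vertices are piece indices; an edge exists when piece i overlaps piece j.
    """
    graph = {}
    for i, a in enumerate(pieces):
        for j, b in enumerate(pieces):
            if i != j:
                size = overlap(a, b, min_len)
                if size > 0:
                    graph[(i, j)] = size
    return graph


def greedy_assemble(pieces, min_len=3):
    """Greedy heuristic: repeatedly merge the pair with the largest overlap."""
    pieces = list(pieces)
    while len(pieces) > 1:
        graph = build_overlap_graph(pieces, min_len)
        if not graph:
            break  # nothing overlaps any more
        (i, j), size = max(graph.items(), key=lambda item: item[1])
        merged = pieces[i] + pieces[j][size:]
        pieces = [p for idx, p in enumerate(pieces) if idx not in (i, j)]
        pieces.append(merged)
    return pieces


# Three overlapping pieces of the text "ATGCGTACGTTAGC", given out of order
pieces = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(build_overlap_graph(pieces))  # {(1, 2): 4, (2, 0): 4}
print(greedy_assemble(pieces))      # ['ATGCGTACGTTAGC']
```

The greedy rule can fail when the text contains repeats, which is one reason assembly is treated as a genuine graph problem.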

## Homology

Most genes are similar (homologous) to genes in other species. This homology is measured by an edit distance.

We can “transform” a gene into another by

• substitution: ACGT → ACTT
• insertion: ACGT → ACTGT
• deletion: ACGT → ACT

Each edit has a cost. The distance is the minimal total cost.

The significance of the homology is the probability of observing a distance at least that small under a “null hypothesis”
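The minimal total cost can be computed with the classic dynamic programming table. A minimal sketch, assuming a fixed cost per operation (1 by default); the function and parameter names are hypothetical, and real homology tools use more elaborate scoring schemes than these uniform costs.

```python
def edit_distance(a, b, sub_cost=1, ins_cost=1, del_cost=1):
    """Minimal total cost of transforming sequence a into sequence b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = cost of turning the first i letters of a
    # into the first j letters of b
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * del_cost   # delete every letter of a
    for j in range(1, cols):
        dist[0][j] = j * ins_cost   # insert every letter of b
    for i in range(1, rows):
        for j in range(1, cols):
            change = 0 if a[i - 1] == b[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j - 1] + change,  # match or substitution
                dist[i - 1][j] + del_cost,    # deletion
                dist[i][j - 1] + ins_cost,    # insertion
            )
    return dist[-1][-1]


# The three examples from the slide each need a single edit
print(edit_distance("ACGT", "ACTT"))   # 1 (substitution)
print(edit_distance("ACGT", "ACTGT"))  # 1 (insertion)
print(edit_distance("ACGT", "ACT"))    # 1 (deletion)
```

The table has one cell per pair of positions, so the running time grows with the product of the two lengths.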

## Modeling Metabolism

### with Flux Balance Analysis

Some of the proteins are part of chemical reactions. They are called enzymes

Each enzyme catalyzes a reaction, with a known stoichiometry

Every reaction gives an equation

Add boundary conditions and you have a model to predict:

• how the cell reacts to environmental changes
• which genes have to change to increase production
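As a sketch of how these pieces fit together, the usual flux balance formulation is a linear program. The symbols are not in the slides and are introduced here for illustration: $S$ is the stoichiometric matrix (one row per metabolite, one column per reaction), $v$ is the vector of reaction rates (fluxes), $l$ and $u$ are the bounds that encode the boundary conditions, and $c$ selects the quantity to optimize. The choice of objective, for example growth or production of one compound, is an assumption of the model.

$$
\max_{v} \; c^{\top} v
\quad \text{subject to} \quad
S\,v = 0, \qquad l \le v \le u
$$

Each row of $S\,v = 0$ is the equation for one metabolite: at steady state, what is produced equals what is consumed. Changing $l$ and $u$ simulates a change in the environment, and forcing one flux to zero simulates switching off the gene of the corresponding enzyme.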

## What’s next?

There are many other areas where data science plays a role

• clustering genes by their expression
• deciphering the regulation of genes
• finding which metabolites are produced and how
• understanding the interaction between different species
• simulation to design cheaper experiments

## We have an opportunity

Technological advances have changed the way science is done

But not everybody realizes it!

We can use data science to have a real impact

We can resolve mysteries of Nature

We can improve the quality of life