Molecular Biology in the Information Era

My name is Andrés Aravena

I don’t speak Turkish 😟

I am

New Assistant Professor at the Molecular Biology and Genomics Department
Mathematical Engineer, U. of Chile
PhD Informatics, U Rennes 1, France
PhD Mathematical Modeling, U. of Chile
not a Biologist
but an Applied Mathematician who can speak “biologist language”

I will speak about

What I’ve done before so you can understand why I’m here
What I’m doing now at Istanbul University
What I foresee from my “outsider” point of view

The Past, Present and Future

Facts, opinion and guess

I’ve worked on

Big and small computers
Telecommunication Networks
Between 2003 and 2014 I was the chief research engineer
- in the main bioinformatics group
- in the top research center (CMM)
- in the top university (University of Chile)
- of my country

I come from Chile

world

Chile

Small country of ~17 million people

University rankings are similar to Turkish ones

Spanish colony 500 years ago (so language is Spanish)

Independent Republic 200 years ago

First Latin American country to recognize the Turkish Republic

OECD member

Everyday life very similar to Turkey

Chileans like Turkish soap operas

The most successful soap opera last year was Bin Bir Gece

Chilean Economy: Exports

1st world producer of copper

2nd world producer of salmon

Fruits: peaches, grapes, apples, avocado

Wine: exported worldwide

Official data for 2014

How can we improve these industries

using Molecular Biology and Bioinformatics?

The natural question was

Fruits

Peach and Grapes

Gene expression analysis for industrial applications:

Peach: response to cold stress
Grape: development related to seed and grape size (Sultaniye)

Fruits

Peach

http://www7.uc.cl/sw_educ/agronomia/desorden_fruta/html/fichas/durazno/harino/har_origen.htm

Chilean peach (Prunus persica) is exported to the USA and Europe

Cold storage can change the texture of the fruit

We identified the genes involved in the stress response by:

EST sequencing
Microarrays

Fruits

Grapes

For wine and to eat as dessert, like Sultaniye

We want big grapes and small seeds

but grape size depends on hormones produced by the seed

Which genes are involved in seed and grape growth?

Strategy:

Gene expression analysis using microarrays,
Whole genome sequencing.

Fishes

Salmon

Salmon

Farmed salmon are fed cheap vegetable protein, but wild salmon eat animal protein

How is the salmon’s metabolism affected by the diet? Which genes change their expression because of the change in food?

Gene expression analysis using microarrays
Fish selection for breeding using microarrays (patent pending)

Fishes

Salmon Genomic Sequence

… and sequencing of the whole Salmo salar genome

(a 10 million dollar project)

Wine

Chilean wine travels long distances to final markets

Any yeast contamination means big economic losses (people stop buying all Chilean brands)

Quality control is usually done by growing samples for 3 days. But time is expensive: there is a penalty for shipping delays

We designed a qPCR method for rapid detection of yeast contamination

It is currently used by one major wine producer in Chile. It may be sold to Roche.

Mining industry

molecular biology to extract copper

A little chemistry: copper is part of a compound with sulfur and iron. Ferric iron (Fe³⁺) in acid solution separates it.

Cu₂S + 4Fe³⁺ ⟶ 2Cu²⁺ + 4Fe²⁺ + S

Resulting Cu²⁺ is soluble and is recovered.

But once all the Fe³⁺ has been reduced to Fe²⁺, the reaction stops

There are bacteria that “eat” the electrons (e⁻) and keep the reaction going

Fe²⁺ ⟶ Fe³⁺ + e⁻

Why is it important?

The biological method is much better than the standard one

Reduced contamination
Cheaper

The goal is to understand and improve the bacteria involved so that this technology can be used extensively

It enables building new mines

It is like discovering oil reserves for the country

Most of the results are still industrial secrets

We had a research contract with the main mining company

State owned, big enough to pay for long term research

Few papers, many patents

Bioidentification

Monitoring the presence of good bacteria

We need to control the “ecosystem” in the mine

Molecular Biology methods are fast, sensitive and reliable

They can be used in situ: a metagenomic approach, with no culture needed

Key problem: Design probes that match a taxonomic branch, not a specific strain

The probes should be tolerant to mutations that occur in environmental samples with many strains

Classical tools don’t work at large scale
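
The probe-design problem above can be illustrated with a toy sketch. This is not the method used in the project, which needed a super-computer; it is a brute-force search that only works on tiny inputs. It takes candidate k-mers from one target sequence, keeps those that match every target with at most a given number of mismatches, and rejects those that also match a non-target. The function names, sequences and thresholds are all hypothetical.

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def matches(probe, sequence, max_mismatches):
    """True if the probe aligns somewhere in the sequence within the mismatch budget."""
    k = len(probe)
    return any(
        hamming(probe, sequence[i:i + k]) <= max_mismatches
        for i in range(len(sequence) - k + 1)
    )


def design_probes(targets, non_targets, k, max_mismatches):
    """Candidate probes for a group: they hit all targets and no non-target."""
    reference = targets[0]
    candidates = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    return sorted(
        p for p in candidates
        if all(matches(p, t, max_mismatches) for t in targets)
        and not any(matches(p, n, max_mismatches) for n in non_targets)
    )


# Hypothetical toy data: two strains of the same branch and one outsider.
targets = ["ACGTACGTGGA", "ACGTACCTGGA"]
non_targets = ["TTTTGGGGCCCC"]

tolerant = design_probes(targets, non_targets, k=8, max_mismatches=1)
strict = design_probes(targets, non_targets, k=8, max_mismatches=0)
print(len(tolerant), len(strict))
```

With one mismatch allowed the probes cover both strains; with exact matching the single mutation between the strains eliminates every candidate, which is why tolerance to mutations matters.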

Design of probes for complex samples

I designed and built a solution using a super-computer

The calculation took one day on 32 processors (about one processor-month)

Resulting probes worked as expected

They can be used in qPCR or in microarrays.

Automatic Interpretation of Results

using a Statistical Classification Model

Publications

The microarray was published in N. Ehrenfeld, A. Aravena, A. Reyes-Jara, N. Barreto, R. Assar, A. Maass, P. Parada, Design and use of oligonucleotide microarrays for identification of Biomining microorganisms. Advanced Materials Research 71-73 (2009) 155-158.

Patents

The method and the probes have been patented in

USA, Number: US 7 853 408 B2, Date: 14/12/2010;
South Africa, Number: 2006/06828, Date: 26/03/2008;
Australia, Number: 2006203551, Date: 15/09/2011;
Mexico, Number: PXMX 32/2006, Date: November 2012;
Peru, Number: PE 5838, Date: 29/10/2010;
China, Number: 200810095172.6, Date: 2013;
Chile, Number: DPI-660-2007, Date: 06/05/2013;
Argentina, Number: AR056179

Functional genomics

How do the bacteria work?

To improve the process we need to see inside the black box. We sequenced the complete genome of 3 bacteria

Acidithiobacillus ferrooxidans
Acidithiobacillus thiooxidans
Leptospirillum ferrooxidans

We paid over USD 150K. Today it costs USD 5K

Hint: Sequence assembly requires a big computer. It does not work on a regular PC

What do we learn from the DNA sequence?

We used Hidden Markov Models and Pattern Matching techniques to determine the genes and their functions

We learned that

Acidithiobacillus thiooxidans has all the machinery to build flagella
- which was not known before
Acidithiobacillus ferrooxidans has a region where none of the genes has a known ortholog
- it covers 10% of the genome
We identified transcription factors and enzymes

Modeling Metabolism

We predict which genes code enzymes

Each enzyme catalyzes a reaction, with a known stoichiometry

Every reaction gives an equation

All the equations plus boundary conditions give a model that predicts metabolite concentrations

We can predict how the cell adapts to environmental changes
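
One common way to write such equations is a mass balance: each metabolite must be produced as fast as it is consumed. The sketch below checks that condition on a made-up three-reaction network. The network, the reaction names and the flux values are hypothetical, and a real model would have hundreds of reactions.

```python
# Toy network (hypothetical): uptake -> A, A -> B, B -> secretion.
# Rows are metabolites, columns are reactions, entries are stoichiometric coefficients.
reactions = ["uptake_A", "A_to_B", "secrete_B"]
metabolites = ["A", "B"]
S = [
    [1, -1, 0],   # A: produced by uptake, consumed by A_to_B
    [0, 1, -1],   # B: produced by A_to_B, consumed by secretion
]


def net_production(S, fluxes):
    """Net rate of change of each metabolite: the matrix-vector product S·v."""
    return [sum(coef * v for coef, v in zip(row, fluxes)) for row in S]


def is_steady_state(S, fluxes, tol=1e-9):
    """At steady state every metabolite is produced as fast as it is consumed."""
    return all(abs(rate) <= tol for rate in net_production(S, fluxes))


balanced = [2.0, 2.0, 2.0]     # boundary condition: uptake fixed at 2 units
unbalanced = [2.0, 1.0, 1.0]   # A is produced faster than it is consumed

print(is_steady_state(S, balanced))
print(net_production(S, unbalanced))
```

Changing the boundary condition (the uptake flux) and asking which internal fluxes keep the system balanced is the simplest version of asking how the cell adapts to its environment.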

Modeling Regulation

From the genome sequence we can predict which genes code for transcription factors and where they bind

They form a putative regulatory network.

But current methods produce too many false positives

We expected ~4K regulations. We got 25K regulations.

I integrate this model with microarray data to find the “most probable” regulatory network using a parsimony criterion
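
As a toy illustration of what a parsimony criterion can mean here (it is not the method actually used in this work), the sketch below looks for the smallest subset of putative regulations that still explains every observed co-expression. It assumes, for the sake of the example, that a pair of co-expressed genes is "explained" when both share a regulator. All names and data are hypothetical.

```python
from itertools import combinations


def explains(edges, pair):
    """A co-expressed pair is explained if some regulator targets both genes."""
    a, b = pair
    regulators_a = {tf for tf, gene in edges if gene == a}
    regulators_b = {tf for tf, gene in edges if gene == b}
    return bool(regulators_a & regulators_b)


def most_parsimonious(putative_edges, coexpressed_pairs):
    """Smallest subset of putative edges that explains every co-expressed pair.

    Brute force over subsets: fine for a toy, hopeless for 25K regulations.
    """
    for size in range(len(putative_edges) + 1):
        for subset in combinations(putative_edges, size):
            if all(explains(subset, pair) for pair in coexpressed_pairs):
                return list(subset)
    return None


# Hypothetical predictions (regulator, target) and microarray observations.
putative = [
    ("tf1", "g1"), ("tf1", "g2"), ("tf2", "g2"),
    ("tf2", "g3"), ("tf3", "g1"), ("tf1", "g3"),
]
coexpressed = [("g1", "g2"), ("g2", "g3")]

network = most_parsimonious(putative, coexpressed)
print(network)
```

Of the six predicted regulations only three are needed to explain the data; the rest are candidates for being false positives.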

Systems Biology

beyond Bioinformatics

A very active research area that aims to understand the cell as a system with complex interactions

The focus is not on the genes but on the genome

The key is to understand networks

regulatory
metabolic
signaling
protein-protein interaction

Why Computers in Molecular Biology and Genetics?

The present

DNA is digital information

All experimental values in science are measured with an observational error. (e.g. temperature is 10.2 ± 0.05°C, pressure is 101215 ± 125 Pa)

Except genetic sequences: Nucleotides are either A, C, T or G.

There is no “average” or “intermediate case”

So it is natural to use computers and information theory to model DNA
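
A small sketch of what "digital" means in practice: with four possible symbols, each nucleotide fits in exactly two bits and can be stored and recovered without any error. The same idea lets us measure the information content of a sequence with Shannon entropy. The function names and the example sequence are illustrative choices, not part of any standard tool.

```python
import math
from collections import Counter

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
LETTER = {bits: base for base, bits in CODE.items()}


def pack(sequence):
    """Store a DNA string as one integer, two bits per nucleotide."""
    value = 0
    for base in sequence:
        value = (value << 2) | CODE[base]
    return value


def unpack(value, length):
    """Recover the DNA string from its packed integer form."""
    bases = []
    for _ in range(length):
        bases.append(LETTER[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def entropy_bits(sequence):
    """Shannon entropy of the base composition, in bits per nucleotide."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


seq = "GATTACA"
restored = unpack(pack(seq), len(seq))
print(restored == seq)          # the round trip is lossless
print(entropy_bits("ACGT"))     # uniform composition reaches the 2-bit maximum
```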

but there is another reason …

The sequencing of the human genome, announced by the president of the USA, captured the attention of everybody.

Science converges to Molecular Biology

Physicists, mathematicians, computer scientists and engineers turned their attention to molecular biology questions.

They come looking with new eyes and creating new theoretical and practical tools.

Molecular Biology has always interacted with other disciplines

Just consider the word “Biochemistry”

The Internet makes Molecular Biology theory accessible to more people

Before Internet times

top science was accessible only to researchers with money to
- make complex experiments or
- buy expensive books and journals
finding references took several weeks by regular mail
Professors had the only copy of the textbooks

Today

all journals are accessible on-line
references are downloaded in minutes at low cost
- free when the article is Open Access
experimental results of each article are also free

Anyone can analyze this data

Structured data is easy to process to discover new knowledge.

The software for this meta-analysis is also Open Source

Scientists can adapt the program’s source code to answer their specific questions

Anyone can download these programs without cost.

If the analysis requires big computational power you can rent it at low cost

You don’t need your own super-computer

You can rent Cloud computers

Companies like Amazon.com and Google sell their spare computer power at low prices

This enables researchers to carry out computations that would be impossible otherwise.

The World is Flat

This democratization of knowledge provides an exciting challenge.

Rich countries no longer have a monopoly on knowledge.

We can be players in the big leagues, on a level playing field.

We can read the same books and the same articles, use the same machines and the same programs.

Anyone could make the new scientific breakthrough, either in New York, New Delhi or Istanbul.

But the same opportunity presents itself to everyone else.

There are more PhD students than ever

And many of them will be in Molecular Biology

More players come to the game

Emerging economies push up the number of researchers worldwide

India graduates more than a million engineers each year. Many of them in biotechnology

Egypt has 35,000 PhD students and Israel 10,000.

Many of them will find jobs in Molecular Biology companies or academia

and China, Korea, Ukraine

Hays, Thomas. 2011. “PhDs: Israel Also Trains Plenty.” Nature 473 (7347). Nature Publishing Group: 284–84.

How will we be different?

Success of Molecular Biology generates Big Data

Advances in molecular biology technology have produced

new generation sequencers
microarrays
mass spectrometers
real-time PCR.

They produce

reproducible experimental results
in big volumes
at low cost

Data production cost is falling

The first bacterial genomic sequence was published in the journal Science.

Today it would be just a short communication.

National Human Genome Research Institute. http://genome.gov/sequencingcosts

Extracting Information from Raw Data

Surviving the Data Tsunami

In a few years we went from a lack of data to an excess of it

We need to learn how to extract biological meaning from big volumes of data

Classical methods are not enough

What is significant? What is the “null hypothesis”?

If we don’t fully analyze our own experimental data, someone else will

And they will publish it

The plan

what we will teach

Teaching “Introduction to Data Science”

The students will learn

how to handle experimental data
how to communicate with scientists of other data-oriented disciplines
how to produce publication quality reports with reproducible results
how to get raw data, extract relevant information and filter it using several selection criteria
how to store and retrieve it in efficient and useful ways
how to transform it, organize it, categorize it, display it, and understand the results
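
The steps listed above (get raw data, clean it, organize it, filter it) can be sketched in a few lines using only the Python standard library. The gene names, the measurements and the two-fold threshold are invented for illustration; a course exercise would read a real file instead of an embedded string.

```python
import csv
import io

# Hypothetical raw data: expression measurements exported as CSV.
RAW = """gene,condition,expression
geneA,control,10.2
geneA,cold,25.1
geneB,control,8.0
geneB,cold,7.5
geneC,control,
geneC,cold,12.3
"""


def load(text):
    """Parse CSV text into dictionaries, dropping rows with missing values."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {**row, "expression": float(row["expression"])}
        for row in rows
        if row["expression"]
    ]


def fold_changes(records, baseline="control", treatment="cold"):
    """Ratio treatment/baseline for every gene measured in both conditions."""
    by_gene = {}
    for r in records:
        by_gene.setdefault(r["gene"], {})[r["condition"]] = r["expression"]
    return {
        gene: values[treatment] / values[baseline]
        for gene, values in by_gene.items()
        if baseline in values and treatment in values
    }


records = load(RAW)
changes = fold_changes(records)
# Selection criterion (an arbitrary choice): at least a two-fold increase.
selected = {gene: fc for gene, fc in changes.items() if fc >= 2.0}
print(selected)
```

Because every step is a function applied to the raw file, anyone can rerun the script and get the same result, which is the core of a reproducible report.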

Teaching “Scientific Computing”

Teach Python and BioPython to analyze, model, evaluate and predict the behavior of genomic and molecular biology entities.
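
A first exercise could look like the sketch below, written in plain Python so that it runs without installing anything. Biopython's `Seq` class offers an equivalent `reverse_complement()` method; the functions here are simplified teaching versions that assume an upper-case sequence containing only A, C, G and T.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


dna = "ATGC"
print(reverse_complement(dna))
print(gc_content(dna))
```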

The students should be able to interact with high end servers, use command line tools and be comfortable in computing environments other than Microsoft Windows.

Tools include Unix command line tools, SQL and the R statistical package.

The student should be able to understand how computer networks work and what their limitations are.

The idea is not to be experts on computers, but to have the concepts and language to work in interdisciplinary groups

Let’s start learning Data Science

To test these ideas we start next week an

Introduction to Data Science Workshop

The mathematical tools can be explored together with the biological context, so they make sense and are easier to learn.

I will give you a link at the end of this talk.

If you are interested visit the webpage and send an email.

after all, maybe I’m just crazy

Every normal student is capable of good mathematical reasoning if attention is directed to activities of his interest

Jean Piaget, 1976
Swiss psychologist and philosopher

A Secret

You can also learn at home

Everything we will show is available on the Internet

You just need to look for it

But it is in English

Translation takes too long

Translated science is obsolete science

The Future

My personal prediction

It is hard to make predictions, especially about the future

Danish proverb

Molecular Biology has become mainstream

Genomic tools are also used outside academia.

Several companies provide “personalized DNA services”.

23andMe, partially owned by Google.
The Genographic project, created by the National Geographic Society and IBM.

Both offer to trace ancestry and migrations of the human population. Anyone can learn what their true origins are.

Example

Molecular Biology will follow the path of computers

Today PCR thermocyclers are expensive devices found in universities and research centers, very much like desktop computers were in the 70’s and 80’s.

Nowadays computers are low-cost and found everywhere.

Will the same happen with PCR?

PCR future

Today only a few companies produce PCR thermocyclers, just as only a few companies, such as Apple (iPhone) and Samsung, produce smartphones.

Nevertheless you can see them everywhere.

And this is a big opportunity for creators of software applications.

The value is in the apps. Ask Nokia or Blackberry

A computer on every desk and in every home, all running Microsoft software

Bill Gates,
Microsoft’s founding mission.

PCR is the new PC

Gates set this goal in the late 70’s, when it was not obvious if people would even see a computer in their lives.

PCR technology is now in the same state that Personal Computers were in 1975. If PCR machines become inexpensive,

and there is “a PCR on every desk and home”,
in hospitals,
restaurants
and high schools,

then who will be making “software apps” for them?

If PCR machines are available everywhere

applications can be:

Determining ancestry (e.g. race horses, farm animals, fishes)
Detection of unwanted organisms
Marker-assisted breeding
Food quality control (e.g. in a university canteen)
Security and control of Genetically Modified Organisms
Polymorphism detection
Clinical diagnosis
Personalized medicine
Police forensic analysis

Software for PCR

the specific parameters of an application

DNA extraction protocols
Primer design
Amplification protocols
Detection methods
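
To make the idea of a PCR "app" concrete, here is a minimal sketch of two of the parameters listed above: a primer check and an amplification protocol stored as data. The melting temperature uses the Wallace rule, Tm = 2(A+T) + 4(G+C), which is only a rough estimate for short primers. Annealing 5 °C below the lower Tm is a common rule of thumb. The primer sequences, step times and class names are hypothetical placeholders, not a validated protocol.

```python
from dataclasses import dataclass


def wallace_tm(primer):
    """Rough melting temperature in °C by the Wallace rule: 2(A+T) + 4(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


@dataclass
class Step:
    name: str
    temperature_c: float
    seconds: int


def amplification_protocol(forward, reverse, cycles=30):
    """A generic three-step PCR cycle; all numbers are placeholders."""
    # Rule of thumb (an assumption): anneal 5 °C below the lower primer Tm.
    annealing = min(wallace_tm(forward), wallace_tm(reverse)) - 5
    cycle = [
        Step("denaturation", 95, 30),
        Step("annealing", annealing, 30),
        Step("extension", 72, 60),
    ]
    return {"cycles": cycles, "cycle": cycle}


# Hypothetical primer pair, invented for this example.
protocol = amplification_protocol("AGCTGACCTGAGGTCAAG", "TTGACCGATCCAGTTCAG")
for step in protocol["cycle"]:
    print(step.name, step.temperature_c, step.seconds)
```

An application would then be a file like this one: the primers, the protocol and the detection rule, ready to be loaded into any low-cost thermocycler.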

I think we should prepare our students to make these “apps”.

They should have easy access to low-cost thermocyclers and use them frequently and creatively.

Then, like in the computer industry, they may create completely new applications that we cannot foresee now.

New tools for new science

New Instruments trigger advances in Molecular Biology

and in other sciences

They are usually named after their inventor

Galileo created modern science when he made his own telescope
Newton also invented a new kind of telescope, still used today
Bunsen enabled spectrometry analysis with his burner
Svedberg ultracentrifuge (16S)
Sanger DNA sequencing method
Southern blot method for specific DNA detection
PCR to amplify DNA samples

Notice that most of these inventors got Nobel prizes for their contributions.

Scientific Instrumentation

I propose to create a course on “Scientific Instrumentation” using initially software tools.

Making instruments is now “software”, not craftsmanship.

We can understand this with a biological analogy.

Designs in digital files are like genes.
3D printers are like ribosomes, producing physical versions of the design.
Online collaboration is like evolution: designs are changed to improve their fitness.