Class 9: Phylogenetic Trees

Bioinformatics

Andrés Aravena

December 14, 2024

We know the sequences that exist today.
We want to know how they came to be

Let’s think about bacteria

If an organism X evolves into two new organisms A and B, both new organisms have something in common: what they inherited from X

If we had a time machine…

We would see evolution like this

But we do not have a time machine

So we only see the modern organisms

The question is

How to reconstruct the original tree, given the modern sequences?

How evolution works

See it working

on YouTube

How evolution works

Random mutations
Selection
Competition for the environment
- Bottleneck effect
Coevolution

Random mutations

DNA replication is not 100% perfect

Mutations can be

Substitutions
Insertions
Deletions
Reorganizations
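The four kinds of mutation can be sketched as simple string operations. This is only an illustration: the function names are made up for this example, and `invert` is a simplified stand-in for a reorganization (a real inversion would also complement the bases).

```python
def substitute(seq, pos, base):
    """Replace the nucleotide at index pos with base."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq, pos, fragment):
    """Insert fragment before index pos."""
    return seq[:pos] + fragment + seq[pos:]


def delete(seq, pos, length=1):
    """Remove length nucleotides starting at index pos."""
    return seq[:pos] + seq[pos + length:]


def invert(seq, start, end):
    """Reverse the segment seq[start:end] (simplified reorganization)."""
    return seq[:start] + seq[start:end][::-1] + seq[end:]


original = "TATCGACTTCGGCAT"
print(substitute(original, 7, "G"))  # TATCGACGTCGGCAT
print(insert(original, 3, "AA"))     # TATAACGACTTCGGCAT
print(delete(original, 3, 2))        # TATACTTCGGCAT
print(invert(original, 0, 4))        # CTATGACTTCGGCAT
```

Only substitutions keep the sequence length unchanged, which is one reason why sequences must be aligned before they are compared.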

Selection

Not all mutations are “accepted”
Probably most mutations are lethal
We only see mutations that keep the organism alive
- Some mutations can give an advantage
- Other mutations are neutral

Competition

In the short term, all viable organisms are alive
In the long term, and when resources are scarce, some organisms do not survive
For example, some organisms may be more efficient in capturing food or using energy
- Some organisms have higher “fitness”
If the environment changes, the “fitness” changes
- There may be bottleneck effects

Coevolution

Evolution is more complex for sexual organisms
Some individuals do not pass their genes to the next generation, due to mate-selection
Mate-selection also evolves
We say that phenotype and mate-selection co-evolve

Coevolution between predator and prey

“Every morning in Africa, a gazelle wakes up. It knows it must run faster than the fastest lion or it will be killed.
Every morning in Africa, a lion wakes up. It knows it must run faster than the slowest gazelle, or it will starve.
It doesn’t matter whether you’re the lion or a gazelle: when the sun comes up, you’d better be running.”

Molecular evolution

Looking at only one gene

For this class we will consider the 16S rRNA gene in bacteria

Approx. 1500 nucleotides
Highly conserved
- Most mutations are lethal
- Cell viability depends on 16S structure
Asexual reproduction

Unrooted trees

Looking only at the modern data, we cannot know which sequence existed before

That is, we cannot put an arrow between two nodes

We put a link, undirected, between nodes

These trees are called unrooted

Outgroups point to the root

Since we only see leaves, we cannot put arrows

So we cannot tell which internal node is the root

But if we include a leaf that we know is very distant from all the others (an outgroup), then we can find the root.
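A minimal sketch of rooting with an outgroup, assuming the unrooted tree is stored as an adjacency dictionary. The function name, the node labels and the `"root"` label are hypothetical choices for this example. The root is placed on the branch that joins the outgroup to the rest of the tree, and every link is then oriented away from it.

```python
def root_with_outgroup(adjacency, outgroup):
    """Orient an unrooted tree using an outgroup leaf.

    adjacency maps each node to the list of its neighbors.
    Returns a dict mapping each node to the list of its children.
    """
    (neighbor,) = adjacency[outgroup]  # a leaf has exactly one neighbor
    children = {"root": [outgroup, neighbor], outgroup: []}
    stack = [(neighbor, outgroup)]
    while stack:
        node, parent = stack.pop()
        # Every neighbor except the parent becomes a child.
        children[node] = [n for n in adjacency[node] if n != parent]
        stack.extend((child, node) for child in children[node])
    return children


# Unrooted tree: leaves A, B, C and the outgroup "Out"; x and y are internal.
unrooted = {
    "A": ["x"], "B": ["x"], "C": ["y"], "Out": ["y"],
    "x": ["A", "B", "y"], "y": ["x", "C", "Out"],
}
rooted = root_with_outgroup(unrooted, "Out")
print(rooted["root"])  # ['Out', 'y']
print(rooted["y"])     # ['x', 'C']
print(rooted["x"])     # ['A', 'B']
```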

Illustration: Unrooted tree

Illustration: Unrooted tree with outgroup

Illustration: Rooted tree

Essence of a tree

The same tree can be drawn in several ways

The drawing is not important

The only important things are

The tree topology. That is, which nodes are connected to which
The length of each arc (or edge)
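One way to see that the drawing does not matter is to store a tree as a set of undirected edges with lengths. The sketch below assumes that internal nodes carry the same labels in both descriptions, which real tools do not require; the names and lengths are invented for the example.

```python
def edge_set(edges):
    """Turn a list of (node, node, length) into an order-free set."""
    return {(frozenset((u, v)), length) for u, v, length in edges}


# The same tree written down in two different orders and directions.
drawing_1 = [("A", "x", 0.1), ("B", "x", 0.2), ("x", "C", 0.3)]
drawing_2 = [("C", "x", 0.3), ("x", "B", 0.2), ("x", "A", 0.1)]

# A tree with the same topology but a different branch length.
drawing_3 = [("A", "x", 0.1), ("B", "x", 0.2), ("x", "C", 0.5)]

print(edge_set(drawing_1) == edge_set(drawing_2))  # True
print(edge_set(drawing_1) == edge_set(drawing_3))  # False
```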

Reconstructing the tree

There are basically three approaches

Maximum parsimony
- the tree that needs the fewest mutations to explain the data
Maximum likelihood
- most probable tree, using a probabilistic model
Distance based
- forget the sequences, use only their distances

In all cases the input is a multiple alignment of all sequences

Maximum parsimony

If we know the tree topology, we can count how many mutations are needed to match our data

Based on Bininda-Emonds et al (1998) “Properties of Matrix Representation with Parsimony Analyses”. Systematic Biology, 47(3), 497–508.
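Counting the mutations needed for a given topology can be sketched with Fitch's small-parsimony algorithm. This is one standard way to do the count, not necessarily the one used in the figure. The sketch assumes a rooted binary tree written as nested tuples of leaf names, counts substitutions only, and uses a toy alignment invented for the example.

```python
def fitch_site(tree, leaf_states):
    """Return (possible ancestral states, mutations) for one alignment column."""
    if isinstance(tree, str):  # a leaf: its state is known
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, leaf_states)
    right_set, right_cost = fitch_site(right, leaf_states)
    common = left_set & right_set
    if common:  # the children can agree: no extra mutation
        return common, left_cost + right_cost
    # The children disagree: one more mutation is needed.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Add the minimum number of mutations over all columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}
print(parsimony_score((("A", "B"), ("C", "D")), alignment))  # 2
print(parsimony_score((("A", "C"), ("B", "D")), alignment))  # 4
```

The first topology explains the data with fewer mutations, so parsimony prefers it.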

So we just have to test all trees and see which is the best one

There are too many trees

But the number of trees is HUGE

By Cayley's formula, the number of labeled trees with \(n\) nodes is

\[n^{n-2}\]

So the search has to be done with heuristics
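To get a feeling for the growth, the sketch below evaluates \(n^{n-2}\), which counts labeled trees on \(n\) nodes. For comparison it also evaluates \((2n-5)!!\), the usual count of unrooted binary trees with \(n\) labeled leaves; that second formula is not on the slide and is added here as a reference point. The function names are hypothetical.

```python
def labeled_trees(n):
    """Number of labeled trees on n nodes: n^(n-2)."""
    return n ** (n - 2)


def unrooted_binary_trees(n_leaves):
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), valid for n_leaves >= 3."""
    count = 1
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, labeled_trees(n), unrooted_binary_trees(n))
```

With either formula, testing every tree is already out of reach for a few dozen sequences.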

Maximum likelihood

An alternative is to find the most probable tree, given the available data

This method needs:

A probabilistic model of evolution
Looking at all the trees

So, again, we need a heuristic

Distance methods

We already discussed them

UPGMA
Neighbor Joining

Here we use the Hamming or Levenshtein distance between sequences after multiple sequence alignment
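A minimal sketch of the Hamming distance and of the distance matrix that UPGMA or Neighbor Joining would take as input. It assumes the sequences are already aligned to the same length, and it counts a gap against a nucleotide as a difference, which is a simplification. The sequence names are invented.

```python
def hamming(a, b):
    """Number of positions where two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(aligned):
    """Pairwise Hamming distances, keyed by (name, name)."""
    names = list(aligned)
    return {(p, q): hamming(aligned[p], aligned[q]) for p in names for q in names}


aligned = {
    "s1": "TATCGACTTCGGCAT",
    "s2": "TATCGACGTCGGCAT",
    "s3": "TATCGACTACGGCAT",
}
distances = distance_matrix(aligned)
print(distances[("s1", "s2")])  # 1
print(distances[("s2", "s3")])  # 2
```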

Distance and time

Hamming Distance is not time

The number of observed differences is not proportional to time

Multiple substitutions of the same base cannot be observed

TATCGACTTCGGCAT
TATCGACGTCGGCAT
TATCGACTTCGGCAT
TATCGACTACGGCAT
TATCGACTTCGGCAT

So we underestimate the divergence time
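The five sequences above make the point concrete if we read them as successive generations of one lineage (an assumption about how the slide is meant to be read). Four substitutions happened along the way, but comparing only the first and the last sequence shows none.

```python
def hamming(a, b):
    """Number of positions where two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


lineage = [
    "TATCGACTTCGGCAT",
    "TATCGACGTCGGCAT",
    "TATCGACTTCGGCAT",
    "TATCGACTACGGCAT",
    "TATCGACTTCGGCAT",
]

# Substitutions that really happened, step by step.
real = sum(hamming(a, b) for a, b in zip(lineage, lineage[1:]))

# Differences we can observe between the ancestor and the last descendant.
observed = hamming(lineage[0], lineage[-1])

print(real, observed)  # 4 0
```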

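Maximum observable DNA difference ≈ 75%

With four nucleotides, two unrelated sequences still match at about 1 site in 4 by chance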
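The 75% ceiling can be checked with a small simulation. The sketch assumes that the four nucleotides are equally likely and independent at every site, which real genomes only approximate; the seed and the sequence length are arbitrary.

```python
import random


def random_sequence(length, rng):
    """A random DNA sequence with equally likely nucleotides."""
    return "".join(rng.choice("ACGT") for _ in range(length))


rng = random.Random(0)  # fixed seed so the run is repeatable
length = 10_000
a = random_sequence(length, rng)
b = random_sequence(length, rng)

# Fraction of sites where the two unrelated sequences differ.
fraction = sum(x != y for x, y in zip(a, b)) / length
print(round(fraction, 2))
```

The printed fraction should be close to 0.75, so a Hamming distance cannot show more divergence than that, no matter how much time has passed.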

Substitution model

There are different models to find time given distance

The simplest one is Jukes-Cantor (1969)

Kimura (1980)

Tamura (1992)

Real vs. observed mutations

According to the Jukes-Cantor model

\[R = -\frac{3}{4}\ln\left(1-\frac{4}{3}\cdot\frac{D}{L}\right)\]

Here \(D/L\) is the fraction of sites with different nucleotides
(Hamming distance over length)

\(R\) is the expected number of mutations per site that really happened
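The Jukes-Cantor correction is short enough to write directly. In this sketch the function name is hypothetical, and the error for \(D/L \geq 0.75\) is a design choice: the logarithm is undefined there because the sequences look saturated.

```python
import math


def jukes_cantor(differences, length):
    """Expected real substitutions per site, from the observed D and L."""
    p = differences / length  # observed fraction of different sites, D/L
    if p >= 0.75:
        raise ValueError("D/L must be below 0.75 for the correction to exist")
    return -0.75 * math.log(1 - 4 * p / 3)


# 30 observed differences over 100 sites.
print(round(jukes_cantor(30, 100), 3))  # 0.383
```

The corrected value (about 0.38) is larger than the observed 0.30, and the gap grows quickly as \(D/L\) approaches 0.75.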

In summary

It is hard to build time machines, and we only get an approximate answer
