We want to know how they come to be

Let’s think about bacteria

If an organism X evolves into two new organisms A and B, both new organisms share something in common

For example

```
X: TGGGGCAAGTCGGATCCAGATGGGCGCTAC
A: TGGGGCAAGTCGGATCCAGATGGGCGCTAT
B: TAGGGCAAGTCGGATCCAGATGGGCGCTAC
```
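To make "share something in common" concrete, here is a minimal sketch that counts the positions at which each pair of the sequences above differs. The helper name `hamming` is only an illustrative choice, and the sketch assumes the sequences are already aligned and of equal length.

```python
from itertools import combinations

# The three sequences from the example above.
seqs = {
    "X": "TGGGGCAAGTCGGATCCAGATGGGCGCTAC",
    "A": "TGGGGCAAGTCGGATCCAGATGGGCGCTAT",
    "B": "TAGGGCAAGTCGGATCCAGATGGGCGCTAC",
}

def hamming(s, t):
    """Count positions where two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(s, t) if x != y)

# Distance for every unordered pair of sequences.
pairwise = {(p, q): hamming(seqs[p], seqs[q]) for p, q in combinations(seqs, 2)}
for pair, d in pairwise.items():
    print(pair, d)
```

Each descendant differs from X at one position, while A and B differ from each other at two positions.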

We would see evolution like this

So we only see the *modern* organisms

How can we reconstruct the original tree, given only the modern sequences?


- Random mutations
- Selection
- Competition for the environment
- Bottleneck effect

- Coevolution

DNA replication is not 100% perfect

Mutations can be

- Substitutions
- Insertions
- Deletions
- Reorganizations

Not all mutations are “accepted”

Probably most mutations are lethal

We only see mutations that keep the organism alive

Some mutations can give an advantage

Other mutations are *neutral*

In the short term, all viable organisms are alive

In the long term, and when resources are scarce, some organisms do not survive

For example, some organisms may be more efficient in capturing food or using energy

- Some organisms have higher “fitness”

If the environment changes, the “fitness” changes

- There may be *bottleneck* effects

- There may be

Evolution is more complex for sexual organisms

Some individuals do not pass their genes to the next generation, due to mate-selection

Mate-selection also evolves

We say that phenotype and mate-selection co-evolve

“Every morning in Africa, a gazelle wakes up, it knows it must run faster than the fastest lion or it will be killed.

“Every morning in Africa, a lion wakes up, it knows it must run faster than the slowest gazelle, or it will starve.

“It doesn’t matter whether you’re the lion or a gazelle-when the sun comes up, you’d better be running.”

For this class we will consider the 16S gene in bacteria

- Approx. 1500 nucleotides
- Highly conserved
- Most mutations are lethal
- Cell viability depends on 16S structure

- Asexual reproduction

Trees are a kind of network in which every node is connected and there are no loops

There are two kinds of nodes

- internal nodes
  - they have “children”
- leaves
  - they do not have “children”
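One possible way to hold such a tree in code is sketched below. It assumes a leaf is a string and an internal node is a tuple of its children. This encoding and the function names are illustrative choices, not something the notes prescribe.

```python
# A small rooted tree: leaves are strings, internal nodes are tuples of children.
tree = ((("a", "b"), "e"), ("c", "d"))

def leaves(node):
    """Return the leaves below a node, from left to right."""
    if isinstance(node, str):      # a leaf has no children
        return [node]
    found = []
    for child in node:             # an internal node has children
        found.extend(leaves(child))
    return found

def count_internal(node):
    """Count internal nodes (nodes that have children) below and including node."""
    if isinstance(node, str):
        return 0
    return 1 + sum(count_internal(child) for child in node)

print(leaves(tree))
print(count_internal(tree))
```

In this example there are five leaves and four internal nodes, the topmost tuple being the root.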

We see only the leaves; we want to find the internal nodes and the branches

To represent evolution we start with an initial node

It is the ancestor to all other nodes, and it has no ancestor

It has *two* arrows pointing to its “children”

The arrows start from the root and point to the leaves.

Looking only at the modern data, we cannot know which sequence existed before

That is, we cannot put an arrow between two nodes

We put a *link*, undirected, between nodes

These trees are called *unrooted*

Since we only see leaves, we cannot put arrows

So we cannot tell which internal node is the root

But, if we include a leaf that we know is very distant from all the others (an *outgroup*), then we can find the root.

At least two groups of people have worked with networks: engineers and mathematicians

They use different words for the same objects

Network: graph

Node: vertex

Link: edge

Arrow: arc

The same tree can be drawn in several ways

The drawing is not important

The only important things are

- The tree *topology*. That is, who is connected to whom
- The length of each arc (or edge)

There are basically three approaches

- Maximum parsimony
  - smallest tree that explains all mutations
- Maximum likelihood
  - most probable tree, using a probabilistic model
- Distance based
  - forget the sequences, use only their distances

In all cases the input is a *multiple alignment* of all sequences

If we know the tree topology, we can count how many mutations are needed to match our data
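A minimal sketch of this counting follows. It uses Fitch's algorithm, which is an assumed choice since the notes do not name a method. It counts only substitutions, takes a rooted binary tree written as nested tuples, and uses four toy sequences `s1` to `s4` that are hypothetical.

```python
def fitch(node, column):
    """Return (candidate states, minimum substitutions) for one alignment column."""
    if isinstance(node, str):               # leaf: the state is observed
        return {column[node]}, 0
    left, right = node                      # internal node with two children
    left_states, left_cost = fitch(left, column)
    right_states, right_cost = fitch(right, column)
    common = left_states & right_states
    if common:                              # children can agree: no extra mutation
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of substitutions over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for pos in range(length):
        column = {name: seq[pos] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total

# Hypothetical toy alignment, three columns wide.
alignment = {"s1": "AAG", "s2": "AAG", "s3": "CAG", "s4": "CAT"}
tree_1 = (("s1", "s2"), ("s3", "s4"))
tree_2 = (("s1", "s3"), ("s2", "s4"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

For this toy data the first topology needs two substitutions and the second needs three, so parsimony would prefer the first.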

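But the number of possible trees is HUGE: there are \(n^{n-2}\) labeled trees on \(n\) nodes, and \((2n-5)!!\) unrooted binary trees with \(n\) leaves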

So the search has to be done with heuristics
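A quick sketch of how fast the search space grows. It evaluates the \(n^{n-2}\) count of labeled trees on \(n\) nodes, alongside \((2n-5)!!\), the standard count of unrooted binary trees with \(n\) labeled leaves. The function names are illustrative.

```python
def labeled_trees(n):
    """Number of labeled trees on n nodes: n^(n-2)."""
    return n ** (n - 2)

def unrooted_binary_trees(n):
    """Number of unrooted binary trees with n >= 3 labeled leaves: (2n-5)!!."""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

for n in (5, 10, 20):
    print(n, labeled_trees(n), unrooted_binary_trees(n))
```

Even for ten sequences there are already millions of candidate topologies, which is why an exhaustive search is not practical.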

In some simulations the predicted tree may be very different from the real one

- We only know “the real tree” when we create it

Maximum parsimony can be statistically inconsistent

- That is, adding more sequences sometimes makes a worse tree

An alternative is to find *the most probable tree*, given the available data

This method needs:

- A probabilistic model of evolution
- Looking at all the trees

So, again, we need a heuristic

Here we use the multiple alignment to calculate the distance between sequences. For example

| | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | 17 | 21 | 31 | 23 |
| b | 17 | 0 | 30 | 34 | 21 |
| c | 21 | 30 | 0 | 28 | 39 |
| d | 31 | 34 | 28 | 0 | 43 |
| e | 23 | 21 | 39 | 43 | 0 |

We call it \(D_1\). Then we forget about the sequences

Example data from Wikipedia

The smallest distance in \(D_1\) is \(D_1 (a,b)=17\)

We join \(a\) and \(b\) into a new node \((a,b)\), and update distances

\[ \begin{aligned} D_2((a,b),c)= & \frac{D_1(a,c) + D_1(b,c)}{2}=\frac{21+30}{2}=25.5\\ D_2((a,b),d)= & \frac{D_1(a,d) + D_1(b,d)}{2}=\frac{31+34}{2}=32.5\\ D_2((a,b),e)= & \frac{D_1(a,e) + D_1(b,e)}{2}=\frac{23+21}{2}=22 \end{aligned} \]

| | (a,b) | c | d | e |
|---|---|---|---|---|
| (a,b) | 0 | 25.5 | 32.5 | 22 |
| c | 25.5 | 0 | 28 | 39 |
| d | 32.5 | 28 | 0 | 43 |
| e | 22 | 39 | 43 | 0 |

Now the smallest distance is \(D_2((a,b),e)=22\), so we join \((a,b)\) and \(e\)

\[ \begin{aligned} D_3(((a,b),e),c)= & \frac{D_2((a,b),c) + D_2(e,c)}{2}=\frac{25.5 + 39}{2}=32.25\\ D_3(((a,b),e),d)= & \frac{D_2((a,b),d) + D_2(e,d)}{2}=\frac{32.5 + 43}{2}=37.75 \end{aligned} \]

| | ((a,b),e) | c | d |
|---|---|---|---|
| ((a,b),e) | 0 | 32.25 | 37.75 |
| c | 32.25 | 0 | 28 |
| d | 37.75 | 28 | 0 |

Now the smallest distance is \(D_3(c,d)=28\), so we join \(c\) and \(d\)

\[ \begin{aligned} D_4((c,d),((a,b),e)) & = \frac{D_3(c,((a,b),e)) + D_3(d,((a,b),e))}{2}\\ & = \frac{32.25+37.75}{2}\\ & =35 \end{aligned} \]

| | ((a,b),e) | (c,d) |
|---|---|---|
| ((a,b),e) | 0 | 35 |
| (c,d) | 35 | 0 |

This is called *average linking*, or **W**eighted **P**air **G**roup **M**ethod with **A**rithmetic Mean
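The joining steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions rather than a reference implementation. The function name `wpgma` and the dictionary keyed by `frozenset` pairs are choices made here, and ties are broken by taking the first smallest pair found.

```python
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]

def wpgma(names, matrix):
    """Return the list of joins as (node, node, distance at which they join)."""
    nodes = list(names)
    dist = {frozenset((names[i], names[j])): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    merges = []
    while len(nodes) > 1:
        # Find the closest pair of current nodes.
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = dist[frozenset((nodes[i], nodes[j]))]
                if best is None or d < best[2]:
                    best = (nodes[i], nodes[j], d)
        x, y, _ = best
        merges.append(best)
        new = (x, y)                        # the joined node
        rest = [k for k in nodes if k != x and k != y]
        for k in rest:
            # Simple average, ignoring how many sequences each node holds.
            dist[frozenset((new, k))] = (dist[frozenset((x, k))]
                                         + dist[frozenset((y, k))]) / 2
        nodes = rest + [new]
    return merges

merges = wpgma(names, matrix)
for x, y, d in merges:
    print(x, y, d)
```

On the matrix \(D_1\) the joins happen at distances 17, 22, 28 and 35, matching the tables above.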

One problem: we mix groups of different size

Node *((a,b),e)* has three sequences, and *(c,d)* has two

“bigger nodes” should have more weight

Unweighted pair group method with arithmetic mean

The distance between clusters \(\mathcal{A}\) and \(\mathcal{B}\), of sizes \({|\mathcal{A}|}\) and \({|\mathcal{B}|}\), is the average of all distances \(d(x,y)\) between pairs of objects \(x\) in \(\mathcal{A}\) and \(y\) in \(\mathcal{B}\)

\[d_{(\mathcal{A} \cup \mathcal{B}),X} = \frac{|\mathcal{A}| \cdot d_{\mathcal{A},X} + |\mathcal{B}| \cdot d_{\mathcal{B},X}}{|\mathcal{A}| + |\mathcal{B}|}\]

\[ \begin{aligned} D_2((a,b),c)& =\frac{D_1(a,c) \times 1 + D_1(b,c) \times 1}{1+1}=\frac{21+30}{2}=25.5\\ D_2((a,b),d)& =\frac{D_1(a,d) + D_1(b,d)}{2}=\frac{31+34}{2}=32.5\\ D_2((a,b),e)& =\frac{D_1(a,e) + D_1(b,e)}{2}=\frac{23+21}{2}=22 \end{aligned} \]

This is the same as before

\[ \begin{aligned} D_3(((a,b),e),c)& =\frac{D_2((a,b),c) \times 2 + D_2(e,c) \times 1}{2+1}= \frac{25.5 \times 2 + 39 \times 1}{3}=30\\ D_3(((a,b),e),d)& =\frac{D_2((a,b),d) \times 2 + D_2(e,d) \times 1}{2+1}= \frac{32.5 \times 2 + 43 \times 1}{3}=36 \end{aligned} \]

| | ((a,b),e) | c | d |
|---|---|---|---|
| ((a,b),e) | 0 | 30 | 36 |
| c | 30 | 0 | 28 |
| d | 36 | 28 | 0 |
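The size-weighted update can be sketched in the same spirit. This is a minimal illustration, not a reference implementation. The function name `upgma`, the `size` bookkeeping dictionary and the `frozenset` keys are choices made here, and ties are broken by taking the first smallest pair found.

```python
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]

def upgma(names, matrix):
    """Return the list of joins as (node, node, distance at which they join)."""
    nodes = list(names)
    size = {name: 1 for name in names}      # sequences contained in each node
    dist = {frozenset((names[i], names[j])): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    merges = []
    while len(nodes) > 1:
        # Find the closest pair of current nodes.
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = dist[frozenset((nodes[i], nodes[j]))]
                if best is None or d < best[2]:
                    best = (nodes[i], nodes[j], d)
        x, y, _ = best
        merges.append(best)
        new = (x, y)
        size[new] = size[x] + size[y]
        rest = [k for k in nodes if k != x and k != y]
        for k in rest:
            # Average weighted by the number of sequences in each joined node.
            dist[frozenset((new, k))] = (size[x] * dist[frozenset((x, k))]
                                         + size[y] * dist[frozenset((y, k))]) / size[new]
        nodes = rest + [new]
    return merges

merges = upgma(names, matrix)
for x, y, d in merges:
    print(x, y, d)
```

On \(D_1\) the first three joins happen at 17, 22 and 28 as before, and the final join of \(((a,b),e)\) with \((c,d)\) happens at \((30+36)/2=33\) instead of 35.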

One problem with these approaches is that all pairs are joined at the same distance from the common ancestor

But the mutation rate may be different for different branches

If we know the topology, we can find the branch lengths

We minimize the squared difference between observed distance and tree distance

\[\min_{d_{ij}} \sum_i\sum_j(D_{ij}-d_{ij})^2\]
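As a minimal worked case, assume a tree with only three leaves \(a\), \(b\), \(c\) attached to a single internal node by branches of length \(x_a\), \(x_b\), \(x_c\) (symbols introduced here only for illustration). The tree distances are then \(d_{ab}=x_a+x_b\), \(d_{ac}=x_a+x_c\) and \(d_{bc}=x_b+x_c\). With three equations and three unknowns the squared difference can be brought to zero. Using the values \(D_{ab}=17\), \(D_{ac}=21\), \(D_{bc}=30\) from \(D_1\):

\[ \begin{aligned} x_a &= \frac{D_{ab}+D_{ac}-D_{bc}}{2}=\frac{17+21-30}{2}=4\\ x_b &= \frac{D_{ab}+D_{bc}-D_{ac}}{2}=\frac{17+30-21}{2}=13\\ x_c &= \frac{D_{ac}+D_{bc}-D_{ab}}{2}=\frac{21+30-17}{2}=17 \end{aligned} \]

With more leaves there are more pairwise distances than branches (six distances and five branches for four leaves), so in general the fit is only approximate.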

But we still need to find the tree topology

Neighbor joining is one of the most recommended methods

Instead of joining the nearest nodes in the distance matrix, we look into a new matrix \(Q\)

\[Q_{ij} = (n-2) D_{ij} -\sum_k D_{ik} -\sum_k D_{kj}\]

This “normalized” value can be negative

- calculate \(R_i = \sum_j D_{ij}\) for all \(i\)
- calculate \(Q_{ij} = (n-2) D_{ij} - R_i - R_j\)
- Find the smallest \(Q_{ij}\)
- Join \(i\) and \(j\) into a new node \(u\) \[\begin{aligned} D_{iu} &= \frac{(n-2) D_{ij} + R_i -R_j}{2(n-2)}\\ D_{uk} &= \frac{1}{2}(D_{ik} + D_{jk} - D_{ij}) \end{aligned}\]
- Repeat until only two nodes remain, then join them with a single branch
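One round of the steps above can be sketched as follows, applied to the matrix \(D_1\) from the earlier example. The function name `nj_step` and its return layout are illustrative choices. The sketch assumes more than two nodes, since the formulas divide by \(n-2\). It also uses the relation \(D_{ju} = D_{ij} - D_{iu}\) for the second branch, which is not spelled out in the list.

```python
names = ["a", "b", "c", "d", "e"]
D = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]

def nj_step(names, D):
    """Pick the pair with the smallest Q and compute distances to the new node u."""
    n = len(names)
    R = [sum(row) for row in D]                     # R_i = sum_j D_ij
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - R[i] - R[j]     # Q_ij
            if best is None or q < best[0]:
                best = (q, i, j)
    q, i, j = best
    d_iu = ((n - 2) * D[i][j] + R[i] - R[j]) / (2 * (n - 2))
    d_ju = D[i][j] - d_iu                           # remaining part of D_ij
    # Distance from the new node u to every other node k.
    d_uk = {names[k]: (D[i][k] + D[j][k] - D[i][j]) / 2
            for k in range(n) if k != i and k != j}
    return names[i], names[j], q, d_iu, d_ju, d_uk

first, second, q, d_iu, d_ju, d_uk = nj_step(names, D)
print(first, second, q, d_iu, d_ju, d_uk)
```

On this data the smallest \(Q\) belongs to the pair \((c,d)\), not to \((a,b)\), so the first join differs from the one chosen by the nearest-pair methods.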

It is hard to build time machines, and we only get an approximate answer