Intro to phylogeny

Plan

• We will learn to build phylogenetic trees
• We start building trees based on the distances between species
• We assume that we know the distances
• Distances are hard to calculate
• We will study that later

We know the distance between elements

For instance, in a matrix $$D_1$$:

a b c d e
a 0 17 21 31 23
b 17 0 30 34 21
c 21 30 0 28 39
d 31 34 28 0 43
e 23 21 39 43 0

Example data from Wikipedia

What is a distance?

What are its properties?
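One common answer, given here as background rather than taken from the slides: a distance is usually required to be non-negative, zero from an element to itself, symmetric, and to satisfy the triangle inequality. The sketch below checks these properties for the example matrix; the function name `is_metric` is made up for this illustration.

```python
from itertools import permutations

# Example matrix D_1 from the slides (rows and columns in the order a, b, c, d, e)
D1 = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]

def is_metric(D):
    """Check zero diagonal, non-negativity, symmetry and the triangle inequality."""
    n = len(D)
    for i in range(n):
        if D[i][i] != 0:
            return False
        for j in range(n):
            if D[i][j] < 0 or D[i][j] != D[j][i]:
                return False
    # Triangle inequality: going through a third element k is never shorter
    return all(D[i][j] <= D[i][k] + D[k][j]
               for i, j, k in permutations(range(n), 3))

print(is_metric(D1))
```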

Method 1

Hierarchical clustering

The idea is to group similar nodes in the same branch

The smallest distance in $$D_1$$ is $$D_1 (a,b)=17$$

a b c d e
a 0 17 21 31 23
b 17 0 30 34 21
c 21 30 0 28 39
d 31 34 28 0 43
e 23 21 39 43 0

So, $$a$$ and $$b$$ are the closest elements
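Finding the closest pair amounts to scanning the entries above the diagonal. A minimal sketch, assuming the matrix is stored as a list of rows (the helper name `closest_pair` is only for this example):

```python
labels = ["a", "b", "c", "d", "e"]
D1 = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]

def closest_pair(labels, D):
    """Return (distance, label_i, label_j) for the smallest off-diagonal entry."""
    # Only pairs with i < j are visited, since the matrix is symmetric
    return min(
        (D[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
    )

print(closest_pair(labels, D1))  # (17, 'a', 'b')
```

If two pairs had the same distance, this sketch would pick the one whose labels come first alphabetically; that tie-breaking rule is an arbitrary choice.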

We join the closest elements

Before joining, every element is its own separate node

We create a new node $$(a,b)$$

We connect $$a$$ and $$b$$ to $$(a,b)$$, splitting their distance in half: each branch has length $$17/2=8.5$$

Now we update the other distances

We build a new matrix $$D_2$$ with the average distance of each element to $$(a,b)$$

\begin{aligned} D_2((a,b),c)= & \frac{D_1(a,c) + D_1(b,c)}{2}=\frac{21+30}{2}=25.5\\ D_2((a,b),d)= & \frac{D_1(a,d) + D_1(b,d)}{2}=\frac{31+34}{2}=32.5\\ D_2((a,b),e)= & \frac{D_1(a,e) + D_1(b,e)}{2}=\frac{23+21}{2}=22 \end{aligned}

Updated matrix

The matrix $$D_2$$ is

(a,b) c d e
(a,b) 0 25.5 32.5 22
c 25.5 0 28 39
d 32.5 28 0 43
e 22 39 43 0

(the values in the row and column of $$(a,b)$$ are new; all the others did not change)

Now the smallest distance is $$D_2 ((a,b),e)=22$$.
We must join $$(a,b)$$ and $$e$$

And we update the distances

\begin{aligned} D_3(((a,b),e),c)= & \frac{D_2((a,b),c) + D_2(e,c)}{2}\\ = & \frac{25.5 + 39}{2}=32.25\\ D_3(((a,b),e),d)= & \frac{D_2((a,b),d) + D_2(e,d)}{2}\\ = & \frac{32.5 + 43}{2}=37.75 \end{aligned}

We have a new matrix

The matrix $$D_3$$ is

((a,b),e) c d
((a,b),e) 0 32.25 37.75
c 32.25 0 28
d 37.75 28 0

Now the closest elements are $$c$$ and $$d$$

We calculate the distance matrix $$D_4$$

We calculate the only remaining distance

\begin{aligned} D_4((c,d),((a,b),e)) & = \frac{D_3(c,((a,b),e)) + D_3(d,((a,b),e))}{2}\\ & = \frac{32.25+37.75}{2} =35 \end{aligned}

The new matrix is

((a,b),e) (c,d)
((a,b),e) 0 35
(c,d) 35 0
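The whole procedure can be sketched as a loop: find the closest pair, merge it, average the distances, and repeat. This is an illustrative sketch, not a reference implementation; the function name `wpgma` and the dictionary layout are assumptions made for this example.

```python
def wpgma(labels, rows):
    """Return the list of merges as (new_label, distance_at_which_it_was_joined)."""
    # Distance table as a dict of dicts keyed by cluster label
    D = {p: {q: rows[i][j] for j, q in enumerate(labels)}
         for i, p in enumerate(labels)}
    merges = []
    while len(D) > 1:
        keys = list(D)
        # Closest pair among the current clusters
        d, x, y = min((D[p][q], p, q)
                      for i, p in enumerate(keys) for q in keys[i + 1:])
        new = f"({x},{y})"
        others = [k for k in keys if k not in (x, y)]
        # WPGMA update: plain average of the two merged rows
        row = {k: (D[x][k] + D[y][k]) / 2 for k in others}
        new_D = {new: {new: 0, **row}}
        for k in others:
            new_D[k] = {new: row[k], **{m: D[k][m] for m in others}}
        D = new_D
        merges.append((new, d))
    return merges

rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
for label, dist in wpgma("abcde", rows):
    print(label, dist)
```

The merge distances come out as 17, 22, 28 and 35, as in the matrices above. The last label is printed as `((c,d),((a,b),e))`; the order of the two children inside a pair of parentheses carries no meaning, so this is the same tree.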

Describing the tree

We can represent the complete tree by $(((a,b),e),(c,d))$

The parentheses show how to connect every element

But the branch lengths are still missing

We can write the distance to the parent after the node label

$(((a\colon D_a, b\colon D_b)\colon D_{ab},e\colon D_e)\colon D_{abe},(c\colon D_c,d\colon D_d)\colon D_{cd});$

This is called Newick format

The resulting tree

can be written (including labels of internal nodes) as $(((a\colon 8.5, b\colon 8.5)w\colon 2.5,e\colon 11)v\colon 6.5,(c\colon 14,d\colon 14)u\colon 3.5)r;$
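The branch lengths in this expression can be reproduced as follows. This is a sketch under one assumption that the slides do not spell out: each internal node is placed at a height equal to half the distance at which its two children were joined, and leaves have height 0. The heights $$h$$ are notation introduced only for this derivation.

\begin{aligned}
h_w &= \frac{17}{2} = 8.5, \quad h_v = \frac{22}{2} = 11, \quad h_u = \frac{28}{2} = 14, \quad h_r = \frac{35}{2} = 17.5\\
D_a &= D_b = h_w = 8.5\\
D_{ab} &= h_v - h_w = 11 - 8.5 = 2.5, \quad D_e = h_v = 11\\
D_c &= D_d = h_u = 14\\
D_{abe} &= h_r - h_v = 17.5 - 11 = 6.5, \quad D_{cd} = h_r - h_u = 17.5 - 14 = 3.5
\end{aligned}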

This is hierarchical clustering

This is called Weighted Pair Group Method with Arithmetic Mean (WPGMA)

There are other hierarchical clustering methods, depending on how we evaluate the distance between

• two single elements
• one element and a group
• two groups

One problem: we mix groups of different sizes

Node $$((a,b),e)$$ has three sequences, and $$(c,d)$$ has two

Bigger nodes should have more weight in the average

Method 2

Alternative: UPGMA

Unweighted pair group method with arithmetic mean

The distance between two branches $$A$$ and $$B$$, of sizes $${N_A}$$ and $${N_B}$$, is the average of all distances $$D(x,y)$$ with $$x$$ in $$A$$ and $$y$$ in $$B$$. When $$A$$ and $$B$$ are merged, the distance to any other branch $$X$$ is

$D((A,B),X) = \frac{N_A \cdot D(A,X) + N_B \cdot D(B,X)}{N_A + N_B}$
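The update rule is a size-weighted average. A minimal sketch (the function name `merged_distance` and the numbers in the second call are hypothetical, chosen only to show the effect of the weights):

```python
def merged_distance(d_ax, d_bx, n_a, n_b):
    """UPGMA distance from the merged branch (A,B) to another branch X."""
    return (n_a * d_ax + n_b * d_bx) / (n_a + n_b)

# Two single elements: the result is the plain average, as in WPGMA
print(merged_distance(21, 30, 1, 1))   # 25.5

# Hypothetical sizes 3 and 1: the bigger branch pulls the result towards 10
print(merged_distance(10, 20, 3, 1))   # 12.5
```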

Redoing our example

\begin{aligned} D_2((a,b),c)& =\frac{D_1(a,c) \times 1 + D_1(b,c) \times 1}{1+1}\\ & =\frac{21+30}{2}=25.5\\ D_2((a,b),d)& =\frac{D_1(a,d) + D_1(b,d)}{2}=\frac{31+34}{2}=32.5\\ D_2((a,b),e)& =\frac{D_1(a,e) + D_1(b,e)}{2}=\frac{23+21}{2}=22 \end{aligned}

The first step is the same as before

Next step is different

\begin{aligned} D_3(((a,b),e),c)&=\frac{D_2((a,b),c) \times 2 + D_2(e,c) \times 1}{2+1}\\ & =\frac{25.5 \times 2 + 39 \times 1}{3}=30\\ D_3(((a,b),e),d)&=\frac{D_2((a,b),d) \times 2 + D_2(e,d) \times 1}{2+1}\\ & =\frac{32.5 \times 2 + 43 \times 1}{3}=36 \end{aligned}

Matrix $$D_3$$ in UPGMA

((a,b),e) c d
((a,b),e) 0 30 36
c 30 0 28
d 36 28 0
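The full UPGMA run can be sketched by keeping track of the size of every cluster. As before, this is an illustrative sketch; the function name `upgma` and the data layout are assumptions for this example.

```python
def upgma(labels, rows):
    """Return the list of merges as (new_label, distance_at_which_it_was_joined)."""
    D = {p: {q: rows[i][j] for j, q in enumerate(labels)}
         for i, p in enumerate(labels)}
    size = {p: 1 for p in labels}          # number of leaves in each cluster
    merges = []
    while len(D) > 1:
        keys = list(D)
        d, x, y = min((D[p][q], p, q)
                      for i, p in enumerate(keys) for q in keys[i + 1:])
        new = f"({x},{y})"
        others = [k for k in keys if k not in (x, y)]
        nx, ny = size[x], size[y]
        # UPGMA update: average weighted by cluster sizes
        row = {k: (nx * D[x][k] + ny * D[y][k]) / (nx + ny) for k in others}
        new_D = {new: {new: 0, **row}}
        for k in others:
            new_D[k] = {new: row[k], **{m: D[k][m] for m in others}}
        size[new] = nx + ny
        D = new_D
        merges.append((new, d))
    return merges

rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
for label, dist in upgma("abcde", rows):
    print(label, dist)
```

With this data the merge order is the same as in WPGMA, but the last distance is $$(30 \times 1 + 36 \times 1)/2 = 33$$ instead of 35.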

UPGMA is used much more often than WPGMA

In practice UPGMA is more realistic than WPGMA, because every original element has the same weight in the averages

But both have a problem:

The distances between leaves, measured along the tree, do not match the original distances

Moreover, the mutation rate may be different for different branches, while both methods place all leaves at the same distance from the root
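The mismatch can be checked directly on the WPGMA tree written earlier in Newick format. The sketch below stores each node's parent and branch length (values copied from that Newick expression) and adds up the branches on the path between two leaves; the helper names are made up for this example.

```python
# child -> (parent, branch length), from the WPGMA tree in Newick format
parent = {
    "a": ("w", 8.5), "b": ("w", 8.5), "w": ("v", 2.5), "e": ("v", 11),
    "c": ("u", 14), "d": ("u", 14), "v": ("r", 6.5), "u": ("r", 3.5),
}

def path_to_root(node):
    """Map every ancestor of node (and node itself) to its distance from node."""
    dist = {node: 0.0}
    total = 0.0
    while node in parent:
        node, length = parent[node]
        total += length
        dist[node] = total
    return dist

def tree_distance(x, y):
    """Length of the path between x and y through their closest common ancestor."""
    up_x, up_y = path_to_root(x), path_to_root(y)
    return min(up_x[n] + up_y[n] for n in up_x if n in up_y)

# A few original distances from D_1 for comparison
original = {("a", "b"): 17, ("a", "e"): 23, ("b", "e"): 21, ("a", "c"): 21}
for (x, y), d in original.items():
    print(x, y, "original:", d, "tree:", tree_distance(x, y))
```

Only the pair joined first keeps its original distance; for instance $$a$$ and $$c$$ are 35 apart in the tree but 21 apart in $$D_1$$.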

Method 3

Minimization problem

If we know the tree topology, we can find the branch lengths

We minimize the squared difference between observed distance $$D_{ij}$$ and tree distance $$d_{ij}$$

$\min_{d_{ij}} \sum_{i,j}(D_{ij}-d_{ij})^2$

But we still need to find the tree topology, and that is a hard problem.
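As a small worked case, assume (only for illustration) a tree with three leaves $$a$$, $$b$$, $$c$$ joined to a single internal node, with branch lengths $$x_a$$, $$x_b$$, $$x_c$$; these names are introduced here. The tree distances are $$d_{ab}=x_a+x_b$$, $$d_{ac}=x_a+x_c$$ and $$d_{bc}=x_b+x_c$$, and with the values of $$D_1$$ the squared difference can be made exactly zero:

\begin{aligned}
x_a &= \frac{D_{ab} + D_{ac} - D_{bc}}{2} = \frac{17 + 21 - 30}{2} = 4\\
x_b &= \frac{D_{ab} + D_{bc} - D_{ac}}{2} = \frac{17 + 30 - 21}{2} = 13\\
x_c &= \frac{D_{ac} + D_{bc} - D_{ab}}{2} = \frac{21 + 30 - 17}{2} = 17
\end{aligned}

With four or more leaves there are more pairwise distances than branches, so in general the fit is only approximate.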

Neighbor joining

This is a heuristic to solve the minimization problem

Instead of joining the nearest nodes in the distance matrix, we look for the smallest entry of a new matrix $$Q$$

$Q_{ij} = (n-2) D_{ij} -\sum_k D_{ik} -\sum_k D_{kj}$

This “neighbor-joining” distance can be negative

Method

• calculate $$R_i = \sum_j D_{ij}$$ for all $$i$$
• calculate $$Q_{ij} = (n-2) D_{ij} - R_i - R_j$$
• Find the smallest $$Q_{ij}$$
• Join $$i$$ and $$j$$ into a new node $$u$$ \begin{aligned} D_{iu} &= \frac{(n-2) D_{ij} + R_i -R_j}{2(n-2)}\\ D_{ju} &= D_{ij} - D_{iu}\\ D_{uk} &= \frac{1}{2}(D_{ik} + D_{jk} - D_{ij}) \end{aligned}
• Repeat with the reduced matrix until only two nodes remain

a b c d e
a 0 17 21 31 23
b 17 0 30 34 21
c 21 30 0 28 39
d 31 34 28 0 43
e 23 21 39 43 0

First we calculate $$Q_1$$

The row sums are $$R_a=92,\ R_b=102,\ R_c=118,\ R_d=136,\ R_e=126$$. For each $$i,j \in \{a,b,c,d,e\},\ i \neq j,$$ with $$n=5$$ we have $Q(i,j) = (n-2) D(i,j) - R_i - R_j$

a b c d e
a 0 -143 -147 -135 -149
b -143 0 -130 -136 -165
c -147 -130 0 -170 -127
d -135 -136 -170 0 -133
e -149 -165 -127 -133 0

The nearest elements are $$c$$ and $$d$$
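The row sums and the matrix $$Q$$ can be computed with a few lines. A minimal sketch, assuming the matrix is stored as a list of rows (the function name `q_matrix` is only for this example):

```python
labels = ["a", "b", "c", "d", "e"]
D1 = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]

def q_matrix(D):
    """Return the row sums R and the neighbor-joining matrix Q (diagonal set to 0)."""
    n = len(D)
    R = [sum(row) for row in D]
    Q = [[0 if i == j else (n - 2) * D[i][j] - R[i] - R[j]
          for j in range(n)] for i in range(n)]
    return R, Q

R, Q = q_matrix(D1)
print(R)  # [92, 102, 118, 136, 126]

# Smallest off-diagonal entry of Q and the pair it belongs to
best = min((Q[i][j], labels[i], labels[j])
           for i in range(len(labels)) for j in range(i + 1, len(labels)))
print(best)  # (-170, 'c', 'd')
```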

New node $$u$$ connects $$c$$ and $$d$$

\begin{aligned} D(c, u) &= \frac{D(c,d)}{2} + \frac{R_c -R_d}{2(5-2)} = \frac{28}{2} + \frac{118-136}{6} = 11\\ D(d, u) &= D(c,d) - D(c,u) = 28 - 11 = 17 \end{aligned}

Matrix $$D_2$$

For each $$k \in \{a,b,e\}$$ we have $D(u,k) = \frac{1}{2}(D(c,k) + D(d,k) - D(c,d))$

a b u e
a 0 17 12 23
b 17 0 18 21
u 12 18 0 27
e 23 21 27 0
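One joining step can be sketched as a function that returns the two new branch lengths and the reduced matrix. This is an illustration under stated assumptions: the name `nj_join` is made up, and the new node is appended as the last row and column, so the order is a, b, e, u rather than a, b, u, e as in the table above.

```python
labels = ["a", "b", "c", "d", "e"]
D1 = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]

def nj_join(labels, D, i, j, new):
    """Join nodes i and j (row indices) into a node called new."""
    n = len(D)
    R = [sum(row) for row in D]
    # Branch lengths from i and j to the new node
    d_iu = D[i][j] / 2 + (R[i] - R[j]) / (2 * (n - 2))
    d_ju = D[i][j] - d_iu
    keep = [k for k in range(n) if k not in (i, j)]
    # Distances from the new node to every remaining node
    d_u = [(D[i][k] + D[j][k] - D[i][j]) / 2 for k in keep]
    new_labels = [labels[k] for k in keep] + [new]
    new_D = [[D[k][m] for m in keep] + [d_u[pos]] for pos, k in enumerate(keep)]
    new_D.append(d_u + [0])
    return d_iu, d_ju, new_labels, new_D

d_cu, d_du, labels2, D2 = nj_join(labels, D1, 2, 3, "u")
print(d_cu, d_du)   # 11.0 17.0
print(labels2)      # ['a', 'b', 'e', 'u']
print(D2[-1])       # [12.0, 18.0, 27.0, 0]
```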

The matrix $$Q_2$$, computed from $$D_2$$ with $$n=4$$, is

a b u e
a 0 -74 -85 -77
b -74 0 -77 -85
u -85 -77 0 -74
e -77 -85 -74 0

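The smallest value, $$-85$$, appears for both $$(a,u)$$ and $$(b,e)$$; we choose $$(a,u)$$

New node $$v$$ connects $$a$$ and $$u$$

\begin{aligned} D(a, v) &= \frac{D(a,u)}{2} + \frac{R_a - R_u}{2(4-2)} = \frac{12}{2} + \frac{52-57}{4} = 4.75\\ D(u, v) &= D(a,u) - D(a,v) = 12 - 4.75 = 7.25 \end{aligned}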

Matrix $$D_3$$

v b e
v 0 11.5 19
b 11.5 0 21
e 19 21 0

The matrix $$Q_3$$ is

v b e
v 0 -51.5 -51.5
b -51.5 0 -51.5
e -51.5 -51.5 0

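All off-diagonal values are equal ($$-51.5$$), so any pair can be joined

New node $$w$$ connects $$v$$ and $$b$$

\begin{aligned} D(v, w) &= \frac{D(v,b)}{2} + \frac{R_v - R_b}{2(3-2)} = \frac{11.5}{2} + \frac{30.5-32.5}{2} = 4.75\\ D(b, w) &= D(v,b) - D(v,w) = 11.5 - 4.75 = 6.75 \end{aligned}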

Matrix $$D_4$$

w e
w 0 14.25
e 14.25 0

Thus $$D(w, e) = \frac{1}{2}(D(v,e) + D(b,e) - D(v,b)) = \frac{1}{2}(19 + 21 - 11.5) = 14.25$$
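Collecting the branch lengths found in the steps above gives one way to write the neighbor-joining tree in Newick format. This is a sketch: neighbor joining produces an unrooted tree, and writing it with $$w$$ as the top node is an arbitrary choice made here only for notation.

$(((c\colon 11, d\colon 17)u\colon 7.25, a\colon 4.75)v\colon 4.75, b\colon 6.75, e\colon 14.25)w;$

For example, the tree distance between $$b$$ and $$e$$ is $$6.75 + 14.25 = 21$$, equal to $$D_1(b,e)$$, while the tree distance between $$a$$ and $$b$$ is $$4.75 + 4.75 + 6.75 = 16.25$$, close to but not equal to $$D_1(a,b)=17$$.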

Exercise

Redo all the calculations for these trees