- We will learn to build phylogenetic trees
- We start building trees based on the distances between species
- We assume that we know the distances
- Distances are hard to calculate
- We will study that later

For instance, the distances can be given in a matrix \(D_1\):

|   | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | 17 | 21 | 31 | 23 |
| b | 17 | 0 | 30 | 34 | 21 |
| c | 21 | 30 | 0 | 28 | 39 |
| d | 31 | 34 | 28 | 0 | 43 |
| e | 23 | 21 | 39 | 43 | 0 |

Example data from Wikipedia

What are its properties?

The idea is to group *similar* nodes in the same branch

The smallest distance in \(D_1\) is \(D_1 (a,b)=17\)

|   | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | 17 | 21 | 31 | 23 |
| b | 17 | 0 | 30 | 34 | 21 |
| c | 21 | 30 | 0 | 28 | 39 |
| d | 31 | 34 | 28 | 0 | 43 |
| e | 23 | 21 | 39 | 43 | 0 |

So, \(a\) and \(b\) are the closest elements
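As a minimal sketch of this search, assuming the matrix is stored as a Python dict of dicts (the names `D1` and `closest_pair` are illustrative, not part of the original material), the closest pair can be found like this:

```python
from itertools import combinations

# Distance matrix D_1 from the example, stored as a dict of dicts.
labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D1 = {x: dict(zip(labels, row)) for x, row in zip(labels, rows)}

def closest_pair(D):
    """Return (x, y, distance) for the two distinct elements that are closest."""
    x, y = min(combinations(D, 2), key=lambda p: D[p[0]][p[1]])
    return x, y, D[x][y]

print(closest_pair(D1))  # ('a', 'b', 17)
```

If several pairs share the smallest distance, this sketch simply keeps the first one it meets.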

Before joining, we have five separate leaves \(a, b, c, d, e\) with no branches between them

We create a new node \((a,b)\)

We connect \(a\) and \(b\) to \((a,b)\), splitting their distance equally: each branch has length \(17/2=8.5\)

We build a new matrix \(D_2\) with the average distance of each element to \((a,b)\)

\[ \begin{aligned} D_2((a,b),c)= & \frac{D_1(a,c) + D_1(b,c)}{2}=\frac{21+30}{2}=25.5\\ D_2((a,b),d)= & \frac{D_1(a,d) + D_1(b,d)}{2}=\frac{31+34}{2}=32.5\\ D_2((a,b),e)= & \frac{D_1(a,e) + D_1(b,e)}{2}=\frac{23+21}{2}=22 \end{aligned} \]

The matrix \(D_2\) is

|   | (a,b) | c | d | e |
|---|---|---|---|---|
| (a,b) | 0 | **25.5** | **32.5** | **22** |
| c | **25.5** | 0 | *28* | *39* |
| d | **32.5** | *28* | 0 | *43* |
| e | **22** | *39* | *43* | 0 |

(values in **bold** are new, the ones in *italics* did not change)
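The joining step can be sketched in a few lines of Python. This is only an illustration under the assumption that the matrix is a dict of dicts; the function name `wpgma_join` is hypothetical. The new node takes the place of the first joined element and the second one is dropped.

```python
labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D1 = {x: dict(zip(labels, row)) for x, row in zip(labels, rows)}

def wpgma_join(D, x, y):
    """Return a new matrix where x and y are replaced by the node (x, y)."""
    new = (x, y)
    # The new node takes the place of x; y is dropped.
    order = [new if k == x else k for k in D if k != y]

    def dist(p, q):
        if p == q:
            return 0
        if new in (p, q):
            k = q if p == new else p
            # Plain average of the distances from x and from y.
            return (D[x][k] + D[y][k]) / 2
        return D[p][q]  # distances between untouched elements do not change

    return {p: {q: dist(p, q) for q in order} for p in order}

D2 = wpgma_join(D1, "a", "b")
for k in ["c", "d", "e"]:
    print(k, D2[("a", "b")][k])
# c 25.5
# d 32.5
# e 22.0
```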

Now the smallest distance is \(D_2((a,b),e)=22\).

We must join \((a,b)\) and \(e\)

\[ \begin{aligned} D_3(((a,b),e),c)= & \frac{D_2((a,b),c) + D_2(e,c)}{2}\\ = & \frac{25.5 + 39}{2}=32.25\\ D_3(((a,b),e),d)= & \frac{D_2((a,b),d) + D_2(e,d)}{2}\\ = & \frac{32.5 + 43}{2}=37.75 \end{aligned} \]

The matrix \(D_3\) is

|   | ((a,b),e) | c | d |
|---|---|---|---|
| ((a,b),e) | 0 | 32.25 | 37.75 |
| c | 32.25 | 0 | 28 |
| d | 37.75 | 28 | 0 |

Now the closest elements are \(c\) and \(d\)

We calculate the only remaining distance

\[ \begin{aligned} D_4((c,d),((a,b),e)) & = \frac{D_3(c,((a,b),e)) + D_3(d,((a,b),e))}{2}\\ & = \frac{32.25+37.75}{2} =35 \end{aligned} \]

The new matrix is

|   | ((a,b),e) | (c,d) |
|---|---|---|
| ((a,b),e) | 0 | 35 |
| (c,d) | 35 | 0 |

We can represent the complete tree by \[(((a,b),e),(c,d))\]

The parentheses show how to connect every element

But this notation is missing the *distance* from every element to its parent

We can write the distance to the parent after the node label

\[(((a\colon D_a, b\colon D_b)\colon D_{ab},e\colon D_e)\colon D_{abe},(c\colon D_c,d\colon D_d)\colon D_{cd});\]

The resulting tree can be written (including labels of internal nodes) as \[(((a\colon 8.5, b\colon 8.5)w\colon 2.5,e\colon 11)v\colon 6.5,(c\colon 14,d\colon 14)u\colon 3.5)r;\]
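This parenthesised notation is commonly known as the Newick format. The whole procedure, including branch lengths, can be sketched as follows. The sketch assumes that each new node is placed at a height equal to half the distance between the two nodes it joins, and that a branch length is the difference between the heights of parent and child; the function name `wpgma_newick` is illustrative, and internal node labels are omitted.

```python
from itertools import combinations

labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D1 = {x: dict(zip(labels, row)) for x, row in zip(labels, rows)}

def wpgma_newick(D):
    """Run WPGMA on D and return the tree as a Newick string with branch lengths."""
    height = {k: 0.0 for k in D}   # leaves are at height 0
    text = {k: k for k in D}       # Newick text of the subtree below each node
    while len(D) > 1:
        # Closest pair among the current nodes.
        x, y = min(combinations(D, 2), key=lambda p: D[p[0]][p[1]])
        new = (x, y)
        # The new node sits halfway between x and y.
        height[new] = D[x][y] / 2
        text[new] = "({}:{:g},{}:{:g})".format(
            text[x], height[new] - height[x],
            text[y], height[new] - height[y])
        # New matrix: the new node takes the place of x, y is dropped,
        # and distances to the new node are plain averages (WPGMA).
        order = [new if k == x else k for k in D if k != y]

        def dist(p, q):
            if p == q:
                return 0
            if new in (p, q):
                k = q if p == new else p
                return (D[x][k] + D[y][k]) / 2
            return D[p][q]

        D = {p: {q: dist(p, q) for q in order} for p in order}
    return text[next(iter(D))] + ";"

newick = wpgma_newick(D1)
print(newick)  # (((a:8.5,b:8.5):2.5,e:11):6.5,(c:14,d:14):3.5);
```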

This is called **W**eighted **P**air **G**roup **M**ethod with **A**rithmetic Mean (WPGMA)

There are other hierarchical clustering methods, depending on how we evaluate the distance between

- two single elements
- one element and a group
- two groups

Node *((a,b),e)* has three sequences, and *(c,d)* has two

WPGMA averages them as if they had the same size, but “bigger nodes” should have more weight

This is the **U**nweighted **P**air **G**roup **M**ethod with **A**rithmetic Mean (UPGMA)

The distance between branches \(A\) and \(B\), of sizes \({N_A}\) and \({N_B}\), is the average of all distances \(D(x,y)\) between pairs of objects \(x\) in \(A\) and \(y\) in \(B\)

When \(A\) and \(B\) are joined, the distance from the new branch to any other branch \(X\) is

\[D((A,B),X) = \frac{N_A \cdot D(A,X) + N_B \cdot D(B,X)}{N_A + N_B}\]
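The update formula translates directly into a small function. This is a minimal sketch; the name `upgma_distance` and its argument names are illustrative.

```python
def upgma_distance(d_ax, d_bx, n_a, n_b):
    """Distance from the joined branch (A, B) to X, weighting A and B by their sizes."""
    return (n_a * d_ax + n_b * d_bx) / (n_a + n_b)

# Two single leaves: the result is the plain average, as in WPGMA.
print(upgma_distance(21, 30, 1, 1))    # 25.5
# A branch with two leaves joined to a single leaf: the bigger branch counts twice.
print(upgma_distance(25.5, 39, 2, 1))  # 30.0
```

When both branches have the same size the formula reduces to the plain average used by WPGMA, so the two methods only differ once branches of different sizes are joined.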

\[\begin{aligned} D_2((a,b),c)& =\frac{D_1(a,c) \times 1 + D_1(b,c) \times 1}{1+1}\\ & =\frac{21+30}{2}=25.5\\ D_2((a,b),d)& =\frac{D_1(a,d) + D_1(b,d)}{2}=\frac{31+34}{2}=32.5\\ D_2((a,b),e)& =\frac{D_1(a,e) + D_1(b,e)}{2}=\frac{23+21}{2}=22 \end{aligned}\]

The first step is the same as before

\[ \begin{aligned} D_3(((a,b),e),c)&=\frac{D_2((a,b),c) \times 2 + D_2(e,c) \times 1}{2+1}=\\ & =\frac{25.5 \times 2 + 39 \times 1}{3}=30\\ D_3(((a,b),e),d)&=\frac{D_2((a,b),d) \times 2 + D_2(e,d) \times 1}{2+1}=\\ & =\frac{32.5 \times 2 + 43 \times 1}{3}=36 \end{aligned} \]

With UPGMA, the matrix \(D_3\) is

|   | ((a,b),e) | c | d |
|---|---|---|---|
| ((a,b),e) | 0 | 30 | 36 |
| c | 30 | 0 | 28 |
| d | 36 | 28 | 0 |
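A quick way to check these values is to go back to the definition and average the original distances over all pairs of leaves. The sketch below assumes the matrix \(D_1\) is stored as a dict of dicts; the name `average_linkage` is illustrative.

```python
from itertools import product

labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D1 = {x: dict(zip(labels, row)) for x, row in zip(labels, rows)}

def average_linkage(D, A, B):
    """Average of D[x][y] over all pairs with x in A and y in B."""
    return sum(D[x][y] for x, y in product(A, B)) / (len(A) * len(B))

print(average_linkage(D1, ["a", "b", "e"], ["c"]))  # 30.0
print(average_linkage(D1, ["a", "b", "e"], ["d"]))  # 36.0
```

Both numbers agree with the size-weighted update, which is why UPGMA never needs to go back to the original matrix.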

In practice UPGMA is more realistic than WPGMA

But both have a problem:

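The distances between leaves in the tree do not match the original distances: in the WPGMA tree the path from \(a\) to \(c\) has length \(8.5+2.5+6.5+3.5+14=35\), but \(D_1(a,c)=21\)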

Moreover, the mutation rate may be different for different branches

If we know the tree topology, we can find the branches’ lengths

We minimize the squared difference between observed distance \(D_{ij}\) and tree distance \(d_{ij}\)

\[\min_{d_{ij}} \sum_{i,j}(D_{ij}-d_{ij})^2\]
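To make the objective concrete, the sketch below evaluates it for the WPGMA tree built earlier. It assumes the tree is stored as a child-to-parent map with the branch lengths and internal labels (\(w, v, u, r\)) of the Newick string, and it counts each unordered pair of leaves once; the helper names are illustrative. It only evaluates the sum for one given tree and does not perform the minimization.

```python
from itertools import combinations

labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D1 = {x: dict(zip(labels, row)) for x, row in zip(labels, rows)}

# WPGMA tree from the example, as child -> (parent, branch length).
parent = {
    "a": ("w", 8.5), "b": ("w", 8.5),
    "w": ("v", 2.5), "e": ("v", 11),
    "c": ("u", 14), "d": ("u", 14),
    "v": ("r", 6.5), "u": ("r", 3.5),
}

def distances_to_ancestors(node):
    """Map node and each of its ancestors to the path length from node."""
    result = {node: 0.0}
    total = 0.0
    while node in parent:
        node, length = parent[node]
        total += length
        result[node] = total
    return result

def tree_distance(x, y):
    """Length of the path between leaves x and y along the tree."""
    up_x = distances_to_ancestors(x)
    up_y = distances_to_ancestors(y)
    # The path turns around at the closest common ancestor,
    # which is the common ancestor giving the smallest sum.
    return min(up_x[n] + up_y[n] for n in up_x if n in up_y)

squared_error = sum((D1[i][j] - tree_distance(i, j)) ** 2
                    for i, j in combinations(labels, 2))
print(tree_distance("a", "c"))  # 35.0
print(squared_error)            # 320.0
```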

But we still need to find the tree topology, and that is a *hard
problem*.

Neighbor joining is a heuristic to solve the minimization problem

Instead of joining the nearest nodes in the distance matrix, we look into a new matrix \(Q\)

\[Q_{ij} = (n-2) D_{ij} -\sum_k D_{ik} -\sum_k D_{kj}\]

This “neighbor-joining” distance can be negative

- Calculate \(R_i = \sum_j D_{ij}\) for all \(i\)
- Calculate \(Q_{ij} = (n-2) D_{ij} - R_i - R_j\) for all \(i \neq j\)
- Find the smallest \(Q_{ij}\)
- Join \(i\) and \(j\) into a new node \(u\) \[\begin{aligned} D_{iu} &= \frac{(n-2) D_{ij} + R_i -R_j}{2(n-2)}\\ D_{ju} &= D_{ij} - D_{iu}\\ D_{uk} &= \frac{1}{2}(D_{ik} + D_{jk} - D_{ij}) \end{aligned}\]
- Repeat until only two nodes remain, then connect them with a single branch
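One iteration of these steps can be sketched as a single function. This is an illustration under stated assumptions: the matrix is a dict of dicts with more than two nodes, ties are broken by taking the first smallest \(Q_{ij}\), and the new node takes the place of \(i\). The name `nj_step` and its return format are hypothetical.

```python
from itertools import combinations

labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D1 = {x: dict(zip(labels, row)) for x, row in zip(labels, rows)}

def nj_step(D, u):
    """One neighbor-joining iteration; returns (i, j, D_iu, D_ju, new matrix)."""
    n = len(D)  # must be larger than 2
    R = {i: sum(D[i].values()) for i in D}              # row sums R_i
    Q = {(i, j): (n - 2) * D[i][j] - R[i] - R[j]
         for i, j in combinations(D, 2)}                # Q_ij for i != j
    i, j = min(Q, key=Q.get)                            # first smallest Q_ij
    d_iu = D[i][j] / 2 + (R[i] - R[j]) / (2 * (n - 2))  # branch from i to u
    d_ju = D[i][j] - d_iu                               # branch from j to u
    # The new node u takes the place of i; j is dropped.
    order = [u if k == i else k for k in D if k != j]

    def dist(p, q):
        if p == q:
            return 0
        if u in (p, q):
            k = q if p == u else p
            return (D[i][k] + D[j][k] - D[i][j]) / 2
        return D[p][q]

    return i, j, d_iu, d_ju, {p: {q: dist(p, q) for q in order} for p in order}

i, j, d_iu, d_ju, D2 = nj_step(D1, "u")
print(i, j, d_iu, d_ju)  # c d 11.0 17.0
print(D2["u"])           # {'a': 12.0, 'b': 18.0, 'u': 0, 'e': 27.0}
```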

|   | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | 17 | 21 | 31 | 23 |
| b | 17 | 0 | 30 | 34 | 21 |
| c | 21 | 30 | 0 | 28 | 39 |
| d | 31 | 34 | 28 | 0 | 43 |
| e | 23 | 21 | 39 | 43 | 0 |

For each \(i,j \in \{a,b,c,d,e\}, i \neq j,\) we have \[Q(i,j) = (n-2) D(i,j) - R_i - R_j\]

|   | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | -143 | -147 | -135 | -149 |
| b | -143 | 0 | -130 | -136 | -165 |
| c | -147 | -130 | 0 | -170 | -127 |
| d | -135 | -136 | -170 | 0 | -133 |
| e | -149 | -165 | -127 | -133 | 0 |
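The entries of this table can be reproduced with a few lines of Python. The sketch assumes a dict-of-dicts matrix and sets the diagonal of \(Q\) to 0 only for display; the variable names are illustrative.

```python
labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D1 = {x: dict(zip(labels, row)) for x, row in zip(labels, rows)}

n = len(D1)
R = {i: sum(D1[i].values()) for i in D1}  # row sums R_i
Q = {i: {j: 0 if i == j else (n - 2) * D1[i][j] - R[i] - R[j] for j in D1}
     for i in D1}

for i in D1:
    print(i, [Q[i][j] for j in D1])

# Pair with the smallest Q value, looking only above the diagonal.
best = min(((i, j) for i in D1 for j in D1 if i < j),
           key=lambda p: Q[p[0]][p[1]])
print(best, Q[best[0]][best[1]])  # ('c', 'd') -170
```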

The smallest value is \(Q(c,d)=-170\), so we join \(c\) and \(d\) into a new node \(u\)

\[\begin{aligned} D(c, u) =& \frac{D(c,d)}{2} + \frac{R_c -R_d}{2(5-2)}\\ =& 11\\ D(d, u) = & D(c,d) - D(c,u)\\ =&17\\ \end{aligned}\]

For each \(k \in \{a,b,e\}\) we have \[D(u,k) = \frac{1}{2}(D(c,k) + D(d,k) - D(c,d))\]

The new distance matrix, with \(u\) in place of \(c\) and \(d\), is

|   | a | b | u | e |
|---|---|---|---|---|
| a | 0 | 17 | 12 | 23 |
| b | 17 | 0 | 18 | 21 |
| u | 12 | 18 | 0 | 27 |
| e | 23 | 21 | 27 | 0 |

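The matrix \(Q\) for these four nodes (\(n=4\)) is

|   | a | b | u | e |
|---|---|---|---|---|
| a | 0 | -74 | -85 | -77 |
| b | -74 | 0 | -77 | -85 |
| u | -85 | -77 | 0 | -74 |
| e | -77 | -85 | -74 | 0 |

The smallest value, \(-85\), appears twice: \(Q(a,u)\) and \(Q(b,e)\). We join \(a\) and \(u\) into a new node \(v\)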

\[\begin{aligned} D(a, v) =& 4.75\\ D(u, v) =& 7.25\\ \end{aligned}\]

The distance matrix after joining \(a\) and \(u\) into \(v\) is

|   | v | b | e |
|---|---|---|---|
| v | 0 | 11.5 | 19 |
| b | 11.5 | 0 | 21 |
| e | 19 | 21 | 0 |

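The matrix \(Q\) for these three nodes (\(n=3\)) is

|   | v | b | e |
|---|---|---|---|
| v | 0 | -51.5 | -51.5 |
| b | -51.5 | 0 | -51.5 |
| e | -51.5 | -51.5 | 0 |

All three pairs tie at \(-51.5\). We join \(v\) and \(b\) into a new node \(w\)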

\[\begin{aligned} D(v, w) = & 4.75\\ D(b, w) = & 6.75\\ \end{aligned}\]

|   | w | e |
|---|---|---|
| w | 0 | 14.25 |
| e | 14.25 | 0 |

thus \(D(w, e) = 14.25\)
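The whole example can be replayed with a short loop. This is a sketch under the same assumptions as the worked example: ties in \(Q\) are broken by taking the first smallest entry, each new node takes the place of the first joined element, and the names \(u, v, w\) are supplied by the caller. The function names are illustrative.

```python
from itertools import combinations

labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D1 = {x: dict(zip(labels, row)) for x, row in zip(labels, rows)}

def nj_step(D, u):
    """One neighbor-joining iteration; returns (i, j, D_iu, D_ju, new matrix)."""
    n = len(D)
    R = {i: sum(D[i].values()) for i in D}
    Q = {(i, j): (n - 2) * D[i][j] - R[i] - R[j] for i, j in combinations(D, 2)}
    i, j = min(Q, key=Q.get)  # first smallest Q_ij
    d_iu = D[i][j] / 2 + (R[i] - R[j]) / (2 * (n - 2))
    d_ju = D[i][j] - d_iu
    order = [u if k == i else k for k in D if k != j]

    def dist(p, q):
        if p == q:
            return 0
        if u in (p, q):
            k = q if p == u else p
            return (D[i][k] + D[j][k] - D[i][j]) / 2
        return D[p][q]

    return i, j, d_iu, d_ju, {p: {q: dist(p, q) for q in order} for p in order}

def neighbor_joining(D, new_names):
    """Join nodes until two remain; return the branches as (node, node, length)."""
    branches = []
    for u in new_names:
        if len(D) <= 2:
            break
        i, j, d_iu, d_ju, D = nj_step(D, u)
        branches += [(i, u, d_iu), (j, u, d_ju)]
    x, y = D  # the last two nodes are connected by a single branch
    branches.append((x, y, D[x][y]))
    return branches

branches = neighbor_joining(D1, ["u", "v", "w"])
for branch in branches:
    print(branch)
# ('c', 'u', 11.0)
# ('d', 'u', 17.0)
# ('a', 'v', 4.75)
# ('u', 'v', 7.25)
# ('v', 'w', 4.75)
# ('b', 'w', 6.75)
# ('w', 'e', 14.25)
```

A different tie-breaking rule would join other pairs in the second and third steps, so the order of the joins depends on that choice.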

Exercise: redo all the calculations for these trees