Class 7. Trees representing distance

Bioinformatics

Andrés Aravena

November 30, 2022

Intro to phylogeny

Plan

We will learn to build phylogenetic trees
We start by building trees based on the distances between species
We assume that we know the distances
Distances are hard to calculate
We will study that later

We have some elements to organize

We know the distance between elements

Distances are in a matrix that we call \(D_1\)

	a	b	c	d	e
a	0	17	21	31	23
b	17	0	30	34	21
c	21	30	0	28	39
d	31	34	28	0	43
e	23	21	39	43	0

Example data from Wikipedia

What is a distance?

What are its properties?

Are all matrices a representation of some distance?

Which matrices can be representations of distances?
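One common answer, assumed here because the slides leave the question open, is that a distance must satisfy the metric axioms: zero distance to itself, non-negative, symmetric, and the triangle inequality. The sketch below checks these conditions on \(D_1\); the function name `is_distance_matrix` is only illustrative.

```python
from itertools import permutations

def is_distance_matrix(D):
    """Check the usual metric axioms on a square matrix given as a list of rows."""
    n = len(D)
    for i in range(n):
        if D[i][i] != 0:                            # distance to itself is zero
            return False
        for j in range(n):
            if D[i][j] < 0 or D[i][j] != D[j][i]:   # non-negative and symmetric
                return False
    # triangle inequality: a detour through k is never shorter than the direct value
    return all(D[i][j] <= D[i][k] + D[k][j]
               for i, j, k in permutations(range(n), 3))

D1 = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
print(is_distance_matrix(D1))   # True
```

A matrix that fails any of the checks, for example a non-symmetric one, is rejected.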

Distance between nodes

The length of each edge \((i,j)\) is \(\text{len}(i,j)\)

We can calculate the distance between any pair of nodes

\[ D(i,u) = \begin{cases} \text{len}(i,u)&\text{if }i\text{ is neighbor of }u\\ \min_{j} (\text{len}(i,j) + D(j,u))&\text{otherwise } \end{cases} \]

The minimum is taken over all neighbors \(j\) of \(i\)

Example

	a	b	c	d	e	f
a	0	13	11	15	21	22
b	13	0	2	6	12	13
c	11	2	0	4	10	11
d	15	6	4	0	6	7
e	21	12	10	6	0	13
f	22	13	11	7	13	0
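A minimal sketch of how such a table can be computed from edge lengths. The drawing is not reproduced in this text, so the edge list `EDGES` is a reconstruction (an assumption): it is one graph whose path lengths give exactly the table above. The search used here is Dijkstra's algorithm, which finds the same minimum as the recursive definition.

```python
import heapq

# Assumed edge list (reconstructed from the table, not taken from the slides)
EDGES = [("a", "c", 11), ("b", "c", 2), ("c", "d", 4), ("d", "e", 6), ("d", "f", 7)]

def neighbors(edges):
    """Build {node: [(neighbor, length), ...]} for an undirected graph."""
    adj = {}
    for i, j, length in edges:
        adj.setdefault(i, []).append((j, length))
        adj.setdefault(j, []).append((i, length))
    return adj

def distances_from(start, adj):
    """Shortest path length from start to every node (Dijkstra's algorithm)."""
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        d, i = heapq.heappop(queue)
        if d > dist[i]:
            continue                      # stale queue entry, a shorter path was found
        for j, length in adj[i]:
            if d + length < dist.get(j, float("inf")):
                dist[j] = d + length
                heapq.heappush(queue, (dist[j], j))
    return dist

adj = neighbors(EDGES)
for i in sorted(adj):
    row = distances_from(i, adj)
    print(i, [row[j] for j in sorted(adj)])
# first line printed: a [0, 13, 11, 15, 21, 22]
```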

Not all distances correspond to a nice graph

Let’s change only one value

	a	b	c	d	e	f
a	0	13	9	15	21	22
b	13	0	2	6	12	13
c	9	2	0	4	10	11
d	15	6	4	0	6	7
e	21	12	10	6	0	13
f	22	13	11	7	13	0

It is still symmetric with zeros on the diagonal, but it cannot be drawn nicely: now \(D(a,b)=13\) is larger than \(D(a,c)+D(c,b)=9+2=11\)

Method 1

Hierarchical clustering

The idea is to group similar nodes in the same branch

The smallest distance in \(D_1\) is \(D_1 (a,b)=17\)

	a	b	c	d	e
a	0	17	21	31	23
b	17	0	30	34	21
c	21	30	0	28	39
d	31	34	28	0	43
e	23	21	39	43	0

So, \(a\) and \(b\) are the closest elements

We join the closest elements

Before joining, we have

We create a new node \((a,b)\)

We connect \(a\) and \(b\) to \((a,b)\), splitting their distance

Now we update the other distances

We build a new matrix \(D_2\) with the average distance of each element to \((a,b)\)

\[ \begin{aligned} D_2((a,b),c)= & \frac{D_1(a,c) + D_1(b,c)}{2}=\frac{21+30}{2}=25.5\\ D_2((a,b),d)= & \frac{D_1(a,d) + D_1(b,d)}{2}=\frac{31+34}{2}=32.5\\ D_2((a,b),e)= & \frac{D_1(a,e) + D_1(b,e)}{2}=\frac{23+21}{2}=22 \end{aligned} \]

Updated matrix

The matrix \(D_2\) is

	(a,b)	c	d	e
(a,b)	0	25.5	32.5	22
c	25.5	0	28	39
d	32.5	28	0	43
e	22	39	43	0
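The update from \(D_1\) to \(D_2\) can be sketched as a single function. This is an illustration, not a reference implementation: the dict-of-dicts layout and the name `merge_wpgma` are assumptions made for the example.

```python
def merge_wpgma(D, x, y):
    """Return a new table where clusters x and y are replaced by the node (x, y).

    D is a dict of dicts: D[p][q] is the distance between clusters p and q.
    """
    new = (x, y)
    others = [k for k in D if k not in (x, y)]
    D2 = {k: {m: D[k][m] for m in others} for k in others}   # values that do not change
    D2[new] = {new: 0}
    for k in others:
        D2[new][k] = D2[k][new] = (D[x][k] + D[y][k]) / 2    # simple average
    return D2

labels = "abcde"
rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
D1 = {p: {q: rows[i][j] for j, q in enumerate(labels)} for i, p in enumerate(labels)}

D2 = merge_wpgma(D1, "a", "b")
print(D2[("a", "b")])   # {('a', 'b'): 0, 'c': 25.5, 'd': 32.5, 'e': 22.0}
```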

(the values in the row and column of \((a,b)\) are new, the others did not change)

Now the smallest distance is \(D_2 ((a,b),e)=22\).
We must join \((a,b)\) and \(e\)

Correct distance for new node

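The previous slide shows that the average of the distances \((a,e)\) and \((b,e)\) is 22, so \(e\) is at distance 22/2 from the new node \(((a,b),e)\)

But the node \((a,b)\) is already at distance 17/2 from \(a\) and \(b\)

So the length of the branch between \((a,b)\) and the new node \(((a,b),e)\) is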

\[ \frac{D_2((a,b),e)}{2} - \frac{D_1(a,b)}{2} =\frac{22-17}{2} = 2.5 \]

Notice that this distance is used in the drawing but not in the matrix

We update the tree again

And we update the distances

\[ \begin{aligned} D_3(((a,b),e),c)= & \frac{D_2((a,b),c) + D_2(e,c)}{2}\\ = & \frac{25.5 + 39}{2}=32.25\\ D_3(((a,b),e),d)= & \frac{D_2((a,b),d) + D_2(e,d)}{2}\\ = & \frac{32.5 + 43}{2}=37.75 \end{aligned} \]

We have a new matrix

The matrix \(D_3\) is

	((a,b),e)	c	d
((a,b),e)	0	32.25	37.75
c	32.25	0	28
d	37.75	28	0

Now the closest elements are \(c\) and \(d\)

The distance from \(c\) and \(d\) to the new node \((c,d)\) is 28/2

No correction is necessary, since there are no nodes below them

We join \(c\) and \(d\)

We calculate the distance matrix \(D_4\)

We calculate the only remaining distance

\[ \begin{aligned} D_4((c,d),((a,b),e)) & = \frac{D_3(c,((a,b),e)) + D_3(d,((a,b),e))}{2}\\ & = \frac{32.25+37.75}{2} =35 \end{aligned} \]

The new matrix is

	((a,b),e)	(c,d)
((a,b),e)	0	35
(c,d)	35	0

Finally we get a full tree
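The whole procedure can be sketched as a loop: find the closest pair, place the new node at half their distance, average the remaining distances, and repeat. The representation (nested tuples for the tree, a `heights` dictionary) is an assumption of this sketch; ties are broken by taking the first pair found.

```python
from itertools import combinations

def wpgma(labels, rows):
    """Return (tree, heights): a nested-tuple tree and the height of every node."""
    D = {frozenset((p, q)): rows[i][j]
         for (i, p), (j, q) in combinations(enumerate(labels), 2)}
    nodes = list(labels)
    heights = {p: 0 for p in labels}              # leaves sit at height 0
    while len(nodes) > 1:
        # closest pair among the current nodes
        x, y = min(combinations(nodes, 2), key=lambda pair: D[frozenset(pair)])
        new = (x, y)
        heights[new] = D[frozenset((x, y))] / 2   # split the distance in two
        for k in nodes:
            if k not in (x, y):
                # plain average, whatever the sizes of x and y
                D[frozenset((new, k))] = (D[frozenset((x, k))] + D[frozenset((y, k))]) / 2
        nodes[nodes.index(x)] = new               # the new node takes the place of x
        nodes.remove(y)
    return nodes[0], heights

D1 = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
tree, heights = wpgma("abcde", D1)
print(tree)            # ((('a', 'b'), 'e'), ('c', 'd'))
print(heights[tree])   # 17.5
```

The length of each branch is the height of the parent minus the height of the child, for example \(11 - 8.5 = 2.5\) between \((a,b)\) and \(((a,b),e)\).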

Describing the tree

We can represent the complete tree by \[(((a,b),e),(c,d))\]

The parentheses show how to connect every element

But the distances are still missing

We can write the distance to the parent after the node label

\[(((a\colon D_a, b\colon D_b)\colon D_{ab},e\colon D_e)\colon D_{abe},(c\colon D_c,d\colon D_d)\colon D_{cd});\]

This is called Newick format

The resulting tree

can be written (including labels of internal nodes) as \[(((a\colon 8.5, b\colon 8.5)w\colon 2.5,e\colon 11)v\colon 6.5,(c\colon 14,d\colon 14)u\colon 3.5)r;\]
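A small sketch of how such a string can be produced. It assumes the tree is stored as nested tuples and that the height of every node is known (the values below are the WPGMA heights 8.5, 11, 14 and 17.5); the function name `newick` is illustrative. Internal node labels such as \(w, v, u, r\) are not written by this version.

```python
def newick(node, heights, parent_height=None):
    """Write a nested-tuple tree in Newick format.

    The branch length of a node is its parent's height minus its own height.
    """
    if isinstance(node, tuple):
        children = ",".join(newick(child, heights, heights[node]) for child in node)
        text = "(" + children + ")"
    else:
        text = node
    if parent_height is None:             # the root has no branch above it
        return text + ";"
    return f"{text}:{parent_height - heights[node]:g}"

ab = ("a", "b")
abe = (ab, "e")
cd = ("c", "d")
root = (abe, cd)
heights = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 0,
           ab: 8.5, abe: 11, cd: 14, root: 17.5}

print(newick(root, heights))
# (((a:8.5,b:8.5):2.5,e:11):6.5,(c:14,d:14):3.5);
```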

Exercise: encode this tree

This is hierarchical clustering

This is called Weighted Pair Group Method with Arithmetic Mean (WPGMA)

There are other hierarchical clustering methods, depending on how we evaluate the distance between

two single elements
one element and a group
two groups

One problem

We mix groups of different size

Node ((a,b),e) has three sequences, and (c,d) has two

“bigger nodes” should have more weight

Method 2

Alternative: UPGMA

Unweighted pair group method with arithmetic mean

The distance between branch \(A\) and \(B\), each of size \({N_A}\) and \({N_B}\), is the average of all distances \(D(x,y)\) between pairs of objects in \(A\) and in \(B\)

\[ D((A,B),X) = \frac{N_A \cdot D(A,X) + N_B \cdot D(B,X)}{N_A + N_B} \]

Redoing our example

\[ \begin{aligned} D_2((a,b),c)& =\frac{D_1(a,c) \times 1 + D_1(b,c) \times 1}{1+1}\\ & =\frac{21+30}{2}=25.5\\ D_2((a,b),d)& =\frac{D_1(a,d) + D_1(b,d)}{2}=\frac{31+34}{2}=32.5\\ D_2((a,b),e)& =\frac{D_1(a,e) + D_1(b,e)}{2}=\frac{23+21}{2}=22 \end{aligned} \]

The first step is the same as before

Next step is different

\[ \begin{aligned} D_3(((a,b),e),c)&=\frac{D_2((a,b),c) \times 2 + D_2(e,c) \times 1}{2+1}\\ & =\frac{25.5 \times 2 + 39 \times 1}{3}=30\\ D_3(((a,b),e),d)&=\frac{D_2((a,b),d) \times 2 + D_2(e,d) \times 1}{2+1}\\ & =\frac{32.5 \times 2 + 43 \times 1}{3}=36 \end{aligned} \]

Matrix \(D_3\) in UPGMA

	((a,b),e)	c	d
((a,b),e)	0	30	36
c	30	0	28
d	36	28	0
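The size-weighted update can be sketched as follows, starting from \(D_2\). The names `merge_upgma` and `size`, and the label `"ab"` for the node \((a,b)\), are assumptions of this example; `size` counts the leaves under each node.

```python
def merge_upgma(D, size, x, y):
    """Merge clusters x and y, weighting each one by its number of leaves.

    D is a dict of dicts of distances; size is updated in place with the new node.
    """
    new = (x, y)
    others = [k for k in D if k not in (x, y)]
    D2 = {k: {m: D[k][m] for m in others} for k in others}
    D2[new] = {new: 0}
    for k in others:
        weighted = size[x] * D[x][k] + size[y] * D[y][k]
        D2[new][k] = D2[k][new] = weighted / (size[x] + size[y])
    size[new] = size[x] + size[y]
    return D2

# D_2 from the first step; "ab" stands for the node (a,b), which has two leaves
D2 = {
    "ab": {"ab": 0, "c": 25.5, "d": 32.5, "e": 22},
    "c":  {"ab": 25.5, "c": 0, "d": 28, "e": 39},
    "d":  {"ab": 32.5, "c": 28, "d": 0, "e": 43},
    "e":  {"ab": 22, "c": 39, "d": 43, "e": 0},
}
size = {"ab": 2, "c": 1, "d": 1, "e": 1}

D3 = merge_upgma(D2, size, "ab", "e")
print(D3[("ab", "e")]["c"], D3[("ab", "e")]["d"])   # 30.0 36.0
```

With all sizes equal to 1 this gives the same values as the plain average used in WPGMA.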

Comparing WPGMA against UPGMA

UPGMA is used much more often than WPGMA

In practice UPGMA is more realistic than WPGMA

But both have a problem:

The distances between leaves do not match the original distances

Moreover, the mutation rate may be different for different branches

Method 3

Minimization problem

If we know the tree topology, we can find the branches’ lengths

We minimize the squared difference between observed distance \(D(i,j)\) and tree distance \(d_{ij}\)

\[\min_{d_{ij}} \sum_{i,j}(D(i,j)-d_{ij})^2\]

But we still need to find the tree topology, and that is a hard problem.
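As an illustration of the quantity being minimized, the sketch below evaluates the sum for the WPGMA tree of Method 1. It does not perform the minimization. The `MERGES` table and the function names are assumptions of this example; the heights are the ones found before (8.5, 11, 14 and 17.5), and each pair of leaves is counted once.

```python
from itertools import combinations

LABELS = "abcde"
D1 = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]

# Internal nodes of the WPGMA tree: (set of leaves below the node, height of the node)
MERGES = [({"a", "b"}, 8.5), ({"a", "b", "e"}, 11),
          ({"c", "d"}, 14), ({"a", "b", "c", "d", "e"}, 17.5)]

def tree_distance(p, q):
    """Path length between leaves p and q: twice the height of their lowest common ancestor."""
    return 2 * min(h for group, h in MERGES if p in group and q in group)

def squared_error(labels, rows):
    """Sum of squared differences between observed and tree distances, one term per pair."""
    return sum((rows[i][j] - tree_distance(p, q)) ** 2
               for (i, p), (j, q) in combinations(enumerate(labels), 2))

print(tree_distance("a", "e"))       # 22, while the observed value is 23
print(squared_error(LABELS, D1))     # 320.0
```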

Why neighbor joining formula

Let \(a\) and \(b\) be two siblings in a nice tree, let \(c\) be their parent, and let \(e\) be any other leaf

\[ \begin{aligned} D(a,b) =& D(a,c) + D(c,b)\qquad(\text{eq. }1)\\ D(a,e) =& D(a,c) + D(c,e)\qquad(\text{eq. }2)\\ D(b,e) =& D(b,c) + D(c,e)\qquad(\text{eq. }3)\\ \end{aligned} \]

Result

\[D(c,e) =\frac{D(a,e)+D(b,e)-D(a,b)}{2}\]
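One way to see where this result comes from, using only eq. 1 to eq. 3: add eq. 2 and eq. 3, then subtract eq. 1. The terms \(D(a,c)\) and \(D(b,c)\) cancel, since \(D(c,b)=D(b,c)\)

\[ \begin{aligned} D(a,e)+D(b,e)-D(a,b) =& D(a,c)+D(c,e)+D(b,c)+D(c,e)-D(a,c)-D(c,b)\\ =& 2D(c,e) \end{aligned} \]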

So if we only know the distances between leaves \(a, b\) and \(e,\) and we add internal node \(c,\) this is how we find the distance \(D(c,e)\)

Neighbor Joining is trying to make a nice tree

Neighbor joining

This is a heuristic to solve the minimization problem

Instead of joining the nearest nodes in the distance matrix, we look into a new matrix \(Q\)

\[Q(i,j) = (n-2) D(i,j) -\sum_k D(i,k) -\sum_k D(k,j)\]

This “neighbor-joining” distance can be negative

Neighbor joining

Method

calculate \(R_i = \sum_j D(i,j)\) for all \(i\)
calculate \(Q(i,j) = (n-2) D(i,j) - R_i - R_j\)
Find smallest \(Q(i,j)\)
Join \(i\) and \(j\) into a new node \(u\) \[ \begin{aligned} D(i,u) &= \frac{(n-2) D(i,j) + R_i -R_j}{2(n-2)}\\ D(u,k) &= \frac{1}{2}(D(i,k) + D(j,k) - D(i,j)) \end{aligned} \]
Repeat until only two nodes remain, then connect them

We begin with the same initial condition

	a	b	c	d	e
a	0	17	21	31	23
b	17	0	30	34	21
c	21	30	0	28	39
d	31	34	28	0	43
e	23	21	39	43	0

First we calculate \(Q_1\)

For each \(i,j \in \{a,b,c,d,e\}, i \neq j,\) we have \[Q(i,j) = (n-2) D(i,j) - R_i - R_j\]

	a	b	c	d	e
a	0	-143	-147	-135	-149
b	-143	0	-130	-136	-165
c	-147	-130	0	-170	-127
d	-135	-136	-170	0	-133
e	-149	-165	-127	-133	0
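The table above can be reproduced with a few lines. This is only a sketch: the matrix is stored as a list of rows in the order \(a, b, c, d, e\), and the name `q_matrix` is illustrative.

```python
def q_matrix(D):
    """Q(i,j) = (n-2) D(i,j) - R_i - R_j, with zeros on the diagonal."""
    n = len(D)
    R = [sum(row) for row in D]           # R_i is the sum of row i
    return [[0 if i == j else (n - 2) * D[i][j] - R[i] - R[j] for j in range(n)]
            for i in range(n)]

D1 = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
Q1 = q_matrix(D1)
for row in Q1:
    print(row)
# first line printed: [0, -143, -147, -135, -149]
```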

The smallest value of \(Q_1\) is \(Q_1(c,d)=-170\), so we join \(c\) and \(d\)

New node \(u\) connects \(c\) and \(d\)

\[ \begin{aligned} D(c, u) =& \frac{D(c,d)}{2} + \frac{R_c -R_d}{2(5-2)}\\ =& 11\\ \end{aligned} \]

\[ \begin{aligned} D(d, u) = & D(c,d) - D(c,u)\\ =&17\\ \end{aligned} \]

Matrix \(D_2\)

For each \(k \in \{a,b,e\}\) we have \[D(u,k) = \frac{1}{2}(D(c,k) + D(d,k) - D(c,d))\]

	a	b	u	e
a	0	17	12	23
b	17	0	18	21
u	12	18	0	27
e	23	21	27	0

Matrix \(Q_2\)

	a	b	u	e
a	0	-74	-85	-77
b	-74	0	-77	-85
u	-85	-77	0	-74
e	-77	-85	-74	0

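New node \(v\) connects \(a\) and \(u\)

The smallest value of \(Q_2\) is \(-85\), shared by the pairs \((a,u)\) and \((b,e)\); we choose \((a,u)\)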

\[ \begin{aligned} D(a, v) =& 4.75\\ D(u, v) =& 7.25\\ \end{aligned} \]

Matrix \(D_3\)

	v	b	e
v	0	11.5	19
b	11.5	0	21
e	19	21	0

Matrix \(Q_3\)

	v	b	e
v	0	-51.5	-51.5
b	-51.5	0	-51.5
e	-51.5	-51.5	0

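New node \(w\) connects \(v\) and \(b\)

All pairs have the same value in \(Q_3\); with only three nodes left, any choice gives the same branch lengths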

\[ \begin{aligned} D(v, w) = & 4.75\\ D(b, w) = & 6.75\\ \end{aligned} \]

Matrix \(D_4\)

	w	e
w	0	14.25
e	14.25	0

thus \(D(w, e) = 14.25\)

Finally, we join \(w\) and \(e\)
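The complete neighbor-joining walk-through can be sketched as a loop that returns the list of branches. This is an illustration under some assumptions: the function name, the dict-of-dicts layout and the internal node names taken from `"uvwxyz"` (so at most eight leaves) are choices of this example, and ties are broken by taking the first pair found.

```python
from itertools import combinations

def neighbor_joining(labels, rows):
    """Return a list of (node, node, branch_length) built by neighbor joining."""
    D = {p: {q: rows[i][j] for j, q in enumerate(labels)} for i, p in enumerate(labels)}
    names = iter("uvwxyz")                # hypothetical names for the internal nodes
    edges = []
    while len(D) > 2:
        n = len(D)
        R = {p: sum(D[p].values()) for p in D}
        # pair with the smallest Q(i,j) = (n-2) D(i,j) - R_i - R_j
        i, j = min(combinations(D, 2),
                   key=lambda pair: (n - 2) * D[pair[0]][pair[1]] - R[pair[0]] - R[pair[1]])
        u = next(names)
        d_iu = D[i][j] / 2 + (R[i] - R[j]) / (2 * (n - 2))
        edges += [(i, u, d_iu), (j, u, D[i][j] - d_iu)]
        # distances from the new node u to every remaining node k
        new_row = {k: (D[i][k] + D[j][k] - D[i][j]) / 2 for k in D if k not in (i, j)}
        for k in (i, j):
            del D[k]
        for k in D:
            del D[k][i], D[k][j]
            D[k][u] = new_row[k]
        D[u] = {**new_row, u: 0}
    p, q = D                              # the last two nodes are connected directly
    edges.append((p, q, D[p][q]))
    return edges

D1 = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
edges = neighbor_joining("abcde", D1)
for edge in edges:
    print(edge)
```

In the last step all three pairs are tied, so this sketch joins \(b\) and \(e\) first instead of \(v\) and \(b\). The branch lengths are the same as in the slides: 4.75 for \(v\), 6.75 for \(b\) and 14.25 for \(e\).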

Exercise

Redo all the calculations with each method for these two matrices

	a	b	c	d	e	f
a	0	13	11	15	21	22
b	13	0	2	6	12	13
c	11	2	0	4	10	11
d	15	6	4	0	6	7
e	21	12	10	6	0	13
f	22	13	11	7	13	0

	a	b	c	d	e	f
a	0	13	9	15	21	22
b	13	0	2	6	12	13
c	9	2	0	4	10	11
d	15	6	4	0	6	7
e	21	12	10	6	0	13
f	22	13	11	7	13	0
