# Intro to phylogeny

## Plan

• We will learn to build phylogenetic trees
• We start building trees based on the distances between species
• We assume that we know the distances
• Distances are hard to calculate
• We will study that later

## We have some elements to organize

We know the distance between elements

## We know the distance between elements

Distances are in a matrix that we call $$D_1$$

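|   | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | 17 | 21 | 31 | 23 |
| b | 17 | 0 | 30 | 34 | 21 |
| c | 21 | 30 | 0 | 28 | 39 |
| d | 31 | 34 | 28 | 0 | 43 |
| e | 23 | 21 | 39 | 43 | 0 |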

Example data from Wikipedia

## What is a distance?

What are its properties?

## Are all matrices a representation of some distance?

Which matrices can be representations of distances?

## Distance between nodes

The length of each edge $$(i,j)$$ is $$\text{len}(i,j)$$

We can calculate the distance between any pair of nodes

$D(i,u) = \begin{cases} \text{len}(i,u)&\text{if }i\text{ is a neighbor of }u\\ \min_{j} (\text{len}(i,j) + D(j,u))&\text{otherwise } \end{cases}$

The minimum is taken over all neighbors $$j$$ of $$i$$

## Example

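|   | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| a | 0 | 13 | 11 | 15 | 21 | 22 |
| b | 13 | 0 | 2 | 6 | 12 | 13 |
| c | 11 | 2 | 0 | 4 | 10 | 11 |
| d | 15 | 6 | 4 | 0 | 6 | 7 |
| e | 21 | 12 | 10 | 6 | 0 | 13 |
| f | 22 | 13 | 11 | 7 | 13 | 0 |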
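The matrix above can be reproduced from a small graph. The sketch below is only an illustration: the edge lengths are inferred from the matrix (the original drawing is not available here), and it uses the Floyd-Warshall algorithm in place of the recursive formula, since both give the shortest path between every pair of nodes.

```python
import math

# Assumed edge lengths, inferred from the matrix (not taken from a figure).
edges = {("a", "c"): 11, ("b", "c"): 2, ("c", "d"): 4, ("d", "e"): 6, ("d", "f"): 7}
nodes = sorted({n for edge in edges for n in edge})

# Start with direct edges only; every other pair is unreachable for now.
dist = {(i, j): 0 if i == j else math.inf for i in nodes for j in nodes}
for (i, j), length in edges.items():
    dist[i, j] = dist[j, i] = length

# Floyd-Warshall: allow each node k in turn as an intermediate stop.
for k in nodes:
    for i in nodes:
        for j in nodes:
            dist[i, j] = min(dist[i, j], dist[i, k] + dist[k, j])

for i in nodes:
    print(i, [dist[i, j] for j in nodes])
```

Each printed row should agree with the corresponding row of the example matrix.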

## Not all distances correspond to a nice graph

Let’s change only one value: $$D(a,c)$$ goes from 11 to 9

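|   | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| a | 0 | 13 | 9 | 15 | 21 | 22 |
| b | 13 | 0 | 2 | 6 | 12 | 13 |
| c | 9 | 2 | 0 | 4 | 10 | 11 |
| d | 15 | 6 | 4 | 0 | 6 | 7 |
| e | 21 | 12 | 10 | 6 | 0 | 13 |
| f | 22 | 13 | 11 | 7 | 13 | 0 |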

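It is still symmetric with zeros on the diagonal, but it cannot be drawn nicely: now $$D(a,c) + D(c,b) = 9 + 2 = 11$$ is smaller than $$D(a,b) = 13$$, so the triangle inequality fails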
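One way to compare the two matrices is to test the triangle inequality $$D(i,j) ≤ D(i,k) + D(k,j)$$ for every triple. The sketch below assumes that this inequality is one of the properties we ask of a distance; the helper name `triangle_violations` is made up for this example.

```python
from itertools import permutations

def triangle_violations(labels, rows):
    """Return every (i, j, k) with D(i,j) > D(i,k) + D(k,j)."""
    d = {(x, y): rows[r][c] for r, x in enumerate(labels) for c, y in enumerate(labels)}
    return [(i, j, k) for i, j, k in permutations(labels, 3)
            if d[i, j] > d[i, k] + d[k, j]]

labels = "abcdef"
original = [
    [0, 13, 11, 15, 21, 22],
    [13, 0, 2, 6, 12, 13],
    [11, 2, 0, 4, 10, 11],
    [15, 6, 4, 0, 6, 7],
    [21, 12, 10, 6, 0, 13],
    [22, 13, 11, 7, 13, 0],
]
# Same matrix with D(a,c) changed from 11 to 9, kept symmetric.
changed = [row[:] for row in original]
changed[0][2] = changed[2][0] = 9

print(triangle_violations(labels, original))  # no violations expected
print(triangle_violations(labels, changed))   # includes ('a', 'b', 'c')
```

A triple `(i, j, k)` in the output means that going from `i` to `j` through `k` is shorter than the value stored for `D(i,j)`.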

# Method 1

## Hierarchical clustering

The idea is to group similar nodes in the same branch

The smallest distance in $$D_1$$ is $$D_1 (a,b)=17$$

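|   | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | 17 | 21 | 31 | 23 |
| b | 17 | 0 | 30 | 34 | 21 |
| c | 21 | 30 | 0 | 28 | 39 |
| d | 31 | 34 | 28 | 0 | 43 |
| e | 23 | 21 | 39 | 43 | 0 |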

So, $$a$$ and $$b$$ are the closest elements

## We join the closest elements

Before joining, we have five separate elements: $$a$$, $$b$$, $$c$$, $$d$$ and $$e$$

We create a new node $$(a,b)$$

We connect $$a$$ and $$b$$ to $$(a,b)$$, splitting their distance: each branch has length 17/2 = 8.5

## Now we update the other distances

We build a new matrix $$D_2$$ with the average distance of each element to $$(a,b)$$

\begin{aligned} D_2((a,b),c)= & \frac{D_1(a,c) + D_1(b,c)}{2}=\frac{21+30}{2}=25.5\\ D_2((a,b),d)= & \frac{D_1(a,d) + D_1(b,d)}{2}=\frac{31+34}{2}=32.5\\ D_2((a,b),e)= & \frac{D_1(a,e) + D_1(b,e)}{2}=\frac{23+21}{2}=22 \end{aligned}

## Updated matrix

The matrix $$D_2$$ is

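|   | (a,b) | c | d | e |
|---|---|---|---|---|
| (a,b) | 0 | **25.5** | **32.5** | **22** |
| c | **25.5** | 0 | *28* | *39* |
| d | **32.5** | *28* | 0 | *43* |
| e | **22** | *39* | *43* | 0 |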
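The update from $$D_1$$ to $$D_2$$ can be sketched in a few lines. This is an illustration only: storing each pair as a `frozenset` key and the function name `wpgma_merge` are choices made for this example.

```python
from itertools import combinations

labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
# Store each unordered pair once, keyed by a frozenset of the two labels.
d1 = {frozenset((labels[i], labels[j])): rows[i][j]
      for i, j in combinations(range(len(labels)), 2)}

def wpgma_merge(dist, x, y):
    """Return the matrix after joining x and y into the new node (x, y)."""
    new = (x, y)
    rest = {k for pair in dist for k in pair} - {x, y}
    # Distances that do not involve x or y are copied unchanged.
    out = {pair: v for pair, v in dist.items() if x not in pair and y not in pair}
    # Distance to the new node: plain average of the two old distances.
    for k in rest:
        out[frozenset((new, k))] = (dist[frozenset((x, k))] + dist[frozenset((y, k))]) / 2
    return out

closest = min(d1, key=d1.get)        # the pair with the smallest distance
d2 = wpgma_merge(d1, "a", "b")
for pair, value in d2.items():
    print(sorted(pair, key=str), value)
```

Calling `wpgma_merge` again on `d2` with `("a", "b")` and `"e"` would give the next matrix in the same way.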

(values in bold are new, the ones in italics did not change)

Now the smallest distance is $$D_2 ((a,b),e)=22$$.
We must join $$(a,b)$$ and $$e$$

## Correct distance for new node

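The last slide shows that the average distance from $$a$$ and $$b$$ to $$e$$ is 22, so the new node $$((a,b),e)$$ is at distance 22/2 from $$e$$, and also from $$a$$ and $$b$$

But the node $$(a,b)$$ is already at distance 17/2 from $$a$$ and $$b$$

So the length of the branch between $$(a,b)$$ and the new node $$((a,b),e)$$ is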

$\frac{D_2((a,b),e)}{2} - \frac{D_1(a,b)}{2} =\frac{22-17}{2} = 2.5$

Notice that this distance is used in the drawing but not in the matrix

## And we update the distances

\begin{aligned} D_3(((a,b),e),c)= & \frac{D_2((a,b),c) + D_2(e,c)}{2}\\ = & \frac{25.5 + 39}{2}=32.25\\ D_3(((a,b),e),d)= & \frac{D_2((a,b),d) + D_2(e,d)}{2}\\ = & \frac{32.5 + 43}{2}=37.75 \end{aligned}

## We have a new matrix

The matrix $$D_3$$ is

|   | ((a,b),e) | c | d |
|---|---|---|---|
| ((a,b),e) | 0 | 32.25 | 37.75 |
| c | 32.25 | 0 | 28 |
| d | 37.75 | 28 | 0 |

Now the closest elements are $$c$$ and $$d$$

The distance from $$c$$ and $$d$$ to the new node $$(c,d)$$ is 28/2

No correction is necessary, since there are no nodes below them

## We calculate the distance matrix $$D_4$$

We calculate the only remaining distance

\begin{aligned} D_4((c,d),((a,b),e)) & = \frac{D_3(c,((a,b),e)) + D_3(d,((a,b),e))}{2}\\ & = \frac{32.25+37.75}{2} =35 \end{aligned}

The new matrix is

|   | ((a,b),e) | (c,d) |
|---|---|---|
| ((a,b),e) | 0 | 35 |
| (c,d) | 35 | 0 |

## Describing the tree

We can represent the complete tree by $(((a,b),e),(c,d))$

The parentheses show how to connect every element

But the branch lengths are still missing

We can write the distance to the parent after the node label

$(((a\colon D_a, b\colon D_b)\colon D_{ab},e\colon D_e)\colon D_{abe},(c\colon D_c,d\colon D_d)\colon D_{cd});$

## This is called Newick format

The resulting tree can be written (including labels of internal nodes) as $(((a\colon 8.5, b\colon 8.5)w\colon 2.5,e\colon 11)v\colon 6.5,(c\colon 14,d\colon 14)u\colon 3.5)r;$
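The whole procedure, from the matrix to the Newick string, can be sketched as follows. This is a minimal illustration under some assumptions: internal node labels ($$w, v, u, r$$) are left out, each cluster is identified by its own Newick text, and the function name `wpgma_newick` is made up for this example.

```python
def wpgma_newick(labels, rows):
    """Cluster with WPGMA and return the tree as a Newick string (no internal labels)."""
    height = {name: 0.0 for name in labels}           # distance from each node down to its leaves
    dist = {frozenset((labels[i], labels[j])): rows[i][j]
            for i in range(len(labels)) for j in range(i)}
    while len(height) > 1:
        pair = min(dist, key=dist.get)                # closest pair of clusters
        x, y = sorted(pair)
        h = dist[pair] / 2                            # the new node sits halfway
        # Branch length: position of the new node minus what the child already covers.
        new = f"({x}:{h - height[x]:g},{y}:{h - height[y]:g})"
        del height[x], height[y]
        updated = {p: v for p, v in dist.items() if x not in p and y not in p}
        for k in height:                              # plain average, as in WPGMA
            updated[frozenset((new, k))] = (dist[frozenset((x, k))] + dist[frozenset((y, k))]) / 2
        height[new] = h
        dist = updated
    return next(iter(height)) + ";"

labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
print(wpgma_newick(labels, rows))
# (((a:8.5,b:8.5):2.5,e:11):6.5,(c:14,d:14):3.5);
```

The subtraction `h - height[x]` is the correction described earlier: a cluster that already has nodes below it gets a shorter branch.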

## This is hierarchical clustering

This is called Weighted Pair Group Method with Arithmetic Mean (WPGMA)

There are other hierarchical clustering methods, depending on how we evaluate the distance between

• two single elements
• one element and a group
• two groups

## One problem

We mix groups of different size

## We mix groups of different size

Node ((a,b),e) has three sequences, and (c,d) has two

“Bigger nodes” should have more weight

# Method 2

## Alternative: UPGMA

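UPGMA stands for Unweighted Pair Group Method with Arithmetic Mean

The distance between branches $$A$$ and $$B$$, of sizes $$N_A$$ and $$N_B$$, is the average of all distances $$D(x,y)$$ between an object $$x$$ in $$A$$ and an object $$y$$ in $$B$$

When $$A$$ and $$B$$ are joined, the distance from the new branch to any other branch $$X$$ is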

$D((A,B),X) = \frac{N_A \cdot D(A,X) + N_B \cdot D(B,X)}{N_A + N_B}$

## Redoing our example

\begin{aligned} D_2((a,b),c)& =\frac{D_1(a,c) \times 1 + D_1(b,c) \times 1}{1+1}\\ & =\frac{21+30}{2}=25.5\\ D_2((a,b),d)& =\frac{D_1(a,d) + D_1(b,d)}{2}=\frac{31+34}{2}=32.5\\ D_2((a,b),e)& =\frac{D_1(a,e) + D_1(b,e)}{2}=\frac{23+21}{2}=22 \end{aligned}

The first step is the same as before

## Next step is different

\begin{aligned} D_3(((a,b),e),c)&=\frac{D_2((a,b),c) \times 2 + D_2(e,c) \times 1}{2+1}\\ & =\frac{25.5 \times 2 + 39 \times 1}{3}=30\\ D_3(((a,b),e),d)&=\frac{D_2((a,b),d) \times 2 + D_2(e,d) \times 1}{2+1}\\ & =\frac{32.5 \times 2 + 43 \times 1}{3}=36 \end{aligned}

## Matrix $$D_3$$ in UPGMA

|   | ((a,b),e) | c | d |
|---|---|---|---|
| ((a,b),e) | 0 | 30 | 36 |
| c | 30 | 0 | 28 |
| d | 36 | 28 | 0 |
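The size-weighted update is short enough to check by hand or in code. The sketch below uses the $$D_2$$ values from the example; the function name `upgma_distance` is made up for this illustration.

```python
def upgma_distance(d_ax, d_bx, n_a, n_b):
    """Distance from the merged branch (A, B) to X, weighting each side by its size."""
    return (n_a * d_ax + n_b * d_bx) / (n_a + n_b)

# (a,b) holds 2 elements and e holds 1.
d3_c = upgma_distance(25.5, 39, n_a=2, n_b=1)
d3_d = upgma_distance(32.5, 43, n_a=2, n_b=1)
print(d3_c, d3_d)  # 30.0 36.0

# With equal weights the same inputs give the WPGMA value instead.
print(upgma_distance(25.5, 39, n_a=1, n_b=1))  # 32.25
```

The only difference between the two methods is the pair of weights passed to the formula.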

## UPGMA is used much more than WPGMA

In practice UPGMA is more realistic than WPGMA

But both have a problem:

The distances between leaves do not match the original distances

Moreover, both methods assume the same mutation rate on every branch, and in reality the rate may be different for different branches

# Method 3

## Minimization problem

If we know the tree topology, we can find the branch lengths

We minimize the squared difference between the observed distance $$D(i,j)$$ and the tree distance $$d_{ij}$$

$\min_{d_{ij}} \sum_{i,j}(D(i,j)-d_{ij})^2$

But we still need to find the tree topology, and that is a hard problem.
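To make the objective concrete, this sketch evaluates the sum of squared differences for the WPGMA tree found earlier. The tree distances are obtained by adding the branch lengths of the Newick string along the path between two leaves. Counting each unordered pair once is an assumption about how the sum is meant.

```python
# Observed distances from the matrix D_1, one entry per unordered pair.
observed = {"ab": 17, "ac": 21, "ad": 31, "ae": 23, "bc": 30,
            "bd": 34, "be": 21, "cd": 28, "ce": 39, "de": 43}

# Leaf-to-leaf path lengths in the WPGMA tree,
# for example a to e is 8.5 + 2.5 + 11 = 22.
tree = {"ab": 17, "ae": 22, "be": 22, "cd": 28,
        "ac": 35, "ad": 35, "bc": 35, "bd": 35, "ce": 35, "de": 35}

score = sum((observed[p] - tree[p]) ** 2 for p in observed)
print(score)  # 320
```

A tree that matched every original distance would score 0, so this number measures how far the WPGMA tree is from the data.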

## Where the neighbor joining formula comes from

Let $$a$$ and $$b$$ be two siblings in a nice tree, $$c$$ their parent node, and $$e$$ any other leaf

\begin{aligned} D(a,b) =& D(a,c) + D(c,b)\qquad(\text{eq. }1)\\ D(a,e) =& D(a,c) + D(c,e)\qquad(\text{eq. }2)\\ D(b,e) =& D(b,c) + D(c,e)\qquad(\text{eq. }3)\\ \end{aligned}

## Result

$D(c,e) =\frac{D(a,e)+D(b,e)-D(a,b)}{2}$

So if we only know the distances between leaves $$a, b$$ and $$e,$$ and we add internal node $$c,$$ this is how we find the distance $$D(c,e)$$
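The result follows from adding eq. 2 and eq. 3 and subtracting eq. 1, using $$D(b,c) = D(c,b)$$:

\begin{aligned} D(a,e)+D(b,e)-D(a,b) =& \big(D(a,c) + D(c,e)\big) + \big(D(b,c) + D(c,e)\big) - \big(D(a,c) + D(c,b)\big)\\ =& 2 D(c,e) \end{aligned}

Dividing both sides by 2 gives the formula for $$D(c,e)$$.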

Neighbor Joining is trying to make a nice tree

## Neighbor joining

This is a heuristic to solve the minimization problem

Instead of joining the nearest nodes in the distance matrix, we look into a new matrix $$Q$$

$Q(i,j) = (n-2) D(i,j) -\sum_k D(i,k) -\sum_k D(k,j)$

This “neighbor-joining” distance can be negative

## Method

• Calculate $$R_i = \sum_j D(i,j)$$ for all $$i$$
• Calculate $$Q(i,j) = (n-2) D(i,j) - R_i - R_j$$
• Find the smallest $$Q(i,j)$$
• Join $$i$$ and $$j$$ into a new node $$u$$ \begin{aligned} D(i,u) &= \frac{(n-2) D(i,j) + R_i -R_j}{2(n-2)}\\ D(j,u) &= D(i,j) - D(i,u)\\ D(u,k) &= \frac{1}{2}(D(i,k) + D(j,k) - D(i,j)) \end{aligned}
• Repeat until only two nodes remain

We apply the method to the matrix $$D_1$$

|   | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | 17 | 21 | 31 | 23 |
| b | 17 | 0 | 30 | 34 | 21 |
| c | 21 | 30 | 0 | 28 | 39 |
| d | 31 | 34 | 28 | 0 | 43 |
| e | 23 | 21 | 39 | 43 | 0 |

## First we calculate $$Q_1$$

For each $$i,j∈ \{a,b,c,d,e\}, i≠j,$$ we have $Q(i,j) = (n-2) D(i,j) - R_i - R_j$

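|   | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | -143 | -147 | -135 | -149 |
| b | -143 | 0 | -130 | -136 | -165 |
| c | -147 | -130 | 0 | -170 | -127 |
| d | -135 | -136 | -170 | 0 | -133 |
| e | -149 | -165 | -127 | -133 | 0 |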

The smallest value is $$Q(c,d) = -170$$, so the elements to join are $$c$$ and $$d$$
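The matrix $$Q_1$$ can be checked with a short script. This is a sketch using plain lists; the variable names are chosen for this example.

```python
labels = ["a", "b", "c", "d", "e"]
D = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
n = len(labels)

# R_i is the sum of row i.
R = [sum(row) for row in D]

# Q(i,j) = (n-2) D(i,j) - R_i - R_j, with the diagonal left at 0.
Q = [[0 if i == j else (n - 2) * D[i][j] - R[i] - R[j] for j in range(n)]
     for i in range(n)]

# Pair (i < j) with the smallest Q value.
best = min(((i, j) for i in range(n) for j in range(i + 1, n)),
           key=lambda p: Q[p[0]][p[1]])

print(R)                                   # [92, 102, 118, 136, 126]
print(labels[best[0]], labels[best[1]])    # c d
```

Printing the rows of `Q` should reproduce the table above.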

## New node $$u$$ connects $$c$$ and $$d$$

\begin{aligned} D(c, u) =& \frac{D(c,d)}{2} + \frac{R_c -R_d}{2(5-2)}\\ =& \frac{28}{2} + \frac{118-136}{6} = 11\\ \end{aligned}

\begin{aligned} D(d, u) = & D(c,d) - D(c,u)\\ =& 28 - 11 = 17\\ \end{aligned}

## Matrix $$D_2$$

For each $$k∈ \{a,b,e\}$$ we have $D(u,k) = \frac{1}{2}(D(c,k) + D(d,k) - D(c,d))$

|   | a | b | u | e |
|---|---|---|---|---|
| a | 0 | 17 | 12 | 23 |
| b | 17 | 0 | 18 | 21 |
| u | 12 | 18 | 0 | 27 |
| e | 23 | 21 | 27 | 0 |
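The numbers for joining $$c$$ and $$d$$ can be verified directly from the formulas. The sketch below hard-codes the values needed from $$D_1$$; the variable names are made up for this example.

```python
n = 5
d_cd = 28                 # D(c,d)
r_c, r_d = 118, 136       # row sums R_c and R_d of D_1

# Branch lengths from c and d to the new node u.
d_cu = d_cd / 2 + (r_c - r_d) / (2 * (n - 2))
d_du = d_cd - d_cu

# Distances from c and d to the remaining elements.
to_c = {"a": 21, "b": 30, "e": 39}
to_d = {"a": 31, "b": 34, "e": 43}

# D(u,k) = (D(c,k) + D(d,k) - D(c,d)) / 2
d_u = {k: (to_c[k] + to_d[k] - d_cd) / 2 for k in to_c}

print(d_cu, d_du)   # 11.0 17.0
print(d_u)          # {'a': 12.0, 'b': 18.0, 'e': 27.0}
```

The values in `d_u` are the row and column of $$u$$ in the matrix $$D_2$$.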

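The matrix $$Q_2$$ is

|   | a | b | u | e |
|---|---|---|---|---|
| a | 0 | -74 | -85 | -77 |
| b | -74 | 0 | -77 | -85 |
| u | -85 | -77 | 0 | -74 |
| e | -77 | -85 | -74 | 0 |

The smallest value, -85, appears twice: for $$(a,u)$$ and for $$(b,e)$$. We choose $$(a,u)$$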

## New node $$v$$ connects $$a$$ and $$u$$

\begin{aligned} D(a, v) =& 4.75\\ D(u, v) =& 7.25\\ \end{aligned}

The matrix $$D_3$$ is

|   | v | b | e |
|---|---|---|---|
| v | 0 | 11.5 | 19 |
| b | 11.5 | 0 | 21 |
| e | 19 | 21 | 0 |

The matrix $$Q_3$$ is

|   | v | b | e |
|---|---|---|---|
| v | 0 | -51.5 | -51.5 |
| b | -51.5 | 0 | -51.5 |
| e | -51.5 | -51.5 | 0 |

## New node $$w$$ connects $$v$$ and $$b$$

\begin{aligned} D(v, w) = & 4.75\\ D(b, w) = & 6.75\\ \end{aligned}

## Matrix $$D_4$$

|   | w | e |
|---|---|---|
| w | 0 | 14.25 |
| e | 14.25 | 0 |

Thus $$D(w, e) = 14.25$$
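The complete loop from the Method slide can be sketched as below. This is an illustration under some assumptions: internal nodes are named `n1`, `n2`, `n3` instead of $$u, v, w$$, ties in $$Q$$ are broken by taking the first pair found, and the function name `neighbor_joining` is made up for this example.

```python
def neighbor_joining(labels, rows):
    """Return the tree as a list of (node, node, branch_length) edges."""
    nodes = list(labels)
    D = {(x, y): rows[i][j] for i, x in enumerate(labels) for j, y in enumerate(labels)}
    edges = []
    count = 0
    while len(nodes) > 2:
        n = len(nodes)
        R = {x: sum(D[x, k] for k in nodes) for x in nodes}
        pairs = [(x, y) for a, x in enumerate(nodes) for y in nodes[a + 1:]]
        # Pair with the smallest Q(i,j) = (n-2) D(i,j) - R_i - R_j.
        i, j = min(pairs, key=lambda p: (n - 2) * D[p] - R[p[0]] - R[p[1]])
        count += 1
        u = f"n{count}"                       # hypothetical name for the new node
        d_iu = D[i, j] / 2 + (R[i] - R[j]) / (2 * (n - 2))
        edges += [(i, u, d_iu), (j, u, D[i, j] - d_iu)]
        nodes = [k for k in nodes if k not in (i, j)]
        for k in nodes:                       # distances from u to everything left
            D[u, k] = D[k, u] = (D[i, k] + D[j, k] - D[i, j]) / 2
        D[u, u] = 0
        nodes.append(u)
    x, y = nodes                              # connect the last two nodes
    edges.append((x, y, D[x, y]))
    return edges

labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
for edge in neighbor_joining(labels, rows):
    print(edge)
```

Here `n1`, `n2` and `n3` play the roles of $$u$$, $$v$$ and $$w$$. In the last step all three values of $$Q_3$$ are equal, so this sketch joins $$b$$ and $$e$$ rather than $$v$$ and $$b$$; with only three nodes left, either choice gives the same three branch lengths (4.75, 6.75 and 14.25).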

## Exercise

Redo all the calculations for these trees