**Source:** Uçarlı, Cüneyt, Liam J. McGuffin, Süleyman
Çaputlu, Andres Aravena, and Filiz Gürel. *“Genetic Diversity at the
Dhn3 Locus in Turkish_Hordeum Spontaneum_Populations with Comparative
Structural Analyses.”* Scientific Reports (2016) https://doi.org/10.1038/srep20966.

To study closely related genes or proteins

- to find conserved domains

To find the evolutionary relationships

- the base of phylogenetic trees

To identify shared patterns among related genes

- conserved sequences can be binding sites

We discussed *pairwise alignment*

Global, using

*Needleman & Wunsch*algorithm- also for semi-global alignment

Local, using

*Smith & Waterman*algorithm

We build a matrix with \(m_1\) rows and \(m_2\) columns

We write the sequence \(s_1\) in the
rows,

and the sequence \(s_2\) in the
columns

The computational cost is \(O(m_1 m_2)\)

(\(m_1\) is the length of \(s_1\), \(m_2\) is the length of \(s_2\))

To find the optimal alignment, we look for diagonals in the matrix that maximize the total score

Every cell \(M_{ij}\) in the matrix
has initially the value \[M_{ij}=\text{Score}_2(s_{1}[i],s_{2}[j])\]
where \(s_{1}[i]\) is the letter in
position \(i\) of sequence \(s_1\),

and \(s_{2}[j]\) is the letter in
position \(j\) of \(s_2\)

To aligning two sequences, we build a dot-plot matrix.

That is, a rectangle.

To align three sequences, we need a three-dimensional array.

That is, a cube.

Each cell \(M_{ijk}\) has value \[M_{ijk}=\text{Score}_3(s_{1}[i],s_{2}[j], s_{3}[k])\]

Usually, external gaps do not count, but internal gaps count

That is, these are semi-global alignments

Any path from a border to another border will be an alignment

We look for the *optimal* alignment

If the three sequences have length \(m_1, m_2,\) and \(m_3,\) then building the cube has cost \[O(m_1\cdot m_2\cdot m_3)\]

To simplify, we assume that all sequences have length \(m\)

Then the cost of three-wise alignment is \[O(m^3)\]

Following the same idea…

To align \(N\) sequences, we need a dot-plot in \(N\) dimensions

\[M_{i_1,\ldots,i_N}=\text{Score}_N(s_{1}[i_1],s_{2}[i_2],…,s_{N}[i_N])\]

Therefore, if the average sequence length is \(m,\) then the cost is \[O(m^N)\]

To fix ideas, assume that \(m=1000\)

(That is a typical size for a bacterial gene)

The computational cost is \(O(1000^N)\)

In other words, the cost is \(O(10^{3N})\)

Now assume that the computer can do one million comparisons each second

The number of seconds is then \[O(10^{3N-6})\]

Exercise: How many seconds will it take for 2, 4, 8, and 12 sequences?

Under these hypothesis we have this table

\(N\) | Seconds | In words |
---|---|---|

2 | \(10^0\) | 1 sec |

4 | \(10^6\) | 1 million seconds |

8 | \(10^{18}\) | 1 trillion/quintillion seconds |

12 | \(10^{30}\) | a lot of time |

Translate these numbers to days, years, etc.

(Approximate answer are OK. We only need one significant figure)

How do these numbers change if \(m\) changes?

What happens if the computers are 1000 times faster?

What is the largest multiple alignment that you can do in your life?

What is the largest number of sequences that can be aligned?

What can we do to align more sequences?

This is clearly too expensive, so we need heuristics

(i.e. solving a similar but simpler problem)

One common idea is to do a *progressive alignment*

- start with a pairwise alignment,
- and then align the rest one by one

There are several ways to simplify the original problem

Thus, there are many *approximate* solutions

The main differences are:

- How to decide what to align first?
- Can new sequences change the previous alignment?
- Can \(s_k\) change the alignment of \((s_1,…,s_{k-1})\)?

- How to use additional information (like 3D structure)?
- What is the formula for \(\text{Score}_k(s_{1}[i_1],…,s_{k}[i_k])\)?

- Clustal
- ClustalW, ClustalX, Clustal Omega

- T-Coffee
- 3D-Coffee

- MUSCLE
- MAFFT

Clustal was the first popular *multiple sequence aligner*

Versions: Clustal 1, Clustal 2, Clustal 3, Clustal 4, Clustal V, Clustal W, Clustal X, Clustal Ω

Only the last one is used today

First, there should be a scoring function

Clustal uses a simple one. The sum of all v/s all

\[ \begin{aligned} \text{Score}_k(s_{1}[i_1],…,s_{k}[i_k])= & \text{Score}_2(s_{1}[i_1],s_{2}[i_2]) + \\ & \text{Score}_2(s_{1}[i_1],s_{3}[i_3]) + \cdots+ \\ & \text{Score}_2(s_{k-1}[i_{k-1}],s_{k}[i_k]) \end{aligned} \]

where \(\text{Score}_2(s_{a}[i],s_{b}[j])\) is PAM, BLOSUM, or a similar substitution scoring matrix

We start by comparing sequences all-to-all

That is, comparing all pairs of sequences

We store them in a *distance matrix*

How many pairs can be done with \(N\) sequences?

Once we get all pairwise “distances” (that is, scores)

We make a tree by hierarchical clustering or neighbor joining

The *guide tree* is built without seeing the big picture

So it is not safe to assign any meaning to it

We will talk more about trees and build phylogenetic trees later

Clustal aligns the sequences following the guide tree

First, it aligns the more similar sequences

Then it adds the nearest sequence, and so on

These are semi-global alignments

Uses \(\text{Score}_k()\) when there are \(k\) sequences