# Bioinformatics

## Multiple Sequence Alignment

Source: Uçarlı, Cüneyt, Liam J. McGuffin, Süleyman Çaputlu, Andres Aravena, and Filiz Gürel. “Genetic Diversity at the Dhn3 Locus in Turkish Hordeum Spontaneum Populations with Comparative Structural Analyses.” Scientific Reports (2016) https://doi.org/10.1038/srep20966.

## Why Multiple Sequence Alignment?

• To study closely related genes or proteins
  • to find conserved domains
• To find evolutionary relationships
  • the basis of phylogenetic trees
• To identify shared patterns among related genes
  • conserved sequences can be binding sites

## In the previous chapter…

We discussed pairwise alignment:

• Global, using the Needleman-Wunsch algorithm
  • also used for semi-global alignment
• Local, using the Smith-Waterman algorithm

## Both methods build a dot-plot matrix

We build a matrix with $$m_1$$ rows and $$m_2$$ columns

We write the sequence $$s_1$$ in the rows,
and the sequence $$s_2$$ in the columns

The computational cost is $$O(m_1 m_2)$$

($$m_1$$ is the length of $$s_1$$, $$m_2$$ is the length of $$s_2$$)

## Filling the dot-plot matrix

To find the optimal alignment, we look for diagonals in the matrix that maximize the total score

Every cell $$M_{ij}$$ in the matrix initially has the value $M_{ij}=\text{Score}_2(s_{1}[i],s_{2}[j])$ where $$s_{1}[i]$$ is the letter in position $$i$$ of sequence $$s_1$$,
and $$s_{2}[j]$$ is the letter in position $$j$$ of sequence $$s_2$$
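
As a minimal sketch of this initialization, the code below fills the matrix for two short sequences. The scoring rule (+1 for identical letters, -1 otherwise) and the names `score2` and `initial_matrix` are assumptions made for illustration; the slides do not fix a particular $$\text{Score}_2$$. Python indices start at 0, so `M[0][0]` corresponds to the first letters of both sequences.

```python
def score2(a, b, match=1, mismatch=-1):
    """Hypothetical Score_2: +1 for identical letters, -1 otherwise."""
    return match if a == b else mismatch


def initial_matrix(s1, s2):
    """Return M with M[i][j] = Score_2(s1[i], s2[j]).

    s1 indexes the rows and s2 indexes the columns, so the matrix
    has len(s1) rows and len(s2) columns.
    """
    return [[score2(a, b) for b in s2] for a in s1]


M = initial_matrix("GATTACA", "GCATGCU")
for row in M:
    print(row)
```

The two nested loops make the $$O(m_1 m_2)$$ cost visible: one score is computed for every pair of positions.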

## Pairwise to three-wise alignment

To align two sequences, we build a dot-plot matrix.
That is, a rectangle.

To align three sequences, we need a three-dimensional array.
That is, a cube.

Each cell $$M_{ijk}$$ has value $M_{ijk}=\text{Score}_3(s_{1}[i],s_{2}[j], s_{3}[k])$
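
The cube can be sketched in the same way as the rectangle. Here $$\text{Score}_3$$ is taken to be the sum of the three pairwise scores (a "sum-of-pairs" rule); this choice, the +1/-1 pairwise score, and the names `score2`, `score3` and `initial_cube` are all assumptions for illustration, since the slides leave the formula open.

```python
from itertools import combinations


def score2(a, b, match=1, mismatch=-1):
    """Hypothetical pairwise score: +1 for identical letters, -1 otherwise."""
    return match if a == b else mismatch


def score3(a, b, c):
    """Assumed Score_3: sum of the scores of the three letter pairs."""
    return sum(score2(x, y) for x, y in combinations((a, b, c), 2))


def initial_cube(s1, s2, s3):
    """Return M with M[i][j][k] = Score_3(s1[i], s2[j], s3[k])."""
    return [[[score3(a, b, c) for c in s3] for b in s2] for a in s1]


cube = initial_cube("GAT", "GCTA", "GATTA")
n_cells = len(cube) * len(cube[0]) * len(cube[0][0])
print(n_cells)  # number of cells is m1 * m2 * m3
```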

## Then we find the diagonals

Usually, external gaps do not count, but internal gaps count

That is, these are semi-global alignments

Any path from a border to another border will be an alignment

We look for the optimal alignment
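
The rule that external gaps are free while internal gaps are penalized can be illustrated in two dimensions, which keeps the example short. The sketch below uses a standard dynamic-programming recurrence with free end gaps. The function name `semiglobal_score` and the scores (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, not values taken from the slides.

```python
def semiglobal_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Best score of a semi-global alignment of s1 and s2.

    Leading and trailing (external) gaps cost nothing;
    gaps inside the aligned region cost `gap` each.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    # Row 0 and column 0 stay at 0: a path may start anywhere
    # on the top or left border without paying for leading gaps.
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            up = F[i - 1][j] + gap      # gap in s2 (internal)
            left = F[i][j - 1] + gap    # gap in s1 (internal)
            F[i][j] = max(diag, up, left)
    # A path may end anywhere on the bottom or right border,
    # so trailing gaps are free as well.
    last_row = max(F[-1])
    last_col = max(row[-1] for row in F)
    return max(last_row, last_col)


print(semiglobal_score("TTACGT", "ACGTAA"))
```

In this example the shared segment `ACGT` is aligned, and the unmatched ends of both sequences are not penalized. The three-sequence case follows the same idea with paths through the cube, from one face to another.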

## Cost of three-wise alignment

If the three sequences have length $$m_1, m_2,$$ and $$m_3,$$ then building the cube has cost $O(m_1\cdot m_2\cdot m_3)$

## Always look for the big picture

To simplify, we assume that all sequences have length $$m$$

Then the cost of three-wise alignment is $O(m^3)$

## Multiple alignment

Following the same idea…

To align $$N$$ sequences, we need a dot-plot in $$N$$ dimensions

$M_{i_1,\ldots,i_N}=\text{Score}_N(s_{1}[i_1],s_{2}[i_2],\ldots,s_{N}[i_N])$

Therefore, if the average sequence length is $$m,$$ then the cost is $O(m^N)$

## How much is that?

To fix ideas, assume that $$m=1000$$

(That is a typical size for a bacterial gene)

The computational cost is $$O(1000^N)$$

In other words, the cost is $$O(10^{3N})$$

## How many seconds

Now assume that the computer can do one million comparisons each second

The number of seconds is then $O(10^{3N-6})$

Exercise: How many seconds will it take for 2, 4, 8, and 12 sequences?

## How much is that?

Under these hypotheses, we get the following table

| $$N$$ | Seconds | In words |
|---|---|---|
| 2 | $$10^0$$ | 1 second |
| 4 | $$10^6$$ | 1 million seconds |
| 8 | $$10^{18}$$ | 1 quintillion seconds (1 trillion on the long scale) |
| 12 | $$10^{30}$$ | a lot of time |
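
The values in the table can be reproduced with a few lines of code. This is a sketch under the same hypotheses (all sequences of length 1000, one million comparisons per second); the function name `seconds_needed` and its parameter names are made up for illustration.

```python
def seconds_needed(n_sequences, length=1000, comparisons_per_second=10**6):
    """Cells in the N-dimensional matrix divided by the comparison rate.

    Integer division is used, so the result is rounded down
    to a whole number of seconds.
    """
    return length ** n_sequences // comparisons_per_second


for n in (2, 4, 8, 12):
    print(n, seconds_needed(n))
```

Changing `length` or `comparisons_per_second` shows how sensitive the result is to each hypothesis.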

## Exercise 1

Translate these numbers to days, years, etc.

(Approximate answers are OK. We only need one significant figure.)

## Exercise 2

How do these numbers change if $$m$$ changes?

## Exercise 3

What happens if the computers are 1000 times faster?

## Exercise 4

What is the largest multiple alignment that you can do in your life?

## Exercise 5

What is the largest number of sequences that can be aligned?

## Exercise 6

What can we do to align more sequences?

## Heuristic

This is clearly too expensive, so we need heuristics

(i.e. solving a similar but simpler problem)

One common idea is to do a progressive alignment:

• first align two sequences
• and then align the rest one by one
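
A toy version of the progressive idea is sketched below. It is not the method of any particular tool. It assumes that sequences are added in the order given (real tools such as Clustal choose the order with a guide tree), that a column is scored by summing pairwise scores, and that every gap step costs `gap` times the number of rows already aligned. The names `align_to_profile` and `progressive_align` and all score values are hypothetical.

```python
def align_to_profile(profile, seq, match=1, mismatch=-1, gap=-2):
    """Globally align `seq` to `profile`, a list of equal-length aligned rows."""
    width, n = len(profile[0]), len(seq)
    g = gap * len(profile)  # assumed cost of one gap step

    def col_score(j, ch):
        # Sum of pairwise scores between ch and column j of the profile.
        total = 0
        for row in profile:
            if row[j] == "-":
                total += gap
            else:
                total += match if row[j] == ch else mismatch
        return total

    # F[i][j]: best score for the first i columns and the first j letters.
    F = [[0] * (n + 1) for _ in range(width + 1)]
    for i in range(1, width + 1):
        F[i][0] = i * g
    for j in range(1, n + 1):
        F[0][j] = j * g
    for i in range(1, width + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + col_score(i - 1, seq[j - 1]),
                          F[i - 1][j] + g,
                          F[i][j - 1] + g)

    # Traceback: each step is (profile column or None, letter of seq or '-').
    steps, i, j = [], width, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + col_score(i - 1, seq[j - 1]):
            steps.append((i - 1, seq[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + g:
            steps.append((i - 1, "-"))
            i -= 1
        else:
            steps.append((None, seq[j - 1]))
            j -= 1
    steps.reverse()

    # Old rows only receive new gap columns; their letters never move.
    rows = ["".join("-" if c is None else row[c] for c, _ in steps) for row in profile]
    rows.append("".join(ch for _, ch in steps))
    return rows


def progressive_align(seqs):
    """Align the first sequence with the second, then add the rest one by one."""
    if not seqs:
        return []
    profile = [seqs[0]]
    for seq in seqs[1:]:
        profile = align_to_profile(profile, seq)
    return profile


for row in progressive_align(["GATTACA", "GCATGCT", "GATACA"]):
    print(row)
```

Each step is a two-dimensional alignment, so the cost grows roughly with the number of sequences times $$m^2$$ instead of $$m^N$$. The price is that earlier decisions are frozen: in this sketch a new sequence can only insert gap columns into the previous alignment.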

## There are too many heuristics

There are several ways to simplify the original problem

Thus, there are many approximate solutions

The main differences are:

• How to decide what to align first?
• Can new sequences change the previous alignment?
  • Can $$s_k$$ change the alignment of $$(s_1,\ldots,s_{k-1})$$?
• How to use additional information (like 3D structure)?
• What is the formula for $$\text{Score}_k(s_{1}[i_1],\ldots,s_{k}[i_k])$$?

## Some common Multiple-Alignment tools

• Clustal
  • ClustalW, ClustalX, Clustal Omega
• T-Coffee
  • 3D-Coffee
• MUSCLE
• MAFFT