December 11, 2018

## Multiple alignment

• More than two sequences at the same time
• semi-global alignment
• Instead of filling a two-dimensional matrix, we would have to fill an n-dimensional array
• Cost $$L^n$$ in memory and time
• for $$n$$ sequences of length $$L$$
• Infeasible in practice: the cost grows exponentially with the number of sequences
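
To get a feeling for how fast $$L^n$$ grows, here is a minimal sketch that only counts the cells of the n-dimensional array. The function name `dp_cells` and the example length of 100 are assumptions chosen for illustration.

```python
def dp_cells(n_sequences, length):
    """Number of cells to fill for n_sequences sequences of the given length (L^n)."""
    return length ** n_sequences


if __name__ == "__main__":
    # Short sequences of length 100, for a growing number of sequences
    for n in (2, 3, 5, 10):
        print(f"n = {n:2d}: {dp_cells(n, 100):,} cells")
```

Each additional sequence multiplies the number of cells by $$L$$, so even a handful of short sequences is out of reach.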

## The light is better here

Nasrettin Hoca had lost his ring in the living room. He searched for it for a while, but since he could not find it, he went out into the yard and began to look there. His wife, who saw what he was doing, asked: “Hocam, you lost your ring in the room, why are you looking for it in the yard?”

Nasrettin Hoca stroked his beard and said: “The room is too dark and I can’t see very well. I came out to the courtyard to look for my ring because there is much more light out here.”

Algorithm: gives a precise answer to a question

Heuristic: gives a fast and approximate answer

A heuristic is an algorithm for a simpler question that is related to the original one

We hope that the approximate answer will be close to the real one

## Example: BLAST is a heuristic

• The exact algorithm for local alignment is Smith-Waterman
• “filling the matrix with positive numbers and finding diagonals”
• BLAST and other index-based methods are heuristics that solve a simpler problem
• BLAST may miss some alignments: false negatives
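
The idea of an index-based heuristic can be sketched as follows. This is not BLAST's actual implementation, only an illustration of the simpler question "which exact words of length k do the two sequences share?". The function names `build_index` and `find_seeds`, the word length `k = 4`, and the example sequences are all assumptions.

```python
from collections import defaultdict


def build_index(subject, k):
    """Map every word of length k in the subject to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def find_seeds(query, index, k):
    """Return (query_position, subject_position) pairs that share an exact word."""
    seeds = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            seeds.append((j, i))
    return seeds


if __name__ == "__main__":
    k = 4
    index = build_index("ACGTACGT", k)
    # An exact copy of part of the subject is found quickly
    print(find_seeds("TACGT", index, k))
    # 6 of 8 letters are identical, but no word of length 4 is shared
    print(find_seeds("ACTTACTT", index, k))
```

The second query is similar to the subject, yet it produces no seed at all. A method that only extends seeds would never look at this pair, which is how a false negative arises.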

## Neighbor joining

### Heuristic for multiple alignments

• Build a “guide tree” to organize the alignment
• The tree is built based on the edit distance between every pair of sequences
• i.e. we need to calculate $$n(n-1)/2 \approx n^2/2$$ distances
• this is an $$n\times n$$ symmetric matrix
• Find the “closest neighbors” and “join” them
• Record which sequences are being joined
• Create a new matrix where the two rows (and columns) are replaced by a single one
• new matrix is $$(n-1)\times (n-1)$$
• Repeat $$n-1$$ times
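
The steps above can be sketched in a few lines. This is a simplified illustration that follows the bullets literally: it joins the pair with the smallest distance and replaces the two rows by their plain average. The full neighbor-joining method uses a corrected distance criterion to choose the pair, which is left out here. The names `edit_distance` and `build_guide_tree`, and the averaging rule for the merged row, are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_guide_tree(sequences):
    """Return (tree, joins); leaves are indices into `sequences`."""
    trees = list(range(len(sequences)))
    # Full n x n distance matrix
    d = [[edit_distance(a, b) for b in sequences] for a in sequences]
    joins = []
    while len(trees) > 1:
        m = len(trees)
        # Find the closest pair of current nodes
        i, j = min(((p, q) for p in range(m) for q in range(p + 1, m)),
                   key=lambda pq: d[pq[0]][pq[1]])
        joins.append((trees[i], trees[j]))  # record what is being joined
        keep = [k for k in range(m) if k not in (i, j)]
        # Assumption: distance to the merged node is the average of the two rows
        merged_row = [(d[i][k] + d[j][k]) / 2 for k in keep]
        # New (m-1) x (m-1) matrix: remaining rows plus one merged row and column
        d = [[d[a][b] for b in keep] for a in keep]
        for row, value in zip(d, merged_row):
            row.append(value)
        d.append(merged_row + [0.0])
        trees = [trees[k] for k in keep] + [(trees[i], trees[j])]
    return trees[0], joins


if __name__ == "__main__":
    tree, joins = build_guide_tree(["ACGT", "ACGA", "TTTT"])
    print(tree)
    print(joins)
```

The list `joins` records the order in which nodes were merged, which is the order the alignment will later follow.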

## Building the alignment

Once we have the tree, we use it as a guide for the alignment

• First align the “closest neighbors”, putting gaps if necessary
• Then align the other sequences to this alignment, following the order of the tree
• The score at each position is the average of the scores against all the sequences already aligned at that position
• The aligned sequences can be represented by a “frequency matrix”
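
A frequency matrix and the averaged score can be sketched as below. The function names `frequency_matrix` and `column_score`, the alphabet `"ACGT-"`, and the match/mismatch scores of +1 and -1 (with a gap counted as a mismatch) are assumptions made for the example.

```python
from collections import Counter


def frequency_matrix(aligned, alphabet="ACGT-"):
    """One dict per alignment column: symbol -> fraction of sequences with it."""
    n = len(aligned)
    columns = []
    for column in zip(*aligned):  # iterate over alignment columns
        counts = Counter(column)
        columns.append({s: counts[s] / n for s in alphabet})
    return columns


def column_score(symbol, frequencies, match=1, mismatch=-1):
    """Average score of `symbol` against every sequence in one column."""
    return sum(freq * (match if s == symbol else mismatch)
               for s, freq in frequencies.items())


if __name__ == "__main__":
    fm = frequency_matrix(["ACG-", "ACGT"])
    print(fm[3])                     # last column: half gap, half T
    print(column_score("T", fm[3]))  # one match and one mismatch, averaged
```

Because the frequencies in a column add up to 1, weighting each score by its frequency is the same as averaging the scores over all the sequences in that column.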