December 11, 2018

Multiple alignment

  • More than two sequences at the same time
  • Semi-global alignment
  • Instead of filling a two-dimensional matrix, we would have to fill an n-dimensional array
  • Cost \(L^n\) in memory and time
    • \(n\) sequences of length \(L\)
  • Infeasible for any practical case: the cost grows exponentially with \(n\)
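To get a feeling for the \(L^n\) cost, the short sketch below counts the cells of the n-dimensional array for a few values of \(n\). The function name `dp_cells` and the choice \(L = 100\) are only illustrative assumptions, not part of the original slides.

```python
def dp_cells(n_sequences, length):
    """Cells in the n-dimensional array for n sequences of the same length L."""
    return length ** n_sequences


# Illustrative values only: sequences of length 100.
for n in (2, 3, 5, 10):
    print(f"n = {n:2d}  cells = {dp_cells(n, 100):.1e}")
```

Even with only one byte per cell, the case \(n = 10\) would need far more memory than any computer has.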

Heuristics

The light is better here

Nasrettin Hoca had lost his ring in the living room. He searched for it for a while, but since he could not find it, he went out into the yard and began to look there. His wife, who saw what he was doing, asked: “Hocam, you lost your ring in the room, why are you looking for it in the yard?”

Nasrettin Hoca stroked his beard and said: “The room is too dark and I can’t see very well. I came out to the courtyard to look for my ring because there is much more light out here.”

Heuristic: approximate answer

Algorithm: gives a precise answer to a question

Heuristic: gives a fast, approximate answer

A heuristic is an algorithm for a simpler question that is related to the original one

We hope that the approximate answer will be close to the real one

Example: BLAST is a heuristic

  • The local alignment algorithm is Smith-Waterman
    • “filling the matrix with positive numbers and finding diagonals”
  • BLAST and other index-based methods are heuristics that solve a simpler problem
  • BLAST may miss some alignments: false negatives
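The difference between the exact method and a seed-based heuristic can be sketched as follows. `smith_waterman_score` fills the matrix as described above. `shares_seed` is a deliberately simplified stand-in for an index-based filter (it only asks whether both sequences contain an identical word of length `k`); it is not BLAST's actual implementation. The scoring values and the example sequences are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score; cells never go below zero."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


def shares_seed(a, b, k):
    """Simplified seed filter: do a and b share an exact word of length k?"""
    words = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[j:j + k] in words for j in range(len(b) - k + 1))


x, y = "ACGTACGT", "ACGAACGT"      # hypothetical sequences, one mismatch
print(smith_waterman_score(x, y))  # the exact method finds a good alignment
print(shares_seed(x, y, k=5))      # no shared 5-letter word: the filter skips this pair
print(shares_seed(x, y, k=4))      # a shorter word finds it, at the cost of more work
```

With `k=5` the filter never looks at this pair, although the exact algorithm gives it a high score. This is a false negative.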

Neighbor joining

Heuristic for multiple alignments

  • Build a “guide tree” to organize the alignment
  • The tree is built based on the edit distance between all pairs of sequences
    • i.e. we need to calculate \(n(n-1)/2 \approx n^2/2\) distances
    • they fill a symmetric \(n\times n\) matrix
  • Find the “closest neighbors” and “join” them
    • Record which sequences are being joined
  • Create a new matrix where the two rows (and columns) are replaced by a single one
    • new matrix is \((n-1)\times (n-1)\)
  • Repeat \(n-1\) times
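The steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it joins the pair with the smallest distance and replaces the two rows by their plain average. The full neighbor-joining criterion also corrects each distance by the total distance of every sequence to all the others, and that correction is not implemented here. The names `edit_distance` and `build_guide_tree` and the four example sequences are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, keeping only one row of the matrix at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def build_guide_tree(seqs):
    """Return (list of joins, tree label) for a dict name -> sequence."""
    labels = list(seqs)
    # All pairwise distances, stored once per unordered pair.
    dist = {frozenset((x, y)): float(edit_distance(seqs[x], seqs[y]))
            for x, y in combinations(labels, 2)}
    joins = []
    while len(labels) > 1:
        pair = min(dist, key=dist.get)      # closest neighbors
        x, y = sorted(pair)
        new = f"({x},{y})"
        joins.append((x, y))                # record what was joined
        others = [z for z in labels if z not in pair]
        for z in others:                    # simplification: plain average of both rows
            dist[frozenset((new, z))] = (dist[frozenset((x, z))] +
                                         dist[frozenset((y, z))]) / 2
        # Drop the two old rows and columns; the matrix shrinks by one.
        dist = {k: v for k, v in dist.items() if x not in k and y not in k}
        labels = others + [new]
    return joins, labels[0]


example = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTACCA", "D": "TTGTTCCC"}
joins, tree = build_guide_tree(example)
print(joins)
print(tree)
```

For \(n\) sequences the loop runs \(n-1\) times, and the recorded joins give the order in which the sequences will be aligned.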

Building the alignment

Once we have the tree, we use it as a guide for the alignment

  • First align the “closest neighbors”, putting gaps if necessary
  • Then align the other sequences to this alignment, following the order given by the tree
    • Score of each position is the average of all the scores in that position
  • The aligned sequences can be represented by a “frequency matrix”
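A frequency matrix can be computed directly from the aligned sequences, as in the sketch below. It also shows one possible reading of "the average of all the scores in that position": the score of a new symbol against a column is the average of its score against every sequence in that column. The alphabet `"ACGT-"`, the match and mismatch values, the function names and the example alignment are all assumptions made for illustration.

```python
from collections import Counter


def frequency_matrix(aligned, alphabet="ACGT-"):
    """Fraction of each symbol in each column of a gapped alignment."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must all have the same length")
    matrix = {sym: [0.0] * length for sym in alphabet}
    for col in range(length):
        counts = Counter(s[col] for s in aligned)
        for sym in alphabet:
            matrix[sym][col] = counts[sym] / len(aligned)
    return matrix


def column_score(matrix, col, symbol, match=1.0, mismatch=-1.0):
    """Average score of `symbol` against all the symbols seen in column `col`."""
    return sum(freq[col] * (match if sym == symbol else mismatch)
               for sym, freq in matrix.items())


alignment = ["ACG-", "ACGT", "AGGT", "A-GT"]  # hypothetical aligned sequences
fm = frequency_matrix(alignment)
for sym, row in fm.items():
    print(sym, row)
print(column_score(fm, 1, "C"))
```

Working with the frequency matrix means that a new sequence is aligned against one summary of the group instead of against every member separately.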