Wikipedia
ABRACADABRA
CHUPACABRA
We have two alignments
ABRACADABRA ABRACADABRA
 
CHUPACABRA CHUPACABRA
External gaps do not count
For local alignment we have two cases
Global and semiglobal alignments always have gaps
“Distance” means that smaller numbers are better
If the distance is 0, then the two sequences are identical
If the distance is small, the two sequences are similar
If the distance is big, the sequences are different
Since external gaps do not count, we can always use them
In this case the smallest distance is always 0
The smallest cost is achieved by aligning a single matching letter
ABRACADABRA

CHUPACABRA
We cannot find local alignments using small values
(it is not a minimization problem)
Instead of finding a minimum, we look for a maximum
The philosophy is the same, but we look for big numbers instead of small numbers
Initially we placed 1 or 0 on each cell depending on match or mismatch
Now we put positive or negative numbers, depending on single-letter scores
\[D_{i,j} = \text{Score}(q_i,s_j)\]
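As a minimal sketch of how this matrix can be filled, the Python code below uses a hypothetical `score` function with +1 for a match and -1 for a mismatch. These values are assumptions for illustration only; real scoring schemes use other numbers.

```python
def score(a, b, match=1, mismatch=-1):
    """Single-letter score: positive for a match, negative for a mismatch.
    The values +1 and -1 are illustrative assumptions."""
    return match if a == b else mismatch


def score_matrix(q, s, match=1, mismatch=-1):
    """Build D, where D[i][j] = Score(q[i], s[j]) (0-based indices)."""
    return [[score(a, b, match, mismatch) for b in s] for a in q]


D = score_matrix("ABRACADABRA", "CHUPACABRA")
print(len(D), len(D[0]))  # 11 10: one row per query letter, one column per subject letter
print(D[3][4])            # 1: both letters are 'A'
print(D[0][0])            # -1: 'A' against 'C'
```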
Once the “dot plot matrix” is complete, it is easy to find the optimal score
Global alignment: find the largest sum from corner to corner
Semiglobal alignment: find the largest sum from side to side
Local alignment: find the largest sum in any diagonal
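The local case without gaps can be sketched as follows: walk every diagonal and keep the contiguous run with the largest sum. The function name `best_diagonal_segment` and the scores +1 and -1 are assumptions for illustration, not part of the course material.

```python
def best_diagonal_segment(q, s, match=1, mismatch=-1):
    """Return (best_sum, i_start, j_start, length) of the best gap-free run.
    Scores of +1 / -1 are illustrative assumptions."""
    best = (0, 0, 0, 0)
    n, m = len(q), len(s)
    # Each diagonal is identified by its offset d = j - i
    for d in range(-(n - 1), m):
        i = max(0, -d)
        j = i + d
        run, start = 0, (i, j)
        while i < n and j < m:
            if run <= 0:
                # A non-positive prefix never helps: restart the run here
                run, start = 0, (i, j)
            run += match if q[i] == s[j] else mismatch
            if run > best[0]:
                best = (run, start[0], start[1], i - start[0] + 1)
            i += 1
            j += 1
    return best


q, s = "ABRACADABRA", "CHUPACABRA"
total, i, j, length = best_diagonal_segment(q, s)
print(total)              # 4
print(q[i:i + length])    # ABRA
print(s[j:j + length])    # ABRA
```

Restarting the run whenever the partial sum drops to zero or below is what makes this a maximization over any segment of the diagonal, not over the whole diagonal.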
We have the same three options as before, but we never allow negative values (a fourth option, 0, restarts the alignment)
\[M_{i,j} = \max\begin{cases} M_{i-1,j-1} + \text{Score}(q_i,s_j)\\ M_{i-1,j} + \text{gap} \\ M_{i,j-1} + \text{gap}\\ 0\end{cases}\]
Notice we use \(\max\) instead of \(\min\)
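A minimal sketch of this recurrence in Python, assuming linear gaps and illustrative values (match +1, mismatch -1, gap -2). The function name `local_alignment_score` is hypothetical. It returns only the best score, not the alignment itself.

```python
def local_alignment_score(q, s, match=1, mismatch=-1, gap=-2):
    """Best local alignment score with linear gaps.
    The score values are illustrative assumptions; gap is negative
    because we are maximizing."""
    n, m = len(q), len(s)
    # Row 0 and column 0 stay at 0: a local alignment can start anywhere
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if q[i - 1] == s[j - 1] else mismatch
            M[i][j] = max(
                M[i - 1][j - 1] + pair,  # align q_i with s_j
                M[i - 1][j] + gap,       # gap in the subject
                M[i][j - 1] + gap,       # gap in the query
                0,                       # start a new alignment here
            )
            best = max(best, M[i][j])
    return best


print(local_alignment_score("ABRACADABRA", "CHUPACABRA"))  # 4
```

With these values the best local alignment is the shared word ABRA. Opening gaps to join it with other matching letters costs more than it gains.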
We prefer this alignment
GGGTAACCTACCTC
  
GGGCAACCTGCCTC
instead of this other alignment
GGGTAACCTACCTC
  
GGGCAACCTGCCTC
Thus, the gap penalty must be greater than the mismatch penalty
We prefer this alignment
TCAAAGAGGATA
  
TCAGAGGGGGATA
instead of this other alignment
TCAAAGAGGATA
     
TCAGAGGGGGATA
We want few long gaps instead of many short gaps
Gap values must reflect how real insertions and deletions occur in nature
We observe that, once an indel event starts, it can easily grow
If the polymerase jumps, then it can jump a long distance
To represent this, we use affine gaps
So far we considered only linear gaps
The penalty of \(n\) consecutive gaps is \(n\cdot G\) (\(G\) is the gap penalty)
Now we consider affine gaps, where the first gap is expensive, but the following ones are cheap
The penalty of \(n\) consecutive gaps is \(I + n\cdot E\)
\(I\) is the initial gap penalty, \(E\) is the gap extension penalty
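To see why affine gaps favor few long gaps, the sketch below compares both formulas. The values \(G=2\), \(I=4\) and \(E=1\) are assumptions chosen only for illustration.

```python
def linear_gap_penalty(n, G):
    """Penalty of n consecutive gaps with a linear model: n * G."""
    return n * G


def affine_gap_penalty(n, I, E):
    """Penalty of n consecutive gaps with an affine model: I + n * E.
    No gap (n == 0) costs nothing."""
    return I + n * E if n > 0 else 0


G, I, E = 2, 4, 1  # illustrative values, not taken from any real tool

# One gap of length 3 versus three separate gaps of length 1
print(linear_gap_penalty(3, G), 3 * linear_gap_penalty(1, G))        # 6 6
print(affine_gap_penalty(3, I, E), 3 * affine_gap_penalty(1, I, E))  # 7 15
```

The linear model cannot tell the two cases apart, while the affine model charges much more for three separate gaps than for one long gap.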
The most common tool for local alignment is BLAST
Basic Local Alignment Search Tool
BLAST is not Global Alignment
There are two ways of using BLAST
Today we will use the NCBI web page
Depending on the alphabet of the query and subject
There are others. Today we only use these ones
These databases get data from several sources
Sometimes two people upload the same sequence but with different ID
For example, EMBL ID, GenBank ID, RefSeq ID, etc.
This database combines all identical entries into one, and keeps all the alternative IDs
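The idea of combining identical entries can be sketched as grouping identifiers by sequence. This is only an illustration of the concept, not how the database is actually built. The function `merge_identical` and the identifiers below are made up.

```python
from collections import defaultdict


def merge_identical(records):
    """Group identifiers by sequence.
    records: iterable of (identifier, sequence) pairs."""
    merged = defaultdict(list)
    for identifier, sequence in records:
        merged[sequence].append(identifier)
    return dict(merged)


# Made-up identifiers and sequences, for illustration only
records = [
    ("embl_example_1", "ACGTACGT"),
    ("genbank_example_1", "ACGTACGT"),
    ("refseq_example_2", "TTGACCAA"),
]
merged = merge_identical(records)
print(len(merged))         # 2: two distinct sequences remain
print(merged["ACGTACGT"])  # ['embl_example_1', 'genbank_example_1']
```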
Let’s look for Terje Steinum’s data
This is a 16S gene amplified by PCR
(see the course’s Homepage)