November 16th, 2018

In the previous class…

We showed how to calculate Edit distance

“Distance” means that smaller numbers are better

If the distance is 0, then the two sequences are identical

If the distance is small, the two sequences are similar

If the distance is big, the sequences are different

We can find edit distance efficiently

The strategy is called Dynamical Programming

Here “Programming” means “finding the best option”

  • divides a large problem into a series of smaller problems
  • uses the smaller solutions to build a solution to the larger problem

Today we are going to do it backwards

We will talk about the Needleman–Wunsch Algorithm

Published in 1970 by Saul B. Needleman and Christian D. Wunsch

They used dynamic programming to find a score for the alignment

The philosophy is the same, but we look for big numbers instead of small numbers

Initially there is no difference in the resulting alignment, but later it will be important.

Definitions

Let’s say we have two sequences q and s

They have length m and n, respectively

  • q="TTAATAATCAGTTTTCAT"
  • s="TTAACATTAGTATTTCA"

q[1:3] is "TTA". s[5:7] is "CAT"

We see that q[1:m] is the same as q, and s[1:n]=s

What we look for

We want to evaluate the score between q and s

We call it Score(q, s). We can write Score(q[1:m], s[1:n])

Since we divide the problem into smaller ones, we need to calculate Score(q[1:i], s[1:j]) for different values of i and j

To simplify, we write M[i, j] instead of Score(q[1:i], s[1:j])

Therefore the edit distance will be

Score(q, s) = Score(q[1:m], s[1:n]) = M[m, n]

Single-letter scores

If we pair two letters x and y, we need to know how they contribute to the score value. For example,

  • if x == y the score increases by 1
  • if x != y the score decreases by 1
    • (increases by -1)

In general, we have a substitution matrix C[x, y] for each x and y

If one of the letters is - (a gap), the score reduces by a value G

Starting from the end

Since Score(q, s) = M[m, n] we just need to evaluate M[m, n]

Let’s assume that we can solve the problem just before the last position of the alignment

We want to align the last two letters. We have three options

  • pairing q[m] with s[n], match or mismatch
  • pairing q[m] with a gap -, i.e. insertion
  • pairing a gap - with s[n], i.e. deletion

Score of the alignment

Which option do we take? To calculate M[m,n] we need to know the value of each option

  • pairing q[m] with s[n] has score C[q[m], s[n]]
    • M[m-1,n-1] + C[q[m], s[n]]
  • pairing q[m] with a gap - has penalty G
    • M[m-1,n] - G
  • pairing a gap - with s[n] has penalty G
    • M[m,n-1] - G

So we need M[m-1,n-1], M[m-1,n], and M[m,n-1]

Recursive Evaluation

We see that to calculate the optimal alignment score we need to calculate all the values of the matrix M[i,j]

In every step we choose the option that produces the highest score

M[i, j] = max(
              M[i-1, j-1] + C[q[i], s[j]],
              M[i-1, j] - G,
              M[i, j-1] - G
             )

Boundary conditions

To get M[m,n] we need M[m-1,n-1] and others

To get M[m-1,n-1] we need M[m-2,n-2] and others

What happens when we get to M[0,0], M[0,1] or M[1,0]?

That is, what happens when we look before the beginning of the sequences?

Before the sequences there are gaps

  • If one of the sequences is larger than the other
    • or if one is contained in the other
  • then we can put gaps outside the sequence
----------AACCTACCTCAAAGAGGGGGATA------------------
ACACGTGGGCAACCTGCCTCAAAGTGGGGGATAGCCTTCCGAAAGGAAGAT

To represent that, we fix the values of M[i,0] and M[0,j]

Gaps in the border

If we follow the idea that both sequences must be equal, then each gap outside will contribute with -G to the score

M[0,0] = 0, M[1,0] = -G, M[2,0] = -2*G, M[i,0] = -i*G

M[0,0] = 0, M[0,1] = -G, M[0,2] = -2*G, M[0,j] = -j*G

This strategy is called Global Alignment

Global v/s semi-global alignment

If instead we assume that one of the sequences is smaller than the other, then each gap outside the sequence must not change the score

M[0,0] = 0, M[1,0] = 0, M[2,0] = 0, M[i,0] = 0

M[0,0] = 0, M[0,1] = 0, M[0,2] = 0, M[0,j] = 0

This strategy is called Semi-Global Alignment

Where is the score?

In semi-global alignment it is not true that Score(q, s) = M[m, n]

Instead, we have to look at all the values in the last row

Score_semiglobal(q, s) = max(M[m, 1:n])

Scoring matrices

The scoring matrix (also known as a substitution matrix) has one row and one column for each letter in our alphabet

For example 4 rows and 4 columns for DNA sequences

In DNA it is common to have +1 for match and -2 for mismatch

Sometimes we can use (+1,-3), (+1,-1) or even (4,-5)

DNA scoring matrix

   A  C  G  T
A  1 -2 -2 -2
C -2  1 -2 -2
G -2 -2  1 -2
T -2 -2 -2  1

Amino-acid scoring matrices

For proteins the alphabet has 20 letters, so the scoring matrix is 20x20

The scores of these matrices are based on the relative frequency of each mutation in nature

Obviously, we only care about non-lethal mutations

More specifically, we care about the Point Mutations that are Accepted

These are called PAM matrices (Margaret Dayhoff, 1978)

PAM matrices

Based on 1572 mutations in 71 families of closely related proteins

Dayhoff only used alignments having at least 85% identity

She assumed that any aligned mismatches were the result of a single mutation event, rather than several at the same location

PAM1 is the scoring matrix for proteins in a “short” period

With some easy mathematics we can calculate PAM2, PAM30, PAM250, etc

They represent mutations in 2, 30, and 250 short periods

PAM30 matrix

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W
A   6  -7  -4  -3  -6  -4  -2  -2  -7  -5  -6  -7  -5  -8  -2   0  -1 -13
R  -7   8  -6 -10  -8  -2  -9  -9  -2  -5  -8   0  -4  -9  -4  -3  -6  -2
N  -4  -6   8   2 -11  -3  -2  -3   0  -5  -7  -1  -9  -9  -6   0  -2  -8
D  -3 -10   2   8 -14  -2   2  -3  -4  -7 -12  -4 -11 -15  -8  -4  -5 -15
C  -6  -8 -11 -14  10 -14 -14  -9  -7  -6 -15 -14 -13 -13  -8  -3  -8 -15
Q  -4  -2  -3  -2 -14   8   1  -7   1  -8  -5  -3  -4 -13  -3  -5  -5 -13
E  -2  -9  -2   2 -14   1   8  -4  -5  -5  -9  -4  -7 -14  -5  -4  -6 -17
G  -2  -9  -3  -3  -9  -7  -4   6  -9 -11 -10  -7  -8  -9  -6  -2  -6 -15
H  -7  -2   0  -4  -7   1  -5  -9   9  -9  -6  -6 -10  -6  -4  -6  -7  -7
I  -5  -5  -5  -7  -6  -8  -5 -11  -9   8  -1  -6  -1  -2  -8  -7  -2 -14
L  -6  -8  -7 -12 -15  -5  -9 -10  -6  -1   7  -8   1  -3  -7  -8  -7  -6
K  -7   0  -1  -4 -14  -3  -4  -7  -6  -6  -8   7  -2 -14  -6  -4  -3 -12
M  -5  -4  -9 -11 -13  -4  -7  -8 -10  -1   1  -2  11  -4  -8  -5  -4 -13
F  -8  -9  -9 -15 -13 -13 -14  -9  -6  -2  -3 -14  -4   9 -10  -6  -9  -4
P  -2  -4  -6  -8  -8  -3  -5  -6  -4  -8  -7  -6  -8 -10   8  -2  -4 -14
S   0  -3   0  -4  -3  -5  -4  -2  -6  -7  -8  -4  -5  -6  -2   6   0  -5
T  -1  -6  -2  -5  -8  -5  -6  -6  -7  -2  -7  -3  -4  -9  -4   0   7 -13
W -13  -2  -8 -15 -15 -13 -17 -15  -7 -14  -6 -12 -13  -4 -14  -5 -13  13
Y  -8 -10  -4 -11  -4 -12  -8 -14  -3  -6  -7  -9 -11   2 -13  -7  -6  -5
V  -2  -8  -8  -8  -6  -7  -6  -5  -6   2  -2  -9  -1  -8  -6  -6  -3 -15
B  -3  -7   6   6 -12  -3   1  -3  -1  -6  -9  -2 -10 -10  -7  -1  -3 -10
J  -6  -7  -6 -10  -9  -5  -7 -10  -7   5   6  -7   0  -2  -7  -8  -5  -7
Z  -3  -4  -3   1 -14   6   6  -5  -1  -6  -7  -4  -5 -13  -4  -5  -6 -14
X  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
* -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17
    Y   V   B   J   Z   X   *
A  -8  -2  -3  -6  -3  -1 -17
R -10  -8  -7  -7  -4  -1 -17
N  -4  -8   6  -6  -3  -1 -17
D -11  -8   6 -10   1  -1 -17
C  -4  -6 -12  -9 -14  -1 -17
Q -12  -7  -3  -5   6  -1 -17
E  -8  -6   1  -7   6  -1 -17
G -14  -5  -3 -10  -5  -1 -17
H  -3  -6  -1  -7  -1  -1 -17
I  -6   2  -6   5  -6  -1 -17
L  -7  -2  -9   6  -7  -1 -17
K  -9  -9  -2  -7  -4  -1 -17
M -11  -1 -10   0  -5  -1 -17
F   2  -8 -10  -2 -13  -1 -17
P -13  -6  -7  -7  -4  -1 -17
S  -7  -6  -1  -8  -5  -1 -17
T  -6  -3  -3  -5  -6  -1 -17
W  -5 -15 -10  -7 -14  -1 -17
Y  10  -7  -6  -7  -9  -1 -17
V  -7   7  -8   0  -6  -1 -17
B  -6  -8   6  -8   0  -1 -17
J  -7   0  -8   6  -6  -1 -17
Z  -9  -6   0  -6   6  -1 -17
X  -1  -1  -1  -1  -1  -1 -17
* -17 -17 -17 -17 -17 -17   1

Gap penalty

We prefer this alignment

GGGTAACCTACCTC
GGGCAACCTGCCTC

instead of this other alignment

GGGT-AACCTA-CCTC
GGG-CAACCT-GCCTC

Thus, gap penalty must be greater than mismatch penalty

Long gaps are more common than short ones

We prefer this alignment

TCAAAGAG---GATA
TCA--GAGGGGGATA

instead of this other alignment

TCAAAGA-G-G-ATA
TC-A-GAGGGGGATA

We want few long gaps instead of many short gaps

Affine gaps

Gap values must reflect how real insertions and deletions occur in nature

We observe that, once an indel event starts, it can easily grow

If the polymerase jumps, then it can jump a long distance

To represent this, we use affine gaps

Linear v/s affine

So far we considered only linear gaps

The penalty of n consecutive gaps is n*G

Now we consider affine gaps, where the first gap is expensive, but the consecutive are cheap

The penalty of n consecutive gaps is G + n*E

G is the initial gap penalty, E is the gap extension penalty

Homework

Modify the Excel spreadsheet that we did last class to handle border gaps

  • in global alignment
  • in semi-global alignment