November 13th, 2018

What to do with the sequence

So you get the DNA sequence. What’s next?

What is this?

Imagine that you used PCR to cut the 16S gene

Identification of a sequence

We have today databases of known genes

For example, databases of 16S gene

16S gene (in bacteria) is

  • Universal
  • Highly conserved
  • Typically used to identify species
  • non-coding
  • each nucleotide is important

how do we compare sequences?

We want to compare our query sequence q to each sequence in the database

The database’s sequence is called subject, we write s

If they are identical, we are ready

If they are not identical, then we define a distance

Hamming distance

D(q,r) is the number of “mismatches”

GGGTGAGTAACACGTGGGTAACCTACCTCAAAGAGGGGG
GGGTGAGTAACACGTGGGCAACCTGCCTCAAAGTGGGGG

here we need that the sequences have the same length

Good properties of a distance

  • D(q,s)==D(s,q) (symmetry)
  • D(q,s)>=0 (non-negative)
  • if D(q,s)==0 then q==s (reflexivity)
  • D(q,s1) + D(q,s2) <= D(s1,s2) (triangular inequality)

Hamming distance has all these properties

But it is not the best distance for biological sequences

Sometimes we need gaps

When there is an insertion or deletion, one of the sequences is “shorter” than the other

Here we cannot use Hamming distance. Instead we use Edit distance, also called Levenshtein distance

Distance = mismatches + insertion + deletions

but there are many ways to put the gaps

We really want to find the gaps that produce “the best” combination

Alignment
put two sequences in line
Optimal Alignment
Align two sequences with gaps to get the minimum distance.
Choose the gaps “wisely”.

This problems is solved by parts

The first part is easy. All letters are identical.
We need to decide only for the new letter. We have only three options

mismatch

query   A   T   A   A   T
subject A   T   A   A   C

deletion

query   A   T   A   A   -
subject A   T   A   A   C

insertion

query   A   T   A   A   T
subject A   T   A   A   -

Distance value for the three options

  • D(q[1:5], s[1:5]) = D(q[1:4], s[1:4]) + D(q[5], s[5]) (match or mismatch)
  • D(q[1:4], s[1:5]) = D(q[1:4], s[1:4]) + gap(s[5]) (deletion)
  • D(q[1:5], s[1:4]) = D(q[1:4], s[1:4]) + gap(q[5]) (insertion)

Usually gaps are more “expensive”

Position in query may be different than position in subject

In general we have

  • D(q[1:i], s[1:j]) = D(q[1:(i-1)], s[1:(j-1)]) + D(q[i], s[j]) (match or mismatch)
  • D(q[1:(i-1)], s[1:j]) = D(q[1:(i-1)], s[1:(j-1)]) + gap(s[j]) (deletion)
  • D(q[1:i], s[1:(j-1)]) = D(q[1:(i-1)], s[1:(j-1)]) + gap(q[i]) (insertion)

which one to choose?

Easy: the “cheapest” one

In other words, we find the minimum

For this, programs use a matrix

Alignment matrix

One sequence in the rows

the other in the columns

Initially we have C[q[i], s[j]], defined as

  • 0 if q[i] == s[j]
  • 1 if q[i] <> s[j]

Second step

cumulate the total cost

M[i, j] = M[i-1, j-1] + C[q[i], s[j]]

the last corner of the matrix M has the distance between the two sequences

(but we have not yet used gaps)

Third step: gaps

moving vertical and horizontal

M[i, j] = min( M[i-1,j-1] + C[q[i], s[j]], 
               M[i,j-1] + gap(s[j]),
               M[i-1,j] + gap(q[i])
            )
  • when we delete, we change i without changing j
  • when we insert, we change j without changing i

Final solution depends on “gap cost”

If gaps are “cheap”, the method will use more gaps

If gaps are expensive, the alignment will have more mismatches