Class 8: Cost, Indices, Heuristics

Bioinformatics

Andrés Aravena

20 November 2020

Global, semi-global, local

Three answers to three different questions

  • Align sequence to sequence

  • Align sequence to subsequence

  • Align subsequence to subsequence

BLAST in Summary

  • BLAST is local, not global

  • Each subject gets a score,

    • if the score is over a threshold, it is a hit
  • The threshold depends on the chosen E-value

Substitution Score Matrix

  • The score depends on the substitution matrix and gap costs

  • These depend on evolutionary hypotheses

    • Learn about PAM and BLOSUM
  • Choose the matrix wisely

  • Score does not depend on the database

E-value depends on database size

  • Databases may change with time

  • Therefore E-values may change

  • Choose your database wisely

  • Write down the date of the search
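
A small numeric sketch of why the E-value moves with database size while the score does not. It assumes the Karlin-Altschul form E = K·m·N·e^(−λS), where N is the total database length. The function name and the values of K and λ are illustrative placeholders, not the parameters of any particular matrix.

```python
import math

def e_value(score, query_len, db_len, K=0.13, lam=0.32):
    """Expected number of chance hits with at least this score.

    Assumes E = K * m * N * exp(-lambda * S).
    K and lam are illustrative placeholder values.
    """
    return K * query_len * db_len * math.exp(-lam * score)

score = 60        # the alignment score stays the same
query_len = 100
for db_len in (1e6, 1e7, 1e8):
    # Only the database length changes between the three lines printed
    print(f"database length {db_len:.0e}: E = {e_value(score, query_len, db_len):.2e}")
```

Under this assumption a database ten times larger gives an E-value ten times larger for the same alignment.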

Comparing two sequences gives a distance

Aligning two sequences gives a score

Searching a sequence in a database gives an E-value

Computational Cost

What is computational cost

  • How much time does it take to
    • find the optimal alignment?
    • search the database?
  • How much computer memory is needed?

It depends on the technology and the algorithm

Let’s ignore the technology

Computational cost

  • How many comparisons/multiplications are needed?

  • It depends on query and subject size, so \[Cost=f(m, n)\] where \(m\) and \(n\) are the query and subject lengths

  • So, what is the formula?

Cost depends on input size

Small sizes take short time

Larger sizes take longer time

Big-O notation

We do not care about specific numbers

For example \(100m^2 n^4\) is equivalent to \(m^2 n^4\)

We say “the cost is on the order of \(m^2 n^4\)”

We write \[O(m^2 n^4)\]
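
A quick check of why the constant can be dropped: when the input sizes grow, the cost grows by the same factor with or without the 100. The function names are made up for this illustration.

```python
def cost_with_constant(m, n):
    return 100 * m**2 * n**4

def cost_without_constant(m, n):
    return m**2 * n**4

# Doubling both sizes multiplies either cost by 2**2 * 2**4 = 64.
for cost in (cost_with_constant, cost_without_constant):
    growth = cost(20, 20) / cost(10, 10)
    print(cost.__name__, growth)
```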

Computational cost of Alignment

Size of dot plot matrix is \(m\cdot n\)

Building dot plot matrix takes time \(m\cdot n\)

Then we have to find “the best diagonals”

Thus, computational cost is \(O(m\cdot n)\)
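
A minimal sketch of building a dot plot matrix while counting the comparisons. The function name and the two short sequences are made up for this example.

```python
def dot_plot(query, subject):
    """Return the m x n dot plot matrix and the number of comparisons made."""
    comparisons = 0
    matrix = []
    for q in query:
        row = []
        for s in subject:
            comparisons += 1
            row.append(1 if q == s else 0)
        matrix.append(row)
    return matrix, comparisons

matrix, comparisons = dot_plot("GATTACA", "ATTAC")
for row in matrix:
    print("".join("*" if cell else "." for cell in row))
print("comparisons:", comparisons)   # always len(query) * len(subject)
```

Every cell is visited exactly once, which is where the \(m\cdot n\) comes from.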

Do not confuse the matrices

A substitution matrix is 4×4 (nucleotides) or 20×20 (amino acids)

It is also known as a scoring matrix

It is symmetric

It tells us what to write on each cell of dot plot matrix
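
A sketch that keeps the two matrices apart: a small 4×4 substitution matrix, and the \(m\cdot n\) matrix that it fills. The scores (match +2, transition −1, transversion −2) are made-up values for illustration, not PAM or BLOSUM.

```python
NUCLEOTIDES = "ACGT"
PURINES = {"A", "G"}

def toy_score(a, b):
    """Made-up scores: match +2, transition -1, transversion -2."""
    if a == b:
        return 2
    same_class = (a in PURINES) == (b in PURINES)
    return -1 if same_class else -2

# The 4x4 substitution (scoring) matrix, stored as a dictionary of pairs.
SUBSTITUTION = {(a, b): toy_score(a, b) for a in NUCLEOTIDES for b in NUCLEOTIDES}

def score_matrix(query, subject):
    """Fill the m x n dot plot matrix with substitution scores."""
    return [[SUBSTITUTION[q, s] for s in subject] for q in query]

for row in score_matrix("GATT", "GACT"):
    print(row)
```

The substitution matrix has a fixed size. The matrix returned by `score_matrix` grows with the sequences.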

Looking for the diagonals

For local alignment we have two cases

  • Ungapped, where we can only move diagonally
  • Gapped, where we can move diagonally, horizontally, and vertically
    • Horizontal and vertical movements give gaps

Global and semi-global alignments always have gaps
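
A minimal sketch of gapped local alignment scoring, in the style of Smith-Waterman. The scores (match +2, mismatch −1, gap −2) are assumptions chosen for the example. Each cell looks at its diagonal, vertical, and horizontal neighbours, so the number of steps is still on the order of \(m\cdot n\).

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best score of any local alignment between query and subject."""
    m, n = len(query), len(subject)
    # H[i][j] = best score of a local alignment ending at query[i-1], subject[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a new local alignment here
                H[i - 1][j - 1] + s,    # diagonal: match or mismatch
                H[i - 1][j] + gap,      # vertical: gap in the subject
                H[i][j - 1] + gap,      # horizontal: gap in the query
            )
            best = max(best, H[i][j])
    return best

print(local_alignment_score("ACGT", "TTACGTTT"))
```

Removing the two `gap` lines from the `max` gives the ungapped case, where only diagonal moves are allowed.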

What was the cost?

Indices

Now try with this

a  1,2 2,13 3,7 3,11 4,10 5,9 7,3 8,8 9,7 9,11 10,3 12,8 12,13 13,5 17,15 23,5 41,11 43,11
about  37,8
above-named  19,7
absorbing  44,11
acre  24,5
adventure  41,8
afflicted  30,6
again  31,6 36,11
age  11,9
all  20,5 25,13 37,6 39,13
almost  21,11
although  15,8
among  14,12
an  3,4 4,2 24,4
and  3,10 5,8 7,8 7,11 10,2 10,8 13,4 21,7 22,7 23,2 23,10 24,15 27,6 28,12 34,5 35,2 38,1 39,10 40,2 41,9 42,8 43,9 44,9
ardour  21,6
aristotle  35,10
as  11,2 11,4 25,3 25,7 26,5 27,10 38,10 42,12
at  19,12 31,2 37,5
author's  40,8
authors  15,2
avidity  21,8
awake  34,9
away  6,4
be  17,10
beauty  31,4
because  38,3
beef  4,7
belianis  37,13
best  8,13
bill-hook  11,7
body  39,11
book  41,1
books  21,1 24,10
bordering  12,2
brave  8,9
breadth  18,2
breeches  7,10
brought  25,1
buckler  3,6
but  17,2 25,11
buy  24,9
call  1,16
called  16,8
came  28,9
cartels  29,1
chivalry  21,3 24,12
cloth  7,7
come  36,8
commended  40,5
complicated  27,7
composition  26,13
conceits  27,8 33,6
conjectures  16,1
could  25,9 35,12
coursing  4,1
courtships  28,11
covered  39,12
cured  39,2
de  26,11
desert  33,1
deserves  33,4
deserving  32,10
desire  1,14
did  23,7
difference  14,9
divinely  32,2
divinity  32,1
don  37,12
done  43,8
doublet  7,4
doubt  43,4
eagerness  23,9
early  13,2
easy  37,7
ending  40,11
enough  17,11
entirely  21,12
even  22,8
extra  5,13
extracted  36,5
face  39,9
famous  26,9
feliciano  26,10
field  10,7
field-sports  22,6
fifty  12,4
figure  8,10
fine  7,6
finish  42,9
for  3,13 8,1 10,5 14,4 27,1 36,12
fortify  32,3
forty  9,10
found  29,5
fridays  5,7
from  15,9 18,3
gaunt-featured  12,12
gave  20,9 37,14
gentleman  11,12 19,8 34,1
gentlemen  2,10
get  25,10
go  23,12
great  13,6 38,9
greater  44,8
greatness  33,3
greyhound  3,12
habit  12,10
hack  3,9 11,1
had  9,3 36,6 39,1 39,7 44,6
hair's  18,1
handle  11,5
hardy  12,9
have  1,12 13,10 36,1 39,6 43,7
he  8,6 9,2 12,5 16,6 19,10 21,10 24,1 25,8 26,1 28,8 29,3 36,7 37,2 39,4 40,4 42,1 43,5
heavens  31,9
here  14,5
high  31,8
him  38,7 39,3 45,1
himself  20,10 35,11
his  6,8 8,12 9,5 13,12 22,5 22,12 23,8 28,1 28,6 34,3 39,8 40,12 42,6
holidays  8,2
home  25,2
homespun  9,1
house  9,6
housekeeper  9,8
however  16,11 40,6
i  1,11 30,14
importance  17,4
in  1,1 3,1 7,2 8,11 9,4 18,6 27,12 28,5
income  6,9
infatuation  23,11
interminable  41,7
is  14,7 16,12 30,5 42,13
it  6,13 13,11 16,2 17,8 18,10 38,4 42,10 44,4
keep  2,12
know  19,3
la  1,5
lad  10,4
lance  2,14
lance-rack  3,3
lean  3,8
leisure  20,1
lentils  5,5
lie  34,8
life  36,10
like  29,7
liked  26,2
little  17,3
lived  2,3
long  2,5
lost  34,2
lucidity  27,3
made  6,3 8,7 36,2 43,10
management  22,10
mancha  1,6
many  24,3 25,4 41,10
market-place  10,9
match  7,14
meaning  35,5
mind  2,1
more  4,6 44,10
most  4,13
mostly  20,4
murmur  31,1
must  19,2 39,5
mutton  4,9
my  30,3 30,9
name  1,8
neglected  22,1
niece  9,12
nights  5,1
no  1,13 43,3
none  25,16
not  2,4 17,12 35,13 37,4 44,7
of  1,4 1,9 2,8 4,4 6,7 6,12 7,5 11,10 11,13 12,7 14,10 17,1 18,9 21,2 22,4 22,11 24,6 24,11 25,5 25,12 26,7 27,4 29,10 31,11 32,11 33,7 35,7 40,10 41,5 44,1 44,3
often  29,4
old  3,5
olla  4,3
on  4,12 5,3 5,6 6,1 8,4 12,3 15,5
one  2,7
opinion  14,11
or  5,11 14,2 31,5 36,4
our  17,6
ours  11,14
out  35,6 36,3
over  33,5 39,14
particularly  28,3
passages  29,6
past  9,9
pearls  27,11
pen  42,7
piece  43,13
pigeon  5,10
pitch  23,6
plain  16,4
poor  33,11
prevented  44,13
promise  41,4
properly  42,11
property  23,1
proposed  43,1
purpose  37,1
pursuit  22,3
quejana  16,9
quesada  14,3
quijada  14,1
rather  4,5
read  24,14
reading  20,13 28,7
reason  29,9 30,4 30,10 30,13
reasonable  15,10
render  32,8
rest  6,11
riser  13,3
round  20,8
saddle  10,13
salad  4,11
saturdays  5,4
scars  40,3
scraps  5,2
seams  40,1
seemed  38,5
seems  16,3
shoes  7,12
sight  28,2
silva's  26,12
since  2,6
so  5,12 26,3 30,7
sold  24,2
some  14,8
sort  33,9
spare  12,11
special  36,14
sportsman  13,7
stars  32,7
stray  17,14
striving  34,10
style  27,5
subject  15,7
successful  43,12
such  21,5 23,4
sundays  6,2
surgeons  38,13
surname  13,13
take  42,4
tale  17,7
telling  18,8
tempted  42,2
than  4,8
that  2,11 16,5 19,5 21,9 23,13 30,11 31,10 36,13 38,8 41,6
the  1,7 3,2 6,10 10,6 10,14 11,6 11,8 15,1 15,6 18,4 18,7 19,6 20,6 22,2 22,9 26,8 29,8 29,11 31,7 32,6 32,12 33,10 35,4 37,9 38,12 40,7 41,3
their  27,2
them  25,6 35,1 35,8
then  19,4
there  2,2 14,6 25,14 42,14
they  13,8
this  11,11 16,10 33,8
those  2,9 26,6
thoughts  44,12
three-quarters  6,6
tillageland  24,7
time  41,12
to  1,15 1,17 7,13 10,12 17,5 17,13 20,12 23,3 24,8 24,13 34,7 34,11 36,9 38,6 42,3
too  44,5
took  38,2
truth  18,5
twenty  10,1
under  9,13
understand  34,12
unreason  29,12
up  20,11 42,5
upon  28,10
used  10,11 34,6
velvet  7,9
very  13,1
village  1,3
was  12,1 12,6 13,14 16,7 19,11 20,3 37,3 41,13
way  40,9
weakens  30,8
week-days  8,5
well  11,3 26,4
went  7,1
were  25,15 27,9 38,11
what  35,9
when  28,4
whenever  19,9
where  29,2
which  1,10 20,2 30,2 37,11 43,2
while  8,3
who  10,10 15,3 38,14
will  13,9 17,9
with  6,5 21,4 30,1 30,12 32,5 39,15 41,2
wits  34,4
work  44,2
worm  35,3
would  43,6
wounds  37,10
write  15,4
year  20,7
you  19,1 32,4 32,9
your  31,3 31,12 33,2

That is an Index

  • This works for exact match
  • It does not replace the original text
  • Takes extra space
  • Search is faster
  • Sorted files are faster to search
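
A sketch of how an index like the one above can be built. Each word maps to its "line,word" positions. The two lines of text are reconstructed from the first entries of the index. The function name and the punctuation handling are assumptions.

```python
from collections import defaultdict

def build_index(text):
    """Map each word to the list of (line, position) pairs where it occurs."""
    index = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word_no, raw in enumerate(line.lower().split(), start=1):
            word = raw.strip('.,;:!?"()')
            if word:
                index[word].append((line_no, word_no))
    return index

text = (
    "In a village of La Mancha, the name of which I have no desire to call to\n"
    "mind, there lived not long since one of those gentlemen that keep a lance"
)
index = build_index(text)
for word in sorted(index):
    print(word, " ".join(f"{line},{pos}" for line, pos in index[word]))
```

Looking up a word is now one dictionary access instead of reading the whole text. The price is the extra space for the index.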

What is the cost of searching a sorted file?

Divide-and-conquer strategy

The database is \((s_1,…,s_d)\)

We need two auxiliary variables. Let \(l ←1, u ←d\)

To search in a sorted file, we start in the middle

  • Let \(i ← \lfloor (l+u)/2 \rfloor\)
  • If \(q=s_i,\) we found it.
  • If \(q<s_i\) then we can ignore the second half of the database
    • \(u ← i-1\), and repeat
  • If \(q>s_i\) then we can ignore the first half of the database
    • \(l ← i+1\), and repeat
  • If \(l>u\), the query is not in the database
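
A sketch of this search in Python. It takes the middle as the midpoint between the two bounds and moves a bound past the entry just compared, so the loop always ends. The function name and the small word list are chosen for the example.

```python
def binary_search(sorted_items, query):
    """Return (position, comparisons); position is None if query is absent."""
    lower, upper = 0, len(sorted_items) - 1   # Python lists start at 0, not 1
    steps = 0
    while lower <= upper:
        middle = (lower + upper) // 2
        steps += 1
        if query == sorted_items[middle]:
            return middle, steps
        if query < sorted_items[middle]:
            upper = middle - 1     # ignore the second half
        else:
            lower = middle + 1     # ignore the first half
    return None, steps

words = sorted(["a", "about", "age", "all", "beef", "book", "lance", "village"])
print(binary_search(words, "book"))
print(binary_search(words, "windmill"))
```

For everyday use, Python's standard `bisect` module offers the same kind of search on sorted lists.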

Search space halves every time

In a sorted file, we can discard half of the database after every comparison

So we need to compare the query with about \(\log_2(d)\) subjects

Thus, the search cost is \(O(mn\log_2(d))\)

With numbers

Let’s say that \(m=100, n=100.\) Then

d       plain database (comparisons)   time         with index (comparisons)   time
1e+03   1e+07                          10 sec       99658                      0.1 sec
1e+04   1e+08                          1.7 min      132877                     0.13 sec
1e+05   1e+09                          16.7 min     166096                     0.17 sec
1e+06   1e+10                          2.8 hours    199316                     0.2 sec
1e+07   1e+11                          1.2 days     232535                     0.23 sec
1e+08   1e+12                          11.6 days    265754                     0.27 sec
1e+09   1e+13                          115.7 days   298974                     0.3 sec

assuming \(10^{6}\) comparisons each second
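
The numbers in the table can be reproduced with a few lines. This sketch assumes the plain search costs \(m\cdot n\cdot d\) comparisons and the indexed search costs \(m\cdot n\cdot\log_2(d)\). The function names are made up.

```python
import math

M, N = 100, 100
COMPARISONS_PER_SECOND = 1e6

def plain_cost(d):
    """Compare the query with every subject in the database."""
    return M * N * d

def indexed_cost(d):
    """Compare the query with about log2(d) subjects."""
    return M * N * math.log2(d)

for exponent in range(3, 10):
    d = 10 ** exponent
    plain_seconds = plain_cost(d) / COMPARISONS_PER_SECOND
    indexed_seconds = indexed_cost(d) / COMPARISONS_PER_SECOND
    print(f"d={d:.0e}  plain={plain_seconds:.0f} s  indexed={indexed_seconds:.2f} s")
```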

Index, gaps and indels

All previous discussion assumed exact match

It can be extended to partial match

Let’s say that there are at most 3 mismatches (indels or gaps) between query and the best subject

  • Split the query into 4 parts
  • At least one of the parts has zero mismatches
  • Look up each part using the index
  • See if the other parts match with indels and gaps

In general

To find a partial match with at most \(k\) mismatches

  • Split the query into \((k+1)\) parts
  • At least one of the parts will have an exact match
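
A sketch of the splitting idea. The function names and sequences are made up, and `str.find` stands in for the index lookup. It only handles substitutions exactly. With indels the implied start position can shift by a few places, so every candidate still has to be checked.

```python
def split_query(query, max_mismatches):
    """Split query into max_mismatches + 1 parts of nearly equal length."""
    parts = max_mismatches + 1
    size, extra = divmod(len(query), parts)
    pieces, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        pieces.append((start, query[start:end]))
        start = end
    return pieces

def candidate_positions(query, subject, max_mismatches):
    """Subject offsets where some part of the query matches exactly."""
    candidates = set()
    for offset, piece in split_query(query, max_mismatches):
        pos = subject.find(piece)          # stands in for an index lookup
        while pos != -1:
            candidates.add(pos - offset)   # where the whole query would start
            pos = subject.find(piece, pos + 1)
    return sorted(candidates)

query = "ACGTACGTTGCA"
# The query with 3 substitutions, placed at position 4 of the subject
subject = "TTTT" + "AGGTTCGATGCA" + "CCCC"
print(split_query(query, 3))
print(candidate_positions(query, subject, 3))
```

Some candidates are false alarms, because a short part can match by chance somewhere else. That is why each candidate needs a second, more careful comparison.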

This is how BLAST works

BLAST uses indices to look for an initial hit

(sometimes called seed)

Then it tries to extend it by building a dot-matrix around the hit

The key parameter is Word size

After finding a seed,
BLAST must validate it

Word size tradeoff

  • Words of larger size are faster to search

    • But they may miss subjects with many mismatches
  • Small word size is more sensitive

    • but it takes longer to find them all
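
A toy illustration of the tradeoff. It counts exact word matches (seeds) between a query and a subject for several word sizes. The sequences and function names are made up, and the real word lookup in BLAST is more elaborate than this.

```python
from collections import defaultdict

def word_index(subject, word_size):
    """Index every word of the subject by its start position."""
    index = defaultdict(list)
    for i in range(len(subject) - word_size + 1):
        index[subject[i:i + word_size]].append(i)
    return index

def count_seeds(query, subject, word_size):
    """Number of exact word matches (seeds) between query and subject."""
    index = word_index(subject, word_size)
    return sum(len(index.get(query[i:i + word_size], []))
               for i in range(len(query) - word_size + 1))

query = "ACGTACGTTGCA"
subject = "TTTTAGGTTCGATGCACCCC"   # contains the query with 3 substitutions
for word_size in (3, 4, 5):
    print(word_size, count_seeds(query, subject, word_size))
```

Fewer seeds mean less work to validate. With zero seeds the subject is missed completely.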

BLAST does not guarantee
that it finds all matching sequences

BLAST is a Heuristic

  • BLAST can miss some cases
  • It does not solve the initial question
    • because the initial question is too hard
  • It is a heuristic
    • It gives the correct answer to a similar question
    • or it gives an approximate answer

Algorithm vs. Heuristic

  • An Algorithm is a protocol that always delivers a correct answer to a specific problem in a finite number of steps
    • for example, when we divide two numbers
    • This was first shown in Al-Khwarizmi’s book
  • A Heuristic is a protocol that solves a simplified problem, or that gives an approximate answer at a lower cost than the complete algorithm
    • For example, instead of asking everybody, we can take a small random sample
    • Or computers playing chess

Nasreddin Hodja knows heuristics

Most bioinformatics tools are heuristics

In particular Multiple Alignment

Summary

  • Computational Cost:
    • number of steps, depending on input size
    • we only care about the order of the cost, not specific numbers
  • Big-O notation: forget about constants
  • Indices speed up searches
  • Indices can be used in partial matches
  • BLAST is a heuristic
  • Heuristics: approximate answers

BLAST practice

Each request has an ID

Good information systems assign an ID to everything

NCBI assigns a Request ID to each request

Write it down

Use it to recover the result up to 36 hours later

Write down the Request ID

You can look for the results later
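
A sketch of how a saved Request ID can be turned into a results link. It assumes the `CMD=Get`, `RID` and `FORMAT_TYPE` parameters of the NCBI BLAST URL API. The function name is made up and the ID shown is a placeholder, not a real request.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"

def result_url(request_id, format_type="Text"):
    """Build the URL that asks NCBI for the results of a stored request."""
    params = {"CMD": "Get", "RID": request_id, "FORMAT_TYPE": format_type}
    return f"{BLAST_URL}?{urlencode(params)}"

# "EXAMPLE0RID" is a placeholder, not a real Request ID
print(result_url("EXAMPLE0RID"))
```

The link only works while NCBI still keeps the result.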

Save search strategy

Long term storage

Must be logged into myNCBI

Result page

  • Job description
  • Descriptions
  • Graphic Summary
  • Alignments

Descriptions

  • Subject accession number
  • Subject description
  • Score
  • Query coverage
  • E-value
  • Percentage identity

You can choose which columns to show

Graphic Summary

Alignments

This is the subsequence that matches, not the whole subject

  • HTML / Plain Text
  • Alignment view
    • Pairwise
    • Pairwise with dots for identities
    • Query-anchored with dots for identities
    • Query-anchored with letters for identities
    • Flat Query-anchored with dots for identities
    • Flat Query-anchored with letters for identities

Download results

  • Text
  • XML
  • ASN.1
  • JSON Seq-align
  • Hit Table (text)
  • Hit Table (csv)
  • SAM