# Bioinformatics

## Today

• Cost of searching
• Heuristic
• Indices
• Word Size

# Computational cost

## What is computational cost?

• How much time does it take to find the optimal alignment?
• How much time does it take to search the database?
• How much computer memory is needed?

It depends on the technology and the algorithm

Let’s ignore the technology

## Cost depends on input size

Small inputs take a short time

Larger inputs take a longer time

## Computational cost of alignment

• How many comparisons/multiplications are needed?

• It depends on the query and subject sizes, so $$\text{Cost}=f(m, n)$$ where $$m$$ and $$n$$ are the query and subject lengths

• So, what is the formula?

## Big-O notation

We care about how the time depends on the input size

We do not care about constant factors

For example $$100m^2 n^4$$ is equivalent to $$m^2 n^4$$

We say “the cost is on the order of $$m^2 n^4$$”

We write $$O(m^2 n^4)$$
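A small Python sketch of why the constant can be dropped. The cost function and the sizes used here are made up for illustration only. Doubling $$n$$ multiplies the cost by the same factor whether or not the constant 100 is present.

```python
def cost(m, n, constant=1):
    """Hypothetical cost function: constant * m^2 * n^4."""
    return constant * m**2 * n**4


def growth_when_doubling_n(constant, m=10, n=10):
    """How many times larger the cost gets when n is doubled."""
    return cost(m, 2 * n, constant) / cost(m, n, constant)


ratio_plain = growth_when_doubling_n(constant=1)
ratio_scaled = growth_when_doubling_n(constant=100)

# Both ratios are 2^4 = 16: the constant cancels out.
print(ratio_plain, ratio_scaled)
```

The constant changes how long one run takes, but not how the time grows with the input size, and the growth is what Big-O notation describes.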

## Computational cost of Alignment

Size of dot plot matrix is $$m\cdot n$$

Building dot plot matrix takes time $$m\cdot n$$

Then we have to find “the best diagonals”

Thus, the computational cost is $$O(m\cdot n)$$
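As a rough illustration, the Python sketch below builds a dot plot matrix and counts the letter comparisons. The function name `dot_plot` and the two short sequences are assumptions chosen for the example. It covers only the matrix-building step, not the search for the best diagonals.

```python
def dot_plot(query, subject):
    """Build an m x n matrix with 1 where the letters match.

    Also returns the number of letter comparisons performed.
    """
    comparisons = 0
    matrix = []
    for q in query:                # m rows
        row = []
        for s in subject:          # n columns
            row.append(1 if q == s else 0)
            comparisons += 1
        matrix.append(row)
    return matrix, comparisons


matrix, comparisons = dot_plot("GATTACA", "GATCACA")

# Print the matrix: '*' marks a match, '.' a mismatch.
for row in matrix:
    print("".join("*" if cell else "." for cell in row))

# Both sequences have 7 letters, so 7 * 7 comparisons are made.
print(comparisons)
```

The two nested loops make the $$m\cdot n$$ cost visible: every letter of the query is compared with every letter of the subject.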

## That is an Index

• This works for exact match
• It does not replace the original text
• Takes extra space
• Search is faster
• Sorted files are faster to search

## Divide-and-conquer strategy

The database is $$(s_1,…,s_d)$$

We need two auxiliary variables. Let $$l ←1, u ←d$$

To search in a sorted file, we start in the middle

• Let $$i ← \lfloor (l+u)/2 \rfloor$$
• If $$q=s_i,$$ we found it.
• If $$q<s_i$$ then we can ignore the second half of the database
• $$u ← i-1$$, and repeat
• If $$q>s_i$$ then we can ignore the first half of the database
• $$l ← i+1$$, and repeat
• If $$l>u$$, the query is not in the database
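A minimal Python sketch of this search, assuming the database is a sorted list of strings. Python lists are 0-based, so the bounds start at 0 and $$d-1$$ instead of 1 and $$d$$, and the middle position is taken as the midpoint of the current range. The name `binary_search` and the `steps` counter are illustrative additions.

```python
def binary_search(database, query):
    """Search a sorted list for query.

    Returns (index, steps): the position of the query, or -1 if it is
    absent, and the number of subjects that were examined.
    """
    low, high = 0, len(database) - 1
    steps = 0
    while low <= high:
        mid = (low + high) // 2      # middle of the current range
        steps += 1
        if query == database[mid]:
            return mid, steps        # found it
        if query < database[mid]:
            high = mid - 1           # ignore the second half
        else:
            low = mid + 1            # ignore the first half
    return -1, steps                 # the range is empty: not found


database = ["ACGT", "AGGA", "CATG", "GATT", "TTAG", "TTGA"]  # already sorted

print(binary_search(database, "GATT"))   # present in the list
print(binary_search(database, "CCCC"))   # absent from the list
```

Each pass through the loop removes about half of the remaining subjects, so the loop cannot run many more than $$\log_2(d)$$ times.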

## Search space halves every time

In a sorted file, we can discard half of the database after every comparison

So we need to compare the query with only about $$\log_2(d)$$ subjects

Thus, the search cost is $$O(mn\log_2(d))$$

## With numbers

Let’s say that $$m=100, n=100.$$ Then

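| $$d$$ | Comparisons (plain database) | Time (plain database) | Comparisons (with index) | Time (with index) |
|---|---|---|---|---|
| 1000 | 1e+07 | 10 sec | 99658 | 0.1 sec |
| 10000 | 1e+08 | 1.7 min | 132877 | 0.13 sec |
| 1e+05 | 1e+09 | 16.7 min | 166096 | 0.17 sec |
| 1e+06 | 1e+10 | 2.8 hours | 199316 | 0.2 sec |
| 1e+07 | 1e+11 | 1.2 days | 232535 | 0.23 sec |
| 1e+08 | 1e+12 | 11.6 days | 265754 | 0.27 sec |
| 1e+09 | 1e+13 | 115.7 days | 298974 | 0.3 sec |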

assuming one million comparisons per second
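These figures can be reproduced with a short Python sketch. It assumes the plain search costs $$m\cdot n\cdot d$$ comparisons and the indexed search costs $$m\cdot n\cdot\log_2(d)$$, at one million comparisons per second. The function names and the constant `RATE` are illustrative.

```python
import math

RATE = 1_000_000  # assumed speed: comparisons per second


def plain_cost(m, n, d):
    """Comparisons needed to align the query against every subject."""
    return m * n * d


def indexed_cost(m, n, d):
    """Comparisons needed when a sorted index halves the search each step."""
    return m * n * math.log2(d)


m, n = 100, 100
for exponent in range(3, 10):
    d = 10 ** exponent
    plain = plain_cost(m, n, d)
    indexed = indexed_cost(m, n, d)
    print(f"d={d:>13,}  plain: {plain / RATE:>12,.0f} s   "
          f"indexed: {indexed / RATE:.2f} s")
```

The plain time grows tenfold with each row, while the indexed time grows by only about 0.03 seconds per row.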

## Index, gaps and indels

All previous discussion assumed exact match

It can be extended to partial match

Let’s say that there are at most 3 mismatches (indels or gaps) between query and the best subject

• Split the query into 4 parts
• At least one of the parts has zero mismatches
• Look up each part using the index
• See if the other parts match with indels and gaps

## In general

To find a partial match with at most $$k$$ mismatches

• Split the query into $$(k+1)$$ parts
• At least one of the parts will have an exact match
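The splitting idea can be sketched in Python as follows. This is an illustration under stated assumptions, not how any particular tool is implemented: the index is a plain dictionary from words (substrings of a fixed length) to the subjects that contain them, and all function names and sequences are hypothetical. The sketch covers only the first stage, finding candidate subjects; checking the remaining parts is left out.

```python
from collections import defaultdict


def split_query(query, parts):
    """Split query into `parts` contiguous pieces of nearly equal length."""
    if len(query) < parts:
        raise ValueError("query is shorter than the number of parts")
    size, extra = divmod(len(query), parts)
    pieces, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        pieces.append(query[start:end])
        start = end
    return pieces


def build_index(subjects, word_size):
    """Map every word of length word_size to the subjects containing it."""
    index = defaultdict(set)
    for name, seq in subjects.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].add(name)
    return index


def candidate_subjects(query, subjects, max_mismatches):
    """Subjects that contain at least one piece of the query exactly."""
    pieces = split_query(query, max_mismatches + 1)
    # One index per distinct piece length, each built once.
    indices = {size: build_index(subjects, size)
               for size in {len(p) for p in pieces}}
    candidates = set()
    for piece in pieces:
        candidates |= indices[len(piece)].get(piece, set())
    return candidates


subjects = {
    "s1": "ATGTAGGTAAGT",   # the query with 3 letters changed
    "s2": "TTTTTTTTTTTT",   # unrelated
}
print(candidate_subjects("ACGTACGTACGT", subjects, max_mismatches=3))
```

Only the candidates returned here need the slower comparison of the other parts, so most of the database is never aligned at all.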

## This is how BLAST works

BLAST uses indices to look for an initial hit

(sometimes called seed)

Then it tries to extend the hit by building a dot matrix around it

The key parameter is Word size

## Word size tradeoff

• Words of larger size are faster to search

• But they may miss subjects with many mismatches
• Small word size is more sensitive

• But it takes longer to find all the hits
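A toy Python illustration of the sensitivity side of this tradeoff. It counts the distinct words shared by a query and a related subject that differs in two letters. The sequences and function names are invented for the example, and this is not BLAST's actual seeding code.

```python
def words(sequence, word_size):
    """Set of all distinct words of length word_size in the sequence."""
    return {sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)}


def shared_words(query, subject, word_size):
    """Number of distinct words found in both sequences (possible seeds)."""
    return len(words(query, word_size) & words(subject, word_size))


query = "ACGTACGTTAGC"
related = "ACGAACGTCAGC"   # same as the query except at two positions

for word_size in (3, 4, 5):
    print(word_size, shared_words(query, related, word_size))
# prints:
# 3 3
# 4 1
# 5 0
```

With word size 5 there is no shared word, so the related subject would never be found. With word size 3 there are more seeds, and each of them costs time to extend.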

# Heuristics

## BLAST is a Heuristic

• BLAST can miss some cases
• It does not solve the initial question
• because the initial question is too hard
• It is a heuristic
• It gives the correct answer to a similar question
• It gives an approximate answer

## Algorithm vs. Heuristic

• An Algorithm is a protocol that always delivers a correct answer to a specific problem in a finite number of steps
• for example, when we divide two numbers
• This was first shown in Al-Khwarizmi’s book
• A Heuristic is a protocol that solves a simplified problem, or that gives an approximate answer at a lower cost than the complete algorithm
• For example, instead of asking everybody, we can take a small random sample
• Or computers playing chess

## Most Bioinformatics tools are heuristics

In particular, Multiple Alignment

## Summary

• Computational Cost:
• number of steps, depending on input size
• we only care about the order of the cost
• Big-O notation: forget about constants
• Indices speed up searches
• Indices can be used in partial matches
• BLAST is a heuristic
• Heuristics: approximate answers