# Bioinformatics

## How does spell check work?

When you write a text in any good text editor
(or even in Microsoft Word®),
some words are automatically underlined in red

Sometimes the editor can suggest the correct word

How does the computer do that?

## There are two approaches

One option is to define a rule, like a program, that tells whether a word is correct or not

$isCorrect: Words \to \{True, False\}$

That is hard, in general

The other option is to have a dictionary or database of known correct words, and a way to find the nearest word

$Nearest: (Words, Database) \to \{Words \in Database\}$

# Searching sequences is the same problem

## Searching a sequence

Let’s say we have a sequence and we want to identify it

Our sequence is called query

We look for it on a list of known sequences

The list is called database

Each sequence is a subject

## Let’s search some sequences

query 1: ASELLKYLTT
query 2: ASELLKALTT
query 3: ASELLKYALTT
query 4: ASELLKLTT

Do the search on your computer

## What do we mean by similar?

When are two sequences similar?

## We can calculate a distance between strings

First idea:

• We count the number of mismatches

• Both strings must have the same length

ASELLKYLTT
ASELLKALTT

Here distance(ASELLKYLTT,ASELLKALTT) is 1

## Hamming distance examples

How many substitutions do we need to go from one sequence to the other?

• CAT and CAT have a Hamming distance of 0
• CAT and BAT have a Hamming distance of 1
• CAT and BAG have a Hamming distance of 2
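The counting rule can be sketched in a few lines of Python. This is only an illustration; the function name `hamming` is our own choice, not part of any library.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("both strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("CAT", "CAT"))  # 0
print(hamming("CAT", "BAT"))  # 1
print(hamming("CAT", "BAG"))  # 2
print(hamming("ASELLKYLTT", "ASELLKALTT"))  # 1
```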

## How does database search work?

We look for the smallest distance

We calculate the distance between our query and each subject in the database

If the distance is 0, we have found a perfect match
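A minimal sketch of this search, assuming the distance is the Hamming distance and that subjects of a different length are simply skipped. The function `nearest` and the small database are invented for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(query: str, database: list[str]) -> tuple[str, int]:
    """Return the subject with the smallest distance to the query."""
    # Assumption: only subjects with the same length can be compared.
    candidates = [s for s in database if len(s) == len(query)]
    if not candidates:
        raise ValueError("no subject has the same length as the query")
    best = min(candidates, key=lambda s: hamming(query, s))
    return best, hamming(query, best)


# Hypothetical database, made up for this example.
database = ["ASELLKYLTT", "MKTAYIAKQR", "GSHMLEDPVD"]

print(nearest("ASELLKYLTT", database))  # ('ASELLKYLTT', 0), a perfect match
print(nearest("ASELLKALTT", database))  # ('ASELLKYLTT', 1)
```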

## Let’s make a Hamming distance calculator in Excel or Google Sheets

Let’s pause a minute and do this exercise

## When sequences have different length

In that case we need to insert “gaps”

MOUSE-
GROUSE

Hamming Distance=6

Gaps are represented by -

## There are many ways to insert gaps

The distance depends on the gap positions

-MOUSE
GROUSE

Hamming Distance=2

MOUSE--
-GROUSE

Hamming Distance=7

## So, which one is the distance?

In geometry we say that “the distance between a point X and a line L is the shortest length from X to any point in L”

## Same idea

If there are many “candidate distances”, we choose the smallest one

This is an important idea
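As an illustration, we can try every position for one gap in MOUSE, compute each candidate distance to GROUSE, and keep the smallest. The helper names are our own, and a gap is treated as just another character that never matches a letter.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def single_gap_variants(s: str) -> list[str]:
    """All ways to insert exactly one '-' into s."""
    return [s[:i] + "-" + s[i:] for i in range(len(s) + 1)]


# One candidate distance per gap position.
distances = {v: hamming(v, "GROUSE") for v in single_gap_variants("MOUSE")}
for variant, d in distances.items():
    print(variant, d)

# The distance is the smallest of the candidates.
print(min(distances.values()))  # 2
```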

## Now we also count gaps

Hamming distance counts substitutions between sequences

Now we count substitutions, insertions and deletions

## Hamming versus Levenshtein

Hamming distance counts substitutions

ABCDEFGHIJ
BCDEFGHIJA

Hamming distance=10

Levenshtein counts substitutions, insertions and deletions

ABCDEFGHIJ-
-BCDEFGHIJA

Levenshtein distance=2
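A possible way to compute this distance is the classic dynamic-programming table (often called the Wagner-Fischer algorithm). This is a sketch, not necessarily the method used later in the course, and it assumes every substitution, insertion and deletion costs 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions."""
    # previous[j] = distance between the letters of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string costs i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # match (free) or substitute
            ))
        previous = current
    return previous[-1]


print(levenshtein("ABCDEFGHIJ", "BCDEFGHIJA"))  # 2
print(levenshtein("MOUSE", "GROUSE"))           # 2
```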

## What is the best way to insert gaps?

If the sequence has $m$ letters, there are $2^{m+1}$ ways to insert one-letter gaps, because each of the $m+1$ positions either gets a gap or not

CAT
CAT-
CA-T
CA-T-
C-AT
C-AT-
C-A-T
C-A-T-
-CAT
-CAT-
-CA-T
-CA-T-
-C-AT
-C-AT-
-C-A-T
-C-A-T-
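The list can be generated by deciding, for each of the 4 positions around the letters of CAT, whether to put a gap there or not. The sketch below uses `itertools.product` from the Python standard library; `gap_placements` is a name chosen for this example.

```python
from itertools import product


def gap_placements(s: str) -> list[str]:
    """All ways to put at most one '-' before each letter and at the end."""
    results = []
    # One yes/no choice per position: len(s) + 1 positions in total.
    for choice in product([False, True], repeat=len(s) + 1):
        pieces = []
        for i, has_gap in enumerate(choice):
            if has_gap:
                pieces.append("-")
            if i < len(s):
                pieces.append(s[i])
        results.append("".join(pieces))
    return results


options = gap_placements("CAT")
print(len(options))  # 16
print(options)
```

The count doubles with every extra letter, so trying all options quickly becomes impractical.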

## And that is not even counting larger gaps

We could also do

C--A-----T

## There is a better way to find the best alignment

We will see more details later

The idea is to draw a rectangle

• Query in the rows
• Subject in the columns
• Mark cells where row letter equals column letter
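The three steps above can be sketched as a small table of True/False values. The function names and the choice of `#` for marked cells are assumptions made for this example.

```python
def dot_plot(query: str, subject: str) -> list[list[bool]]:
    """Rows follow the query, columns follow the subject."""
    return [[q == s for s in subject] for q in query]


def show(query: str, subject: str) -> str:
    """Draw the dot plot as text: '#' marks equal letters."""
    lines = ["  " + " ".join(subject)]
    for q, row in zip(query, dot_plot(query, subject)):
        lines.append(q + " " + " ".join("#" if cell else "." for cell in row))
    return "\n".join(lines)


print(show("MOUSE", "GROUSE"))
```

In the printed table the marked cells for O, U, S and E line up on one diagonal, which is the shared part of both words.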

## It looks like this

The goal is to move from one corner to the other

Jumping diagonally from black cell to black cell is free

Horizontal and vertical moves are gaps

## Exercise

Prepare a dot plot in Excel or Google Sheets

## Comparing a sequence with itself

This allows us to find repeats inside the sequence

## Palindromic sequences / Hairpins

A sequence is called palindromic when it repeats itself backwards (in DNA and RNA, as the reverse complement)

In RNA that results in a single-strand structure called a hairpin

Hairpins often act as transcription terminators
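A small check, assuming the molecular-biology convention that a palindromic sequence is equal to its own reverse complement. The example uses the DNA alphabet; for RNA, T would be replaced by U. The function names are our own.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna: str) -> str:
    """Read the sequence backwards and swap each base for its partner."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def is_palindromic(dna: str) -> bool:
    return dna == reverse_complement(dna)


print(reverse_complement("GATTAC"))  # GTAATC
print(is_palindromic("GAATTC"))      # True
print(is_palindromic("GATTAC"))      # False
```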

# Partial matching

## Levenshtein distance is Global Alignment

Useful to compare proteins

But not good when we look for a gene in a genome
or a domain in a protein

What shall we do when the query is much smaller than all subjects?

## Semi-global alignment

In this case we want to go from one side to the other

## Gaps in semi-global alignment

The query is much smaller than the subject

In this case we distinguish two kinds of gaps

• Internal gaps, inside the sequences

• External gaps, outside the query sequence

## External gaps do not count

External gaps are caused by the experiment

For example, we use PCR to copy only part of a genome

Therefore anything outside the sequence cannot be seen for technical reasons

## Internal gaps do count

Internal gaps are caused by nature

They are gained or lost during replication

They have biological meaning

Therefore, they must be counted
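One way to express "external gaps are free, internal gaps count" is a small change to the usual edit-distance table: the first row starts at zero and the answer is the smallest value in the last row. This is a sketch under the assumption that every substitution and internal gap costs 1; `semi_global_distance` is a name invented here.

```python
def semi_global_distance(query: str, subject: str) -> int:
    """Best distance between the whole query and any part of the subject."""
    # First row is all zeros: skipping the start of the subject is free.
    previous = [0] * (len(subject) + 1)
    for i, q in enumerate(query, start=1):
        current = [i]  # query letters with nothing to pair with still cost 1 each
        for j, s in enumerate(subject, start=1):
            current.append(min(
                previous[j] + 1,             # internal gap in the subject
                current[j - 1] + 1,          # internal gap in the query
                previous[j - 1] + (q != s),  # match (free) or substitution
            ))
        previous = current
    # Minimum over the last row: skipping the end of the subject is free too.
    return min(previous)


print(semi_global_distance("ELLK", "ASELLKYLTT"))  # 0
print(semi_global_distance("ELAK", "ASELLKYLTT"))  # 1
```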

## Summary

• Hamming distance
• Levenshtein distance
• Dot plot
• Global alignment
• Semi-global alignment