# Class 5: Optimal pairwise alignment

# Bioinformatics

## Andrés Aravena

### October 11, 2021

## How does spell check work?

When you write a text in any good text editor (or even in Microsoft Word®), some words are automatically underlined in red

These are words that are not found in the dictionary

Sometimes the editor can suggest the correct word

How does the computer do that?

## How do we find the suggestion?

One option is to define a rule, like a program, that tells whether a word is correct or not
\[
\text{isCorrect}: \text{Words} \to \{\text{True}, \text{False}\}
\]
That is hard, in general

## There is an easier way

Another option is to have a dictionary or *database* of known correct words, and a way to find the nearest word
\[
\text{Similar}: (\text{Words}, \text{Database}) \to \{\text{Words} \in \text{Database}\}
\]
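As a rough illustration of the *Similar* function, here is a minimal Python sketch. It relies on `difflib.get_close_matches` from the standard library, which uses its own similarity ratio rather than the distances defined in the following slides. The function name `suggest` and the toy word list are assumptions made up for this example.

```python
import difflib

def suggest(word, database):
    """Return the known word most similar to `word`, or None if nothing is close."""
    # get_close_matches returns the best matches first; we only ask for one
    matches = difflib.get_close_matches(word, database, n=1)
    return matches[0] if matches else None

# Toy database of known correct words (invented for this example)
known_words = ["spelling", "selling", "spilling"]

print(suggest("speling", known_words))
```

A real spell checker would use a much larger dictionary and a carefully chosen notion of distance.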

## What do we mean by *similar*?

When are two sequences *similar*?

## Calculating a *distance* between strings

First idea: count the positions where the two sequences differ

Here **distance(**`ASELLKYLTT`, `ASELLKALTT`**)** is 1

## This is called **Hamming** distance

## Hamming distance examples

How many *substitutions* do we need to go from one sequence to the other?

- `CAT` and `CAT` have a Hamming distance of 0
- `CAT` and `BAT` have a Hamming distance of 1
- `CAT` and `BAG` have a Hamming distance of 2
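The same counting can be sketched in a few lines of Python. The function name `hamming` is our own choice, and the sketch assumes both sequences have the same length.

```python
def hamming(a, b):
    """Count the positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    # zip pairs the letters position by position; True counts as 1 in sum()
    return sum(x != y for x, y in zip(a, b))

print(hamming("CAT", "CAT"))  # 0
print(hamming("CAT", "BAT"))  # 1
print(hamming("CAT", "BAG"))  # 2
```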

## Let’s make a Hamming distance calculator in Excel or Google Sheets

Let’s pause a minute and do this exercise

## What if sequences have different length?

In that case we need to insert *“gaps”*

```
MOUSE
GROUSE
```

## Placing two sequences face to face, inserting gaps if necessary, is called **Pairwise Alignment**

## There are many ways to insert gaps

The distance depends on the gap positions

```
-MOUSE
GROUSE
```

*Hamming Distance=2*

```
MOUSE--
-GROUSE
```

*Hamming Distance=7*

## So, which one is the distance?

In geometry we say that *“the distance between a point X and a line L is **the shortest** length from X to any point in L”*

## Same idea

If there are many “candidate distances”, we choose the smallest one

**This is an important idea**

**Distance** is the length of the shortest path
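To make "choose the smallest one" concrete, this small Python sketch scores the two MOUSE/GROUSE alignments shown earlier and keeps the minimum. It assumes a gap facing a letter counts as one mismatch, and it only looks at these two candidates, not at every possible alignment.

```python
def hamming(a, b):
    """Count mismatching positions; a gap '-' facing a letter is a mismatch."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# The two candidate alignments from the previous slides
candidates = [
    ("-MOUSE", "GROUSE"),
    ("MOUSE--", "-GROUSE"),
]

distances = [hamming(top, bottom) for top, bottom in candidates]
best = min(distances)

print(distances)  # [2, 7]
print(best)       # 2
```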

## Now we also count gaps

**Hamming** distance counts *substitutions* between sequences

Now we count *substitutions*, *insertions* and *deletions*

## This is called **Levenshtein** distance

## Hamming versus Levenshtein

**Hamming** distance counts *substitutions*

```
ABCDEFGHIJ
BCDEFGHIJA
```

*Hamming distance=10*

**Levenshtein** counts *substitutions*, *insertions* and *deletions*

```
ABCDEFGHIJ-
-BCDEFGHIJA
```

*Levenshtein distance=2*
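One common way to compute this distance is dynamic programming over prefixes of the two sequences. The sketch below is a minimal version of that standard technique, not necessarily the method developed later in the course. It assumes every substitution, insertion and deletion costs 1, and the function name `levenshtein` is our own.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # previous[j] holds the distance between the letters of a seen so far
    # (initially none) and the first j letters of b
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning i letters of a into an empty string: i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("ABCDEFGHIJ", "BCDEFGHIJA"))  # 2
print(levenshtein("MOUSE", "GROUSE"))           # 2
```

Unlike the Hamming distance, this works directly on sequences of different length, because gaps are handled by the insert and delete moves.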

## What is the best way to insert gaps?

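If the sequence has *m* letters, there are \(2^{m+1}\) ways to insert gaps of length at most one: each of the *m+1* positions (before, between and after the letters) either gets a gap or not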

```
CAT
CAT-
CA-T
CA-T-
C-AT
C-AT-
C-A-T
C-A-T-
```

```
-CAT
-CAT-
-CA-T
-CA-T-
-C-AT
-C-AT-
-C-A-T
-C-A-T-
```
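The list above can be generated automatically. This Python sketch uses `itertools.product` to decide, for each of the *m+1* positions, whether to place a gap there. The function name `single_gap_placements` is made up for this example.

```python
from itertools import product

def single_gap_placements(seq):
    """All ways to put at most one '-' in each of the len(seq)+1 positions."""
    slots = len(seq) + 1  # before each letter, plus one slot after the last letter
    results = []
    for choice in product([False, True], repeat=slots):
        pieces = []
        for i, gap in enumerate(choice):
            if gap:
                pieces.append("-")
            if i < len(seq):
                pieces.append(seq[i])
        results.append("".join(pieces))
    return results

placements = single_gap_placements("CAT")
print(len(placements))  # 16
print(placements)
```

The count doubles with every extra letter, so trying every placement quickly becomes impractical for real sequences.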

## And that is not even counting larger gaps

We could also do

`C--A-----T`

or

`----CA-T--`

or

`-C--A--T--`

## There is a better way to find the best alignment

We will see more details later

The idea is to draw a rectangle

- One sequence in the rows
- Another sequence in the columns
- Mark the cells where row letter and column letter are equal
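The rectangle described above can be sketched in Python as a grid of True/False values. The names `dot_matrix` and `show` are our own, and the text rendering with `#` for marked cells and `.` for the others is just one possible way to draw it.

```python
def dot_matrix(rows_seq, cols_seq):
    """Grid with True where the row letter equals the column letter."""
    return [[r == c for c in cols_seq] for r in rows_seq]

def show(rows_seq, cols_seq):
    """Render the grid as text: '#' for a marked cell, '.' otherwise."""
    lines = ["  " + " ".join(cols_seq)]  # header with the column letters
    for r, row in zip(rows_seq, dot_matrix(rows_seq, cols_seq)):
        lines.append(r + " " + " ".join("#" if cell else "." for cell in row))
    return "\n".join(lines)

print(show("MOUSE", "GROUSE"))
```

For `MOUSE` against `GROUSE`, the marked cells for `O`, `U`, `S` and `E` line up along a diagonal, which is the part the two words share.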

## It looks like this

The goal is to move from one corner to the other

Jumping black to black is free

Horizontal and vertical moves are gaps

## We move from one corner to the other

## White blocks add to the distance

## Summary

- Hamming distance
- Levenshtein distance
- Dot plot