# Bioinformatics

## Genetic code is a function

A function is a rule connecting each element of one set to exactly one element of another set

Sometimes they are written with formulas

Sometimes they are written case-by-case
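A small sketch of both styles, assuming Python. The names `square`, `CODON_TO_AA` and `amino_acid` are ours, chosen only for illustration, and the lookup table holds just four codons of the genetic code.

```python
# A function written with a formula: each number maps to exactly one number.
def square(x):
    return x * x

# A function written case-by-case: a lookup table listing every input.
# Only four codons of the genetic code are included in this illustration.
CODON_TO_AA = {"TTT": "F", "TTC": "F", "ATG": "M", "TGG": "W"}

def amino_acid(codon):
    return CODON_TO_AA[codon]

print(square(3))
print(amino_acid("ATG"))
```

Two inputs can share the same output (TTT and TTC both give F), but one input never gives two outputs.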

## Example: NCBI

There is a function connecting PubMed IDs to PubMed records

Each PubMed ID points to one, and only one, PubMed record

The same is true for the Nucleotide database and the Protein database

(With different sets in each case)

## Writing the genetic code

TTT F Phe      TCT S Ser      TAT Y Tyr      TGT C Cys
TTC F Phe      TCC S Ser      TAC Y Tyr      TGC C Cys
TTA L Leu      TCA S Ser      TAA * Ter      TGA * Ter
TTG L Leu i    TCG S Ser      TAG * Ter      TGG W Trp

CTT L Leu      CCT P Pro      CAT H His      CGT R Arg
CTC L Leu      CCC P Pro      CAC H His      CGC R Arg
CTA L Leu      CCA P Pro      CAA Q Gln      CGA R Arg
CTG L Leu i    CCG P Pro      CAG Q Gln      CGG R Arg

ATT I Ile      ACT T Thr      AAT N Asn      AGT S Ser
ATC I Ile      ACC T Thr      AAC N Asn      AGC S Ser
ATA I Ile      ACA T Thr      AAA K Lys      AGA R Arg
ATG M Met i    ACG T Thr      AAG K Lys      AGG R Arg

GTT V Val      GCT A Ala      GAT D Asp      GGT G Gly
GTC V Val      GCC A Ala      GAC D Asp      GGC G Gly
GTA V Val      GCA A Ala      GAA E Glu      GGA G Gly
GTG V Val      GCG A Ala      GAG E Glu      GGG G Gly 

## A shorter representation

For our purposes each genetic code is a vector indexed by codons

AAs    = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = ---M------**--*----M---------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

The symbol '*' represents a stop codon

The line starting with Starts = marks the codons that can act as start codons

(we will ignore the start codon information for now)
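A minimal sketch of how the short representation can be turned back into a codon table, assuming Python. The helper names `build_code` and `translate` are ours, and the test sequence is made up for the example.

```python
AAS   = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
BASE1 = "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG"
BASE2 = "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG"
BASE3 = "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"

def build_code(aas):
    """Map each codon to the amino acid at the same position in `aas`."""
    return {b1 + b2 + b3: aa
            for b1, b2, b3, aa in zip(BASE1, BASE2, BASE3, aas)}

def translate(dna, code):
    """Translate complete codons, reading from the first base."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(code[c] for c in codons)

standard = build_code(AAS)
print(standard["ATG"])
print(translate("ATGTTTTGGTAA", standard))
```

Position i of Base1, Base2 and Base3 spells the codon, and position i of AAs gives its amino acid. Because only the AAs string changes between genetic codes, the same `build_code` should work for any of them.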

## All genetic codes are similar

01     FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
33     FFLLSSSSYYY*CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSSKVVVVAAAADDEEGGGG

## What do we mean by similar?

When are two sequences similar?

## Genetic codes are different

We compare them by counting the positions where they differ

01 FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
02 FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG

How many symbols are different?
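One way to check the answer by machine, sketched in Python. The function name `differing_codons` is ours. It walks both codes position by position and reports the codon wherever the symbols differ.

```python
CODE_01 = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE_02 = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
BASE1   = "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG"
BASE2   = "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG"
BASE3   = "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"

def differing_codons(code_a, code_b):
    """Return (codon, symbol in a, symbol in b) for every mismatch."""
    diffs = []
    for b1, b2, b3, a, b in zip(BASE1, BASE2, BASE3, code_a, code_b):
        if a != b:
            diffs.append((b1 + b2 + b3, a, b))
    return diffs

for codon, a, b in differing_codons(CODE_01, CODE_02):
    print(codon, a, "->", b)
```

The length of the returned list is the number of different symbols.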

## Calculating a distance between strings

First idea:

• We count the number of mismatches

• Both strings must have the same length

ASELLKYLTT
ASELLKALTT

Here distance(ASELLKYLTT,ASELLKALTT) is 1
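The first idea can be sketched in a few lines of Python. The function name `hamming` is our choice. It refuses strings of different length, as the rule above requires.

```python
def hamming(a, b):
    """Count the positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("both strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("ASELLKYLTT", "ASELLKALTT"))  # prints 1
```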

## Hamming was a smart guy

• Manhattan Project
• programmed the IBM calculating machines
• Bell Telephone Laboratories
• involved in nearly all of Bell Labs' most prominent achievements
• Turing Award in 1968 (“Nobel Prize of Computer Science”)

## Hamming distance examples

The number of substitutions needed to go from one sequence to the other

• CAT and CAT have a Hamming distance of 0
• CAT and BAT have a Hamming distance of 1
• CAT and BAG have a Hamming distance of 2

## What if sequences have different length?

In that case we need to insert “gaps”

MOUSE
GROUSE

## There are many ways to insert gaps

The distance depends on the gap positions

-MOUSE
GROUSE

Hamming Distance = 2

MOUSE--
-GROUSE

Hamming Distance = 7

## So, which one is the distance?

In geometry we say that “the distance between a point X and a line L is the length of the shortest path from X to any point in L”

## Same idea

If there are many “candidate distances”, we choose the smallest one

This is an important idea
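A brute-force sketch of "choose the smallest one", assuming Python and the simplest case: one sequence is exactly one letter shorter, so a single gap is enough. The names `hamming` and `best_single_gap` are ours.

```python
def hamming(a, b):
    """Count mismatching positions; a gap never matches a letter."""
    if len(a) != len(b):
        raise ValueError("both strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

def best_single_gap(short, long_):
    """Try one gap at every position of `short`; keep the smallest distance."""
    if len(long_) != len(short) + 1:
        raise ValueError("`short` must be exactly one letter shorter")
    candidates = []
    for i in range(len(short) + 1):
        gapped = short[:i] + "-" + short[i:]
        candidates.append((hamming(gapped, long_), gapped))
    return min(candidates)

print(best_single_gap("MOUSE", "GROUSE"))
```

Every gap position gives a candidate distance, and `min` keeps the smallest. When two candidates tie, `min` picks the one whose gapped string sorts first.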

## Now we also count gaps

Hamming distance counts substitutions between sequences

Now we count substitutions, insertions and deletions

## Hamming versus Levenshtein

Hamming distance counts substitutions

ABCDEFGHIJ
BCDEFGHIJA

Hamming distance=10

Levenshtein distance counts substitutions, insertions and deletions

ABCDEFGHIJ-
-BCDEFGHIJA

Levenshtein distance=2
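For reference, a compact sketch of the usual dynamic-programming way to compute this distance, assuming Python. The function name `levenshtein` is ours, and this is the textbook table-filling method, not necessarily the approach developed later in this course.

```python
def levenshtein(a, b):
    """Smallest number of substitutions, insertions and deletions."""
    # prev[j] is the distance between the letters of `a` seen so far
    # (before the current one) and the first j letters of `b`.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning i letters into an empty string: i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free when x == y
            ))
        prev = curr
    return prev[-1]

print(levenshtein("ABCDEFGHIJ", "BCDEFGHIJA"))  # prints 2
```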

## What is the best way to insert gaps?

If the sequence has m letters, there are m+1 places where a gap can go, so there are $2^{m+1}$ ways to insert single gaps

CAT
CAT-
CA-T
CA-T-
C-AT
C-AT-
C-A-T
C-A-T-
-CAT
-CAT-
-CA-T
-CA-T-
-C-AT
-C-AT-
-C-A-T
-C-A-T-
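The list above can be generated mechanically. A sketch in Python, with the function name `single_gap_placements` chosen by us: each of the m+1 slots (before, between and after the letters) either gets a gap or not.

```python
from itertools import product

def single_gap_placements(seq):
    """All ways to put at most one gap in each slot around the letters."""
    slots = len(seq) + 1
    results = []
    for choice in product([False, True], repeat=slots):
        pieces = []
        for i, gap in enumerate(choice):
            if gap:
                pieces.append("-")      # gap in slot i
            if i < len(seq):
                pieces.append(seq[i])   # letter that follows slot i
        results.append("".join(pieces))
    return results

placements = single_gap_placements("CAT")
print(len(placements))
print(placements)
```

The count doubles with every extra letter, so trying all placements quickly becomes impractical.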

## And that is not even counting larger gaps

We could also do

C--A-----T

or

----CA-T--

or

-C--A--T--

## There is a better way to find the best

We will see more details later

The idea is to draw a rectangle

• One sequence in the rows
• Another sequence in the columns
• Mark the cells where row letter and column letter are equal
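A text-only sketch of such a rectangle, assuming Python. The names `dot_plot` and `show` are ours, and `#` stands for a marked cell.

```python
def dot_plot(rows, cols):
    """Matrix of booleans: True where the row and column letters are equal."""
    return [[r == c for c in cols] for r in rows]

def show(rows, cols):
    """Print the matrix with '#' for equal letters and '.' otherwise."""
    print("  " + " ".join(cols))
    for r, line in zip(rows, dot_plot(rows, cols)):
        print(r + " " + " ".join("#" if hit else "." for hit in line))

show("MOUSE", "GROUSE")
```

The shared letters O, U, S and E line up along a diagonal of marked cells.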

## It looks like this

The goal is to move from one corner to the other

Jumping diagonally from one black cell to the next is free

Horizontal and vertical moves are gaps

## Summary

• Hamming distance
• Levenshtein distance
• Dot plot