Class 6: Understanding BLAST


Andrés Aravena

November 9, 2023

Searching a database

We have one sequence, called the query

We compare the query with each sequence in a database

(sequences in the database are called subjects)

We get the score of each alignment

We report all subjects with score over a threshold
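
The search loop described above can be sketched in a few lines. This is only a toy illustration, not how BLAST itself is implemented: the scoring function `ungapped_score`, the match/mismatch values, the threshold, and the three subject sequences are all assumptions made up for the example.

```python
def ungapped_score(query, subject, match=1, mismatch=-1):
    """Best score of any ungapped local alignment (toy stand-in for a real aligner)."""
    best = 0
    # Try every relative shift of the query along the subject.
    for offset in range(-(len(query) - 1), len(subject)):
        running = 0
        for i, q in enumerate(query):
            j = i + offset
            if 0 <= j < len(subject):
                running += match if q == subject[j] else mismatch
                running = max(running, 0)  # a local alignment may restart anywhere
                best = max(best, running)
    return best


def search(query, database, threshold):
    """Score the query against every subject; keep subjects scoring over the threshold."""
    hits = []
    for subject_id, subject in database.items():
        score = ungapped_score(query, subject)
        if score > threshold:
            hits.append((subject_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up database of three subjects.
database = {"s1": "TTTACGTACGTTT", "s2": "GGGGGGGG", "s3": "AAACGTAAAA"}
hits = search("ACGTACG", database, threshold=4)
print(hits)
```

Only `s1` and `s3` pass the threshold here; `s2` shares single letters with the query but never scores high enough to be reported.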

Columns in tabular output

  • qseqid: query sequence id
  • sseqid: subject sequence id
  • pident: percentage of identical positions
  • length: alignment length (sequence overlap)
  • mismatch: number of mismatches
  • gapopen: number of gap openings
  • qstart: start of alignment in query
  • qend: end of alignment in query
  • sstart: start of alignment in subject
  • send: end of alignment in subject
  • evalue: expect value
  • bitscore: bit score
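
As a minimal sketch, the twelve columns listed above can be read into Python dictionaries. It assumes tab-separated text with the columns in exactly this order (as produced by the BLAST+ option `-outfmt 6`); the two result lines in `SAMPLE` are invented for illustration and do not come from a real search.

```python
import csv
import io

COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_COLUMNS = ["length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"]
FLOAT_COLUMNS = ["pident", "evalue", "bitscore"]


def parse_tabular(handle):
    """Return one dictionary per hit, with numeric columns converted."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for name in INT_COLUMNS:
            hit[name] = int(hit[name])
        for name in FLOAT_COLUMNS:
            hit[name] = float(hit[name])
        hits.append(hit)
    return hits


# Invented example lines (hypothetical ids and values).
SAMPLE = (
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t51\t250\t1e-100\t363\n"
    "query1\tsubjB\t80.00\t50\t10\t0\t10\t59\t300\t349\t0.5\t30.1\n"
)
hits = parse_tabular(io.StringIO(SAMPLE))

# Keep only hits with a small E-value.
significant = [hit for hit in hits if hit["evalue"] <= 1e-5]
print([hit["sseqid"] for hit in significant])
```

For a real result file, `parse_tabular(open(path))` would work the same way, where `path` is whatever file the search wrote.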

qseqid and sseqid

query sequence id

subject sequence id

pident and length

percentage of identical positions

  • What percentage of the letters are the same in the aligned region

alignment length

  • May be longer or shorter than the query or the subject

mismatch and gapopen

number of mismatches

  • How many letters are different in the aligned region

number of gap openings

  • How many gaps are opened, independent of their length
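
To make pident, length, mismatch and gapopen concrete, the sketch below counts them on one small gapped alignment. The two aligned strings are made up, "-" is assumed to mark a gap, and pident is assumed to be the number of identical columns divided by the alignment length.

```python
def alignment_stats(aligned_query, aligned_subject):
    """Count identities, mismatches and gap openings in a pairwise alignment."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    identical = mismatch = gapopen = 0
    previous_gap = None  # which row had a gap in the previous column, if any
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" or s == "-":
            current_gap = "query" if q == "-" else "subject"
            if current_gap != previous_gap:
                gapopen += 1  # a new gap starts; extending a gap is not counted
            previous_gap = current_gap
        else:
            previous_gap = None
            if q == s:
                identical += 1
            else:
                mismatch += 1
    length = len(aligned_query)  # alignment length includes gap columns
    return {
        "pident": 100 * identical / length,
        "length": length,
        "mismatch": mismatch,
        "gapopen": gapopen,
    }


stats = alignment_stats("ACGT--ACGTA",
                        "ACCTGGAC-TA")
print(stats)
```

In this example the alignment has 11 columns, 7 identical letters, 1 mismatch and 2 gap openings, even though 3 columns contain a gap.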

qstart and qend

start of alignment in query

end of alignment in query

sstart and send

start of alignment in subject

end of alignment in subject
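
The start and end columns can be turned into a covered region, as in the sketch below. It assumes positions are 1-based and inclusive, and that in nucleotide searches a hit on the reverse strand is reported with sstart larger than send. The helper names `covered_span` and `query_coverage` are hypothetical.

```python
def covered_span(start, end):
    """Return (lowest position, highest position, span length, strand)."""
    low, high = min(start, end), max(start, end)
    strand = "plus" if start <= end else "minus"
    return low, high, high - low + 1, strand


def query_coverage(qstart, qend, query_length):
    """Percentage of the query that takes part in the alignment."""
    _, _, span, _ = covered_span(qstart, qend)
    return 100 * span / query_length


print(covered_span(51, 250))         # forward-strand hit
print(covered_span(900, 701))        # reversed coordinates: minus strand
print(query_coverage(1, 200, 400))   # half of a 400-letter query is aligned
```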

evalue and bitscore

expect value

bit score

Alignment score depends on Substitution matrix

Score can change

If mismatches and gaps have different costs, the score will change

Sometimes the optimal alignment changes

Therefore an alignment is meaningless unless we know the scoring matrix used

Later we will discuss how to choose the “best” scoring matrix for each case
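The effect of the scoring parameters can be sketched with two candidate alignments of the same pair of sequences. Everything here is an assumption chosen for illustration: a simple match/mismatch score instead of a full substitution matrix, a fixed cost per gap column, and the two made-up alignments.

```python
def score_alignment(aligned_query, aligned_subject, match, mismatch, gap):
    """Sum the column scores of one alignment ('-' marks a gap column)."""
    total = 0
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" or s == "-":
            total += gap
        elif q == s:
            total += match
        else:
            total += mismatch
    return total


# Two ways of aligning ACGTTACG with ACGTACG.
gapped = ("ACGTTACG", "ACG-TACG")   # 7 matches and 1 gap
ungapped = ("ACGT", "ACGT")         # 4 matches, no gap

cheap_gaps = {"match": 1, "mismatch": -1, "gap": -1}
costly_gaps = {"match": 1, "mismatch": -1, "gap": -5}

for name, scheme in (("cheap gaps", cheap_gaps), ("costly gaps", costly_gaps)):
    print(name,
          score_alignment(*gapped, **scheme),
          score_alignment(*ungapped, **scheme))
```

With cheap gaps the gapped alignment wins; with costly gaps the shorter ungapped one has the higher score, so the optimal alignment itself changes.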


What is “a good score”?

We want big scores

How big is big enough?

We need to make several hypotheses

The most common hypothesis is statistical

Larger scores, fewer hits

A hit is a subject with score over a threshold

Larger score thresholds give fewer hits

We can estimate the number of hits in a given database, assuming randomness

That is called the Expected value

Expected value as a threshold

In practice, we choose a small Expected value

(usually called E-value)

Something like \(10^{-5}\) or \(10^{-20}\)

Then what we find is unlikely to be random
and maybe it is biologically meaningful

E-value depends on the database

The formula for E-value depends on

  • The score \(S\)
  • The query size \(m\)
  • The database size \(n\)
  • The substitution scoring matrix, via \(k\) and \(λ\)

\[E=kmn\exp(-λ S)\]

The same alignment in different databases has a different E-value
but the same score

Use the smallest relevant database
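
The formula \(E=kmn\exp(-λ S)\) can be evaluated directly to see the effect of database size. In this sketch the values of `k` and `lam` are illustrative placeholders, not the constants of any particular scoring matrix, and the score, query size and database sizes are also made up.

```python
import math


def expected_hits(score, query_size, database_size, k, lam):
    """E = k * m * n * exp(-lambda * S), as in the formula above."""
    return k * query_size * database_size * math.exp(-lam * score)


# Placeholder constants, chosen only for illustration.
K, LAM = 0.04, 0.27

score = 100          # the alignment score S
query_size = 300     # m

small_db = 1_000_000        # n for a small database
large_db = 1_000_000_000    # n for a database 1000 times larger

e_small = expected_hits(score, query_size, small_db, K, LAM)
e_large = expected_hits(score, query_size, large_db, K, LAM)
print(e_small, e_large, e_large / e_small)
```

The score is the same in both cases, but the E-value grows in proportion to the database size, which is why a smaller relevant database gives more significant hits.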

Many flavors of BLAST

Types of BLAST

Depending on the alphabet of the query and subject

blastn: Search nucleotides in nucleotide databases
blastp: Search proteins in protein databases
blastx: Search nucleotides in protein databases.
Each query is translated into 6 putative proteins

Types of BLAST

tblastn: Search proteins in nucleotide databases.
Each subject is translated into 6 putative proteins
tblastx: Search nucleotides in nucleotide databases.
Each query and each subject is translated into 6 putative proteins,
and all the resulting proteins are compared
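
The "6 putative proteins" come from reading the nucleotide sequence in three frames on each strand. The sketch below shows the idea, assuming the standard genetic code and an unambiguous A/C/G/T sequence (anything else is translated as "X"). The frame labels "+1" to "-3" and the example sequence are made up.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' is a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return the 6 putative proteins: 3 frames on each strand."""
    dna = dna.upper()
    frames = {}
    for strand_name, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand_name}{offset + 1}"] = translate(strand[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```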

NCBI protein databases

Non-redundant protein sequences
Reference proteins
Reference Select proteins

What is “Non-Redundant”?

These databases get data from several sources

Sometimes two people upload the same sequence but with different ID

For example, EMBL ID, GenBank ID, RefSeq ID, etc.

This database combines all identical entries into one, and keeps all the alternative IDs
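
The merging step can be sketched as grouping records by their sequence. This is only an illustration of the idea, not NCBI's actual procedure; the ids and sequences below are invented.

```python
from collections import defaultdict


def merge_identical(records):
    """Combine records with identical sequences, keeping every alternative id."""
    ids_by_sequence = defaultdict(list)
    for seq_id, sequence in records:
        ids_by_sequence[sequence.upper()].append(seq_id)
    return [(ids, sequence) for sequence, ids in ids_by_sequence.items()]


# Invented records: the first and third carry the same protein sequence.
records = [
    ("EMBL:X0001", "MKTAYIAKQR"),
    ("GenBank:Y0002", "MSTNPKPQRK"),
    ("RefSeq:Z0003", "MKTAYIAKQR"),
]
non_redundant = merge_identical(records)
for ids, sequence in non_redundant:
    print(ids, sequence)
```

Three uploaded records become two entries, and the first entry keeps both of its ids.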

NCBI protein databases

Model Organisms
Patented protein sequences

NCBI protein databases

Protein Data Bank proteins
Metagenomic proteins
Transcriptome Shotgun Assembly proteins

NCBI nucleotide databases

Human G+T: Human genomic plus transcript
Mouse G+T: Mouse genomic plus transcript
Nucleotide collection

NCBI nucleotide databases

Bacteria and Archaea
16S ribosomal RNA sequences
Reference Select sequences
Reference RNA sequences

NCBI nucleotide databases

RefSeq Representative genomes
RefSeq Genome Database

NCBI nucleotide (reads)

Sequence Read Archive
Transcriptome Shotgun Assembly
High throughput genomic sequences

NCBI nucleotide databases

Patent sequences
nucleotides in Protein Data Bank
Human RefSeqGene sequences

BlastN variants

megablast: Highly similar sequences
discontiguous megablast: More dissimilar sequences
blastn: Somewhat similar sequences

BlastP variants

blastp: protein-protein BLAST.
PSI-BLAST: Position-Specific Iterated BLAST.
Builds a position-specific scoring matrix (PSSM).
PHI-BLAST: Pattern Hit Initiated BLAST.
Limits alignments to those that match a pattern in the query.

BlastP variants

Quick BLASTP: Accelerated protein-protein BLAST.
Very fast, and works best if the target percent identity is 50% or more.
DELTA-BLAST: Domain Enhanced Lookup Time Accelerated BLAST.
Builds a PSSM using a Conserved Domain Database search,
then uses it to search a sequence database.