November 23, 2018

Using BLAST

There are two ways of using BLAST

  • Going to NCBI’s website: https://blast.ncbi.nlm.nih.gov/
    • Runs on NCBI servers with NCBI databases
  • Using a command line version
    • Runs on our server with our databases
    • Can also send jobs to NCBI servers, using NCBI databases
    • Download it from the NCBI website
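
As a rough sketch of the command-line route, the snippet below only builds the argument list for a BLAST+ program; it does not run anything. The helper name `blast_command`, the file names `query.fasta` and `hits.tsv`, and the database name `our_db` are placeholders for illustration. The `-remote` flag asks BLAST+ to send the search to NCBI servers instead of using a local database.

```python
def blast_command(program, query, db, out, remote=False):
    """Build the argument list for a BLAST+ search (hypothetical helper)."""
    # -outfmt 6 requests tab-separated output, which is easy to parse later.
    cmd = [program, "-query", query, "-db", db, "-out", out, "-outfmt", "6"]
    if remote:
        # Run on NCBI servers with NCBI databases instead of locally.
        cmd.append("-remote")
    return cmd


# Local search against our own database (placeholder names).
local = blast_command("blastn", "query.fasta", "our_db", "hits.tsv")

# The same kind of search sent to NCBI, against the nt collection.
remote = blast_command("blastn", "query.fasta", "nt", "hits.tsv", remote=True)
print(" ".join(remote))
```

Once BLAST+ is installed, a list like this can be passed to `subprocess.run(cmd, check=True)`.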

For today we will use the NCBI web page

Types of BLAST

BlastN
Search nucleotides in nucleotide databases
BlastP
Search proteins in protein databases
BlastX
Search nucleotides in protein databases.
Each query is translated in all 6 reading frames, giving 6 putative proteins
TBlastN
Search proteins in nucleotide databases.
Each subject is translated in all 6 reading frames, giving 6 putative proteins
TBlastX
Search nucleotides in nucleotide databases, comparing them as proteins
Translates each query and each subject into 6 putative proteins
Compares all the resulting proteins
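
The "6 putative proteins" come from reading the sequence in three frames on each strand. The following is a minimal sketch of that six-frame translation, assuming the standard genetic code and an input containing only A, C, G and T. The function names are illustrative, and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(seq):
    """Translate a nucleotide sequence in frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

For the toy input `ATGGCCTAA`, frame +1 reads `ATG GCC TAA` and gives `MA*`, where `*` marks a stop codon.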

BlastN variants

megablast
Highly similar sequences
discontiguous megablast
More dissimilar sequences
blastn
Somewhat similar sequences
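
In the command-line version, these three variants are selected with the `-task` option of the `blastn` program. The sketch below maps the names used on the web page to the `-task` values and builds an argument list. The helper `blastn_args`, the file name `query.fasta`, and the database name `our_db` are placeholders for illustration.

```python
# Web-page name -> value of the blastn "-task" option.
BLASTN_TASKS = {
    "megablast": "megablast",                   # highly similar sequences
    "discontiguous megablast": "dc-megablast",  # more dissimilar sequences
    "blastn": "blastn",                         # somewhat similar sequences
}


def blastn_args(variant, query, db):
    """Build a blastn argument list for one variant (hypothetical helper)."""
    task = BLASTN_TASKS[variant]
    return ["blastn", "-task", task, "-query", query, "-db", db]


print(" ".join(blastn_args("discontiguous megablast", "query.fasta", "our_db")))
```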

BlastP variants

blastp
protein-protein BLAST.
PSI-BLAST
Position-Specific Iterated BLAST.
Builds a position-specific scoring matrix (PSSM) from the hits and uses it in the next iteration.
PHI-BLAST
Pattern Hit Initiated BLAST.
Limits alignments to those that match a pattern in the query.
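
To make the idea of a position-specific scoring matrix concrete, here is a deliberately simplified sketch. It assumes an ungapped alignment, a uniform background frequency for the 20 amino acids, and a fixed pseudocount. PSI-BLAST itself uses sequence weighting and more careful pseudocounts, so this illustrates the concept only and is not the real algorithm.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Sum the column scores of a sequence as long as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


pssm = build_pssm(["ACD", "ACE", "ACD"])
print(score(pssm, "ACD"), score(pssm, "WWW"))
```

Residues that are frequent in a column get positive scores and rare ones get negative scores, so the same amino acid can score differently at different positions.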

BlastP variants

Quick BLASTP
Accelerated protein-protein BLAST.
Very fast; works best when the target percent identity is 50% or more.
DELTA-BLAST
Domain Enhanced Lookup Time Accelerated BLAST.
Builds a PSSM using a Conserved Domain Database search,
then uses that PSSM to search a sequence database.

Homework

  • Write a document explaining the details of BLAST algorithms
  • Each student takes a different algorithm
    • megablast and discontiguous megablast
    • PSI-BLAST
    • PHI-BLAST
    • Quick BLASTP
    • DELTA-BLAST
    • and …

Filters & Masking

Explain what these options are

  • Low complexity regions filter
  • Mask for lookup table only
  • Mask lower case letters
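
As a starting point for the last option: many sequence files use lower case letters to mark regions that should be masked, a convention often called soft masking. The sketch below, with illustrative function names, finds those lower case regions and shows the alternative of replacing them with `N` (hard masking) for a nucleotide sequence. It only illustrates the convention, not what BLAST does internally.

```python
import re


def lowercase_intervals(seq):
    """Return (start, end) pairs of lower case runs, end exclusive."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


def hard_mask(seq):
    """Replace every lower case letter with N."""
    return re.sub(r"[a-z]", "N", seq)


example = "ACGTacgtACgt"
print(lowercase_intervals(example))
print(hard_mask(example))
```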

Rules

Understand and explain

  • why the algorithm is useful
  • how it works
  • how it differs from standard BLAST
  • Read the associated paper

Use your own words: do not copy and paste

Give proper references when citing someone else’s work

Back to our class

NCBI protein databases

nr
Non-redundant protein sequences
refseq_protein
Reference proteins
landmark
Model Organisms
swissprot
UniProtKB/Swiss-Prot

NCBI protein databases

pat
Patented protein sequences
pdb
Protein Data Bank proteins
env_nr
Metagenomic proteins
tsa_nr
Transcriptome Shotgun Assembly proteins

NCBI nucleotide databases

Human G+T
Human genomic plus transcript
Mouse G+T
Mouse genomic plus transcript
nr/nt
Nucleotide collection
Bacteria and Archaea
16S ribosomal RNA sequences

NCBI nucleotide databases

refseq_rna
Reference RNA sequences
refseq_representative_genomes
RefSeq Representative genomes
refseq_genomes
RefSeq Genome Database
wgs
Whole-genome shotgun contigs

NCBI nucleotide databases

est
Expressed sequence tags
SRA
Sequence Read Archive
TSA
Transcriptome Shotgun Assembly
HTGS
High throughput genomic sequences

NCBI nucleotide databases

pat
Patent sequences
pdb
Protein Data Bank
refseq_genomic
Reference genomic sequences
RefSeq_Gene
Human RefSeqGene sequences

NCBI nucleotide databases

gss
Genomic survey sequences
dbsts
Sequence tagged sites

Let’s test it

Let’s look for Teje’s data

This is a 16S gene amplified by PCR

NCBI taxonomy

There is no “official” taxonomy. People are still figuring out many cases

NCBI has an “unofficial” taxonomy tree

Trees have a root, branches, internal nodes and leaves

Edges (branches) connect nodes

Each node (except the root) has one unique parent node

A node can have several descendants. If a node has no descendants, we call it a leaf
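
These properties can be sketched with a toy tree stored as a "child to parent" table. Because each node has exactly one entry, it has exactly one parent; the root is the only node whose parent is `None`. The node names here are made up for illustration.

```python
from collections import defaultdict

# Toy tree: each node maps to its single parent; the root has none.
parent = {
    "root": None,
    "A": "root",
    "B": "root",
    "A1": "A",
    "A2": "A",
}


def children_of(parent):
    """Invert the child -> parent table into parent -> list of children."""
    children = defaultdict(list)
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)
    return children


def leaves(parent):
    """Leaves are the nodes that are nobody's parent."""
    has_children = {par for par in parent.values() if par is not None}
    return sorted(node for node in parent if node not in has_children)


print(children_of(parent)["A"])
print(leaves(parent))
```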

NCBI taxonomy

Each node has

  • A unique id called taxid
  • An official scientific name (in the style Linnaeus introduced)
  • Any alternative names (aliases) by which the organism is known
  • The taxid of the parent
  • Zero or more descendants
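
The fields above can be sketched as a small record type. In this sketch the class `TaxNode` and the functions are illustrative, the root is given no parent, and the example tree is heavily shortened: the intermediate ranks between Bacteria and Escherichia coli are left out on purpose. The taxids are quoted from memory and should be checked against the NCBI Taxonomy site before being relied on.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaxNode:
    taxid: int
    scientific_name: str
    parent_taxid: Optional[int]  # None for the root in this sketch
    aliases: list = field(default_factory=list)


# Shortened toy tree (assumption: intermediate ranks omitted).
nodes = {
    node.taxid: node
    for node in [
        TaxNode(1, "root", None),
        TaxNode(131567, "cellular organisms", 1),
        TaxNode(2, "Bacteria", 131567),
        TaxNode(562, "Escherichia coli", 2, aliases=["E. coli"]),
    ]
}


def lineage(nodes, taxid):
    """Follow parent taxids up to the root; return names from root down."""
    path = []
    current = taxid
    while current is not None:
        node = nodes[current]
        path.append(node.scientific_name)
        current = node.parent_taxid
    return path[::-1]


def descendants(nodes, taxid):
    """Direct descendants are the nodes whose parent is this taxid."""
    return sorted(n.taxid for n in nodes.values() if n.parent_taxid == taxid)


print(" > ".join(lineage(nodes, 562)))
```

Looking up an organism by taxid rather than by name avoids confusion between aliases, since every name and alias points to one node.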

Using NCBI taxid prevents many errors