Class 2: Finding data online


Andrés Aravena

September 30, 2021

We need data

International Nucleotide Sequence Database Collaboration

There are three large data repositories

  • National Center for Biotechnology Information, NCBI
    • National Library of Medicine
      • National Institutes of Health, USA
  • European Bioinformatics Institute, EMBL-EBI
    • European Molecular Biology Laboratory
  • DNA Data Bank of Japan (DDBJ)
    • National Institute of Genetics (NIG) Japan

They all have the same data

These three databases exchange all sequence data,
but they may structure it differently

All data is available for free

Research paid for with public money must be uploaded here

Good journals also require authors to upload their data

The NCBI website

GenBank and RefSeq

GenBank
genetic sequence database, an annotated collection of all publicly available DNA sequences. Anybody can upload directly.
RefSeq
curated subset of GenBank

DNA Databases

most of the sequence data from GenBank, except environmental
sequencing data from the next generation sequencing platforms

Protein Databases

Protein
amino acid sequences translated from the coding regions of nucleotide records in GenBank, plus sequences imported from outside data sources (PIR, UniProtKB/Swiss-Prot, Protein Data Bank)
Protein Clusters
collection of related protein sequences (clusters) consisting of Reference Sequence proteins encoded by complete prokaryotic genomes, eukaryotic organelle plasmids and genomes.

Protein Databases

Conserved Domains
sequence alignments and profiles representing protein domains conserved in evolution. It includes alignments of the domains to known three-dimensional protein structures.
Molecular Modeling Database (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB)

Gene Expression Omnibus

GEO Datasets
gene expression data sets from the Gene Expression Omnibus (GEO) repository of microarray data
GEO Profiles
individual gene expression profiles assembled from GEO
Probe
nucleic acid reagents designed for use in a wide variety of biomedical research applications including genotyping, gene expression studies, SNP discovery, genome mapping, and gene silencing


Taxonomy
names and phylogenetic lineages of the more than 350,000 species that have molecular data in the NCBI databases
MeSH (Medical Subject Headings)
controlled vocabulary and classification system (ontology) used for indexing articles in PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts


PubMed
database of citations and abstracts for biomedical literature from MEDLINE and additional life science journals
PubMed Central
(PMC) is the U.S. National Library of Medicine's digital archive of life sciences journal literature. PMC contains full-text manuscripts deposited by authors or articles provided by the publisher


Books
full-text books that can be searched online and that are linked to PubMed records
NCBI Web Site Search
database of static NCBI web pages, documentation, and online tools
NLM Catalog
records for books, journals, audiovisuals, computer software, electronic resources, and other materials in the National Library of Medicine (NLM) collections


Assembly
genome assemblies. The same genome can have several versions
Gene
genes from completely sequenced genomes that have an active research community contributing gene-specific data
Genome
sequence and map data from whole genomes, both completely sequenced and with sequencing in progress


BioProject
complete and incomplete (in-progress) large-scale molecular projects including genome sequencing and assembly, transcriptome, metagenomic, annotation, expression and mapping projects.
BioSample
descriptions of biological source materials used in studies that have data in other NCBI molecular databases such as Assembly, Nucleotide and SRA.

Other databases

  • MedGen
  • ClinVar
  • OMIM
  • PopSet
  • PubChem BioAssay
  • PubChem Compound
  • PubChem Substance
  • UniGene
  • GTR
  • BioSystems

Searching NCBI

“Clipboard” and “My Collections”

The Clipboard is a temporary place on the NCBI website to save records.

  • limited to 500 items on each database
  • lost after eight hours of inactivity

My Collections, part of the My NCBI service, is a more permanent place to save records.

You need to create an NCBI account to use My NCBI. It is easy and free

Pre-computed answers

There are two major kinds of relationships in the NCBI website:

  • computationally derived associations within a database (neighbors)
  • relationships based on information present on the records themselves (hard links)

Combining neighbors and hard links can be an especially effective method for navigating across data and finding the most useful information


NCBI Entrez queries

NCBI search has many more options than Google

(do you know Google's search options?)

By default the query text is searched in any part of any database

But you can specify the fields to search in

  • Title of a paper
  • author
  • date
  • taxonomic id

Entrez Examples

protease NOT hiv1[organism]
This limits the search to all proteases except those in HIV-1.
1000:2000[slen]
This limits the search to entries with lengths between 1000 and 2000 bases for nucleotide entries, or between 1000 and 2000 residues for protein entries.

Entrez Examples

Mus musculus[organism] AND biomol_mrna[properties]
This limits the search to mouse mRNA entries in the database. For common organisms, one can also select from the pulldown menu.

Entrez Examples

10000:100000[mlwt]
This limits the search to protein sequences with a calculated molecular weight between 10 kD and 100 kD.
src specimen voucher[properties]
This limits the search to entries that are annotated with a /specimen_voucher qualifier on the source feature.

Entrez Examples

all[filter] NOT environmental sample[filter] NOT metagenomes[orgn]
This excludes sequences from metagenome studies and uncultured sequences from anonymous environmental sample studies

Creating advanced queries

Quotes " are important

The fields are written inside brackets []

Each database page includes an Advanced Search option

Combining queries

Entrez queries can be single words, short phrases, sentences, database identifiers, gene symbols, or names

AND: Finds documents that contain the terms on both sides of the operator. The intersection of both searches.

OR: Finds documents that contain either term. The union of both searches.

NOT: Finds documents that contain the term on the left but not the term on the right of the operator. The subtraction of the right side from the left side


AND must be in uppercase. It is recommended to also use uppercase for OR and NOT

  • Operators are processed left-to-right

      promoters OR response elements NOT human AND mammals
  • Parentheses can be used to control the evaluation order

      g1p3 AND (response element OR promoter)
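
Since AND, OR and NOT are the intersection, union and subtraction of search results, the effect of left-to-right processing can be imitated with Python sets. This is only a toy model: the record numbers in hits are invented, and evaluate_left_to_right is a hypothetical helper, not part of any NCBI tool.

```python
# Toy "database": each search term maps to the set of record ids that match it.
# The ids are invented for illustration.
hits = {
    "promoters": {1, 2, 3},
    "response elements": {3, 4},
    "human": {2, 4, 5},
    "mammals": {1, 2, 4, 6},
}

OPERATORS = {
    "AND": lambda left, right: left & right,  # intersection
    "OR": lambda left, right: left | right,   # union
    "NOT": lambda left, right: left - right,  # subtraction
}


def evaluate_left_to_right(tokens):
    """Apply operators strictly left to right.

    tokens alternates between sets and operator names: set, op, set, op, set...
    """
    result = tokens[0]
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        result = OPERATORS[op](result, operand)
    return result


# promoters OR response elements NOT human AND mammals
flat = evaluate_left_to_right([
    hits["promoters"], "OR", hits["response elements"],
    "NOT", hits["human"], "AND", hits["mammals"],
])

# promoters OR (response elements NOT human AND mammals)
inner = evaluate_left_to_right([
    hits["response elements"], "NOT", hits["human"], "AND", hits["mammals"],
])
grouped = evaluate_left_to_right([hits["promoters"], "OR", inner])

print(sorted(flat))     # [1]
print(sorted(grouped))  # [1, 2, 3]
```

The same four terms give different answers depending on where the parentheses are, which is why it is safer to write them explicitly in long queries.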

Dates and Other Ranges

  • Certain fields can accept ranges of values

    • Publication Date, Modification Date, Accession, Molecular Weight, and Sequence Length
  • Low and high numbers are entered with a colon “:” between them followed by the field

      110:500[Sequence Length]
      2015/3/1:2016/4/30[Publication Date]
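
Range terms follow a fixed pattern, so they are easy to generate. The sketch below uses two hypothetical helpers, range_term and date_range_term. It writes dates with zero-padded months and days (2015/03/01), on the assumption that Entrez reads them the same way as the shorter form shown above.

```python
from datetime import date


def range_term(low, high, field):
    """Format an Entrez range term as low:high[field]."""
    # Works for numbers and for zero-padded date strings,
    # because such strings sort in chronological order.
    if low > high:
        raise ValueError("the low value must not be greater than the high value")
    return f"{low}:{high}[{field}]"


def date_range_term(start, end, field="Publication Date"):
    """Format a date range using YYYY/MM/DD dates."""
    return range_term(start.strftime("%Y/%m/%d"), end.strftime("%Y/%m/%d"), field)


print(range_term(110, 500, "Sequence Length"))
# 110:500[Sequence Length]
print(date_range_term(date(2015, 3, 1), date(2016, 4, 30)))
# 2015/03/01:2016/04/30[Publication Date]
```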

NCBI online documentation

NCBI's own public documentation offers a different explanation: Mod_Workshops/2016/June_UWashington/ Workshop1_Navigating_NCBI/

All documents made by NCBI are in the public domain