Class 2: Finding data online


Andrés Aravena

23 October 2020

We need data

International Nucleotide Sequence Database Collaboration

There are three large data repositories

  • National Center for Biotechnology Information, NCBI
    • National Library of Medicine
      • National Institutes of Health, USA
  • European Bioinformatics Institute, EMBL-EBI
    • European Molecular Biology Laboratory
  • DNA Data Bank of Japan (DDBJ)
    • National Institute of Genetics (NIG) Japan

They all have the same data

These three databases exchange all sequence data with each other,
but each may organize it in a different structure

All data is available for free

Research paid for with public money must be uploaded here

Good journals also require authors to upload their data

The NCBI website

GenBank and RefSeq

GenBank
genetic sequence database: an annotated collection of all publicly available DNA sequences. Anybody can upload directly.
RefSeq
curated subset of GenBank
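When a list of accession numbers mixes GenBank and RefSeq records, the two can usually be told apart by their shape: RefSeq accessions start with a two-letter prefix followed by an underscore (for example NM_, NP_ or NC_), while GenBank accessions have no underscore. The sketch below is only a heuristic built on that convention. The names `is_refseq_accession` and `split_version` are hypothetical helpers written for this illustration, not part of any NCBI tool.

```python
import re

# Heuristic pattern: two uppercase letters, an underscore, then letters
# or digits, with an optional ".version" suffix (e.g. NM_000546.6).
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")


def is_refseq_accession(accession: str) -> bool:
    """Return True if the accession looks like a RefSeq identifier."""
    return REFSEQ_PATTERN.match(accession.strip()) is not None


def split_version(accession: str):
    """Split 'NC_000913.3' into ('NC_000913', 3); version is None if absent."""
    base, _, version = accession.strip().partition(".")
    return base, (int(version) if version else None)


if __name__ == "__main__":
    for acc in ["NM_000546.6", "U00096.3", "NC_000913.3"]:
        kind = "RefSeq" if is_refseq_accession(acc) else "GenBank (or other)"
        print(acc, kind, split_version(acc))
```

This only inspects the text of the accession. It does not check that the record actually exists in either database.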

DNA Databases

Nucleotide
most of the sequence data from GenBank, except environmental
SRA
sequencing data from the next generation sequencing platforms

Protein Databases

Protein
amino acid sequences from the translations of coding regions provided on nucleotide records in GenBank, plus sequences imported from outside data sources (PIR, UniProtKB/Swiss-Prot, Protein Data Bank)
Protein Clusters
collection of related protein sequences (clusters) consisting of Reference Sequence proteins encoded by complete prokaryotic genomes, eukaryotic organelle plasmids and genomes.

Protein Databases

Conserved Domains
protein domains conserved in evolution, represented by sequence alignments and profiles. It includes alignments of the domains to known three-dimensional protein structures.
Molecular Modeling Database (MMDB)
experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB)

Gene Expression Omnibus

GEO Datasets
gene expression data sets from the Gene Expression Omnibus (GEO) repository of microarray data
GEO Profiles
individual gene expression profiles assembled from GEO
Probe
nucleic acid reagents designed for use in a wide variety of biomedical research applications including genotyping, gene expression studies, SNP discovery, genome mapping, and gene silencing


Taxonomy
names and phylogenetic lineages of the more than 350,000 species that have molecular data in the NCBI databases
MeSH (Medical Subject Headings)
controlled vocabulary and classification system (ontology) used for indexing articles in PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts


PubMed
database of citations and abstracts for biomedical literature from MEDLINE and additional life science journals
PubMed Central
(PMC) is the U.S. National Library of Medicine's digital archive of life sciences journal literature. PMC contains full-text manuscripts deposited by authors or articles provided by the publisher


Books
full-text books that can be searched online and that are linked to PubMed records
NCBI Web Site Search
database of static NCBI web pages, documentation, and online tools
NLM Catalog
records for books, journals, audiovisuals, computer software, electronic resources, and other materials in the National Library of Medicine (NLM) collections


Assembly
genome assemblies. The same genome can have several versions
Gene
genes from genomes that are completely sequenced and that have an active research community contributing gene-specific data
Genome
sequence and map data from whole genomes. They represent both completely sequenced genomes and those with sequencing in progress


EST
(Expressed Sequence Tag) sequences from GenBank. Typically short single-pass reads from cDNA libraries generated in survey projects
GSS
(Genome Survey Sequence) records from GenBank. These are the genomic equivalent of EST records


HomoloGene
automatically generated sets of homologous genes and their corresponding mRNA, genomic, and protein sequence data from selected eukaryotic organisms.
SNP
(Single Nucleotide Polymorphism) database is a central repository for single nucleotide polymorphisms, microsatellites, and small-scale insertions and deletions


BioProject
complete and incomplete (in-progress) large-scale molecular projects including genome sequencing and assembly, transcriptome, metagenomic, annotation, expression and mapping projects.
BioSample
descriptions of biological source materials used in studies that have data in other NCBI molecular databases such as Assembly, Nucleotide and SRA.

Other databases

  • MedGen
  • ClinVar
  • OMIM
  • PopSet
  • PubChem BioAssay
  • PubChem Compound
  • PubChem Substance
  • UniGene
  • GTR

Let’s take a look
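Besides the website, the same NCBI databases can be queried from a program through the Entrez E-utilities web service. The sketch below only builds the request URLs for a search (`esearch.fcgi`) and a FASTA download (`efetch.fcgi`) and does not contact the server. The function names `esearch_url` and `efetch_fasta_url` are hypothetical helpers made up for this example, and the search term and accession are placeholders.

```python
from urllib.parse import urlencode

# Base address of the NCBI Entrez E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build a search URL for one Entrez database (e.g. 'protein', 'pubmed')."""
    query = urlencode(
        {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    )
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_fasta_url(db: str, ids: list) -> str:
    """Build a URL that downloads the given records in FASTA format."""
    query = urlencode(
        {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


if __name__ == "__main__":
    # Placeholder queries, chosen only to illustrate the URL layout.
    print(esearch_url("protein", "insulin", retmax=5))
    print(efetch_fasta_url("nucleotide", ["NC_000913.3"]))
```

Pasting either printed URL into a browser shows the raw answer from NCBI. For anything beyond a few requests, NCBI's usage guidelines for E-utilities (request rate, API key) apply.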