Class 2: Finding data online

Bioinformatics

Andrés Aravena

23 October 2020

We need data

International Nucleotide Sequence Database Collaboration

There are three large data repositories

National Center for Biotechnology Information, NCBI
- National Library of Medicine
  - National Institutes of Health, USA
European Bioinformatics Institute, EMBL-EBI
- European Molecular Biology Laboratory
DNA Data Bank of Japan (DDBJ)
- National Institute of Genetics (NIG) Japan

They all have the same data

These three databases exchange all sequence data,
but they may structure it differently

All data is available for free

Data from research paid for with public money must be uploaded here

Good journals also require authors to upload their data

The NCBI website

GenBank and RefSeq

GenBank: genetic sequence database, an annotated collection of all publicly available DNA sequences. Anybody can upload directly.
RefSeq: curated subset of GenBank
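
One practical way to tell the two apart is the accession format: RefSeq accessions start with a two-letter prefix and an underscore (for example NM_ for mRNA, NP_ for protein, NC_ for complete genomic molecules), while GenBank accessions contain no underscore. A minimal sketch of that check, where the function name `is_refseq` and the exact pattern are assumptions made for illustration and not part of any NCBI tool:

```python
import re

# RefSeq-style accession: two capital letters, an underscore, optional
# letters (as in NZ_CP... records), digits, and an optional ".version".
# This pattern is an illustrative assumption, not an official NCBI rule.
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z]*\d+(\.\d+)?$")


def is_refseq(accession: str) -> bool:
    """Return True if the accession looks like a RefSeq accession."""
    return REFSEQ_PATTERN.match(accession.strip()) is not None


if __name__ == "__main__":
    for acc in ["NM_000546.6", "NC_000913.3", "U00096.3"]:
        print(acc, is_refseq(acc))
```

The check only looks at the shape of the identifier; it does not confirm that the record exists in either database.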

DNA Databases

Nucleotide: most of the sequence data from GenBank, except environmental
SRA (Sequence Read Archive): raw sequencing data from next-generation sequencing platforms

Protein Databases

Protein: amino acid sequences translated from the coding regions annotated on nucleotide records in GenBank, plus sequences imported from outside data sources (PIR, UniProtKB/Swiss-Prot, Protein Data Bank)
Protein Clusters: collection of related protein sequences (clusters) consisting of Reference Sequence proteins encoded by complete genomes and plasmids of prokaryotes and eukaryotic organelles.

Protein Databases

Conserved Domains: protein domains represented by sequence alignments and profiles for protein domains conserved in evolution. It includes alignments of the domains to known three-dimensional protein structures.
Structure: Molecular Modeling Database (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB)

Gene Expression Omnibus

GEO Datasets: gene expression data sets from the Gene Expression Omnibus (GEO) repository of microarray data
GEO Profiles: individual gene expression profiles assembled from GEO
Probe: nucleic acid reagents designed for use in a wide variety of biomedical research applications including genotyping, gene expression studies, SNP discovery, genome mapping, and gene silencing

Databases

Taxonomy: names and phylogenetic lineages of the more than 350,000 species that have molecular data in the NCBI databases
MeSH (Medical Subject Headings): controlled vocabulary and classification system (ontology) used for indexing articles in PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts

Literature

PubMed: database of citations and abstracts for biomedical literature from MEDLINE and additional life science journals
PubMed Central (PMC): the U.S. National Library of Medicine's digital archive of life sciences journal literature. PMC contains full-text manuscripts deposited by authors or articles provided by the publisher

Literature

Bookshelf: full-text books that can be searched online and that are linked to PubMed records
NCBI Web Site Search: database of static NCBI web pages, documentation, and online tools
NLM Catalog: records for books, journals, audiovisuals, computer software, electronic resources, and other materials in the National Library of Medicine (NLM) collections

Databases

Assembly: genome assemblies. The same genome can have several versions
Gene: genes from completely sequenced genomes that have an active research community contributing gene-specific data
Genome: sequence and map data from whole genomes, both completely sequenced genomes and those with sequencing in progress

Databases

EST (Expressed Sequence Tag): sequences from GenBank, typically short single-pass reads from cDNA libraries generated in survey projects
GSS (Genome Survey Sequence): sequences from GenBank. These are the genomic equivalent of EST records

Databases

HomoloGene: automatically generated sets of homologous genes and their corresponding mRNA, genomic, and protein sequence data from selected eukaryotic organisms.
SNP (Single Nucleotide Polymorphism): central repository for single nucleotide polymorphisms, microsatellites, and small-scale insertions and deletions

Databases

BioProject: complete and incomplete (in-progress) large-scale molecular projects including genome sequencing and assembly, transcriptome, metagenomic, annotation, expression and mapping projects.
BioSample: descriptions of biological source materials used in studies that have data in other NCBI molecular databases such as Assembly, Nucleotide and SRA.

Other databases

MedGen
ClinVar
OMIM
PopSet
PubChem BioAssay
PubChem Compound
PubChem Substance
UniGene
GTR


Let’s take a look
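
Besides the website, the same databases can be searched from a program through NCBI's E-utilities web service. The sketch below only builds the search URL for the `esearch` endpoint and does not contact the server. The helper name `build_esearch_url` and the example query are assumptions chosen for illustration; `db`, `term`, `retmax` and `retmode` are standard E-utilities parameters.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search endpoint.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 5) -> str:
    """Build an esearch URL for one NCBI database (hypothetical helper).

    db     -- database name, e.g. "nucleotide", "protein", "pubmed"
    term   -- search text, as you would type it in the website search box
    retmax -- maximum number of record identifiers to return
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode escapes spaces and brackets so the query is a valid URL.
    return ESEARCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # Example query (an assumption): E. coli records in Nucleotide.
    print(build_esearch_url("nucleotide", "Escherichia coli[Organism]"))
```

Opening the printed URL in a browser should return a list of record identifiers, which can then be looked up on the NCBI website.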