Class 2: Finding data online

Bioinformatics

Andrés Aravena

September 30, 2021

We need data

International Nucleotide Sequence Database Collaboration

There are three large data repositories

National Center for Biotechnology Information, NCBI
- National Library of Medicine
  - National Institutes of Health, USA
European Bioinformatics Institute, EMBL-EBI
- European Molecular Biology Laboratory
DNA Data Bank of Japan (DDBJ)
- National Institute of Genetics (NIG) Japan

They all have the same data

These three databases exchange all sequence data with each other
but they may organize it differently

All data is available for free

Research paid for with public money must be uploaded here

Good journals also require authors to upload their data

The NCBI website

GenBank and RefSeq

GenBank: genetic sequence database, an annotated collection of all publicly available DNA sequences. Anybody can upload directly.
RefSeq: curated subset of GenBank

DNA Databases

Nucleotide: most of the sequence data from GenBank, except environmental
SRA: sequencing data from next-generation sequencing platforms

Protein Databases

Protein: amino acid sequences from the translations of coding regions provided on nucleotide records in GenBank, plus sequences imported from outside data sources (PIR, UniProtKB/Swiss-Prot, Protein Data Bank)
Protein Clusters: collection of related protein sequences (clusters) consisting of Reference Sequence proteins encoded by complete prokaryotic genomes, eukaryotic organelle plasmids and genomes.

Protein Databases

Conserved Domains: protein domains represented by sequence alignments and profiles for protein domains conserved in evolution. It includes alignments of the domains to known three-dimensional protein structures.
Structure: Molecular Modeling Database (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB)

Gene Expression Omnibus

GEO Datasets: gene expression data sets from the Gene Expression Omnibus (GEO) repository of microarray data
GEO Profiles: individual gene expression profiles assembled from GEO
Probe: nucleic acid reagents designed for use in a wide variety of biomedical research applications including genotyping, gene expression studies, SNP discovery, genome mapping, and gene silencing

Databases

Taxonomy: names and phylogenetic lineages of the more than 350,000 species that have molecular data in the NCBI databases
MeSH (Medical Subject Headings): controlled vocabulary and classification system (ontology) used for indexing articles in PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts

Literature

PubMed: database of citations and abstracts for biomedical literature from MEDLINE and additional life science journals
PubMed Central (PMC): the U.S. National Library of Medicine's digital archive of life sciences journal literature. PMC contains full-text manuscripts deposited by authors or articles provided by the publisher

Literature

Bookshelf: full-text books that can be searched online and that are linked to PubMed records
NCBI Web Site Search: database of static NCBI web pages, documentation, and online tools
NLM Catalog: records for books, journals, audiovisuals, computer software, electronic resources, and other materials in the National Library of Medicine (NLM) collections

Databases

Assembly: genome assemblies. The same genome can have several versions
Gene: genes from completely sequenced genomes that have an active research community contributing gene-specific data
Genome: sequence and map data from whole genomes. They represent both completely sequenced genomes and those with sequencing in progress

Databases

BioProject: complete and incomplete (in-progress) large-scale molecular projects including genome sequencing and assembly, transcriptome, metagenomic, annotation, expression and mapping projects.
BioSample: contains descriptions of biological source materials used in studies that have data in other NCBI molecular databases such as Assembly, Nucleotide and SRA.

Other databases

MedGen
ClinVar
OMIM
PopSet
PubChem BioAssay
PubChem Compound
PubChem Substance
UniGene
GTR
BioSystems

Searching NCBI

“Clipboard” and “My Collections”

The Clipboard is a temporary place on the NCBI website to save records.

limited to 500 items on each database
lost after eight hours of inactivity

My Collections, which is part of the My NCBI service, is a more permanent place to save records.

You need to create an NCBI account to use My NCBI. It is easy and free

Pre-computed answers

There are two major kinds of relationships in the NCBI website:

computationally derived associations within a database (neighbors)
relationships based on information present on the records themselves (hard links)

Combining neighbors and hard links can be an especially effective method for navigating across data and finding the most useful information

Queries

NCBI Entrez queries

Searching NCBI offers many more options than Google

(do you know Google options?)

By default the query text is searched in any part of any database

But you can specify the fields you want to search in

Title of a paper
author
date
taxonomic id
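
A field restriction is just the search text followed by the field name in brackets. The following is a minimal sketch of building such terms; the helper name `tagged` is hypothetical, and the field names `Title` and `Organism` are assumed to be valid for the database being searched (available fields differ between databases).

```python
def tagged(term: str, field: str) -> str:
    """Return a search term restricted to one field, e.g. 'protease[Title]'."""
    return f"{term}[{field}]"

# Combine two field-restricted terms into one query string
query = " AND ".join([
    tagged("protease", "Title"),
    tagged("Escherichia coli", "Organism"),
])
print(query)
```

The resulting text can be pasted into the search box of the NCBI website.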

Entrez Examples

protease NOT hiv1[organism]: This will limit the search to all proteases, except those in HIV 1.
1000:2000[slen]: This limits the search to entries with lengths between 1000 and 2000 bases for nucleotide entries, or between 1000 and 2000 residues for protein entries.

Entrez Examples

Mus musculus[organism] AND biomol_mrna[properties]: This limits the search to mouse mRNA entries in the database. For common organisms, one can also select from the pulldown menu.

Entrez Examples

10000:100000[mlwt]: This limits the search to protein sequences with a calculated molecular weight between 10 kDa and 100 kDa.
src specimen voucher[properties]: This limits the search to entries that are annotated with a /specimen_voucher qualifier on the source feature.

Entrez Examples

all[filter] NOT environmental sample[filter] NOT metagenomes[orgn]: This excludes sequences from metagenome studies and uncultured sequences from anonymous environmental sample studies
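
The same query text can also be sent from a program instead of being typed on the website. The sketch below only builds the address for NCBI's E-utilities ESearch service and does not contact the network. The function name `esearch_url` and the choice of `retmax` and `retmode` values are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch address for an Entrez query (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode escapes spaces and brackets so the query is safe in a URL
    return ESEARCH + "?" + urlencode(params)

url = esearch_url("protein", "protease NOT hiv1[organism]")
print(url)
```

Opening the printed address in a browser should return the matching record identifiers, assuming the service is reachable.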

Creating advanced queries

Quotes " are important

The fields are written inside brackets []

Each database page includes an Advanced Search option

Combining queries

Entrez queries can be single words, short phrases, sentences, database identifiers, gene symbols, or names

AND: Finds documents that contain the terms on both sides of the operator. The intersection of both searches.

OR: Finds documents that contain either term. The union of both searches.

NOT: Finds documents that contain the term on the left but not the term on the right of the operator. The subtraction of the right side from the left side
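
The three operators behave like set operations on the results of each search. A small sketch with Python sets, using made-up record identifiers rather than real NCBI records:

```python
# Hypothetical record identifiers matched by two separate searches
protease_hits = {"P1", "P2", "P3", "P4"}
hiv1_hits = {"P3", "P4", "P5"}

both = protease_hits & hiv1_hits       # protease AND hiv1: intersection
either = protease_hits | hiv1_hits     # protease OR hiv1: union
only_left = protease_hits - hiv1_hits  # protease NOT hiv1: subtraction

print(sorted(both))
print(sorted(either))
print(sorted(only_left))
```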

Example

AND must be in uppercase. It is recommended to also use uppercase for OR and NOT

Operators are processed left-to-right

  promoters OR response elements NOT human AND mammals

Parentheses can be used to control the evaluation order
```
  g1p3 AND (response element OR promoter)
```
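
To see why the order matters, the sketch below evaluates `promoters OR response elements NOT human AND mammals` strictly left to right on made-up result sets, and compares it with one possible parenthesized reading. The numbers are hypothetical record identifiers and the function `evaluate` is an illustration, not how NCBI implements its search.

```python
from functools import reduce

# Hypothetical record identifiers returned by each single-term search
hits = {
    "promoters": {1, 2, 3},
    "response elements": {3, 4, 5},
    "human": {2, 4},
    "mammals": {1, 2, 3, 4, 6},
}
OPS = {"AND": set.intersection, "OR": set.union, "NOT": set.difference}

def evaluate(first, steps):
    """Apply (operator, term) steps one after another, left to right."""
    return reduce(lambda acc, step: OPS[step[0]](acc, hits[step[1]]),
                  steps, hits[first])

left_to_right = evaluate("promoters", [("OR", "response elements"),
                                       ("NOT", "human"),
                                       ("AND", "mammals")])

# promoters OR (response elements NOT (human AND mammals))
grouped = hits["promoters"] | (
    hits["response elements"] - (hits["human"] & hits["mammals"]))

print(sorted(left_to_right), sorted(grouped))
```

The two results differ, so the same words can select different records depending on how they are grouped.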

Dates and Other Ranges

Certain fields can accept ranges of values
- Publication Date, Modification Date, Accession, Molecular Weight, and Sequence Length
Low and high numbers are entered with a colon “:” between them followed by the field
```
  110:500[Sequence Length]
  2015/3/1:2016/4/30[Publication Date]
```
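
Range terms follow one pattern, so they are easy to generate. A minimal sketch, where `range_term` and `entrez_date` are hypothetical helper names; dates are written as zero-padded `YYYY/MM/DD`, which is assumed to be accepted in the same way as the unpadded form shown above.

```python
from datetime import date

def range_term(low, high, field: str) -> str:
    """Return 'low:high[field]', the Entrez syntax for a range of values."""
    return f"{low}:{high}[{field}]"

def entrez_date(d: date) -> str:
    """Format a date as year/month/day."""
    return d.strftime("%Y/%m/%d")

length = range_term(110, 500, "Sequence Length")
published = range_term(entrez_date(date(2015, 3, 1)),
                       entrez_date(date(2016, 4, 30)),
                       "Publication Date")

# Both restrictions must hold, so they are joined with AND
print(f"{length} AND {published}")
```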

NCBI online documentation

We can get a different explanation in the public documentation made by NCBI

https://ftp.ncbi.nlm.nih.gov/pub/education/Mod_Workshops/2016/June_UWashington/Workshop1_Navigating_NCBI/

All documents made by NCBI are public domain