Class 4: Automatization of NCBI searches

Bioinformatics

Andrés Aravena

October 07, 2021

Other taxonomies

There are other classifications used in biology

  • EC numbers are a taxonomy of enzymatic reactions
  • Clusters of Proteins are a taxonomy of proteins

Ontologies instead of taxonomies

In a taxonomy the relationships are “is-a”

  • a primate is a mammal
  • a mammal is a vertebrate
  • a vertebrate is an animal

In an ontology other relationships are possible. For example

  • a toe is part of a foot
  • a foot is part of a leg
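
The difference can be sketched in code: a taxonomy needs only one kind of edge, while an ontology labels every edge with its relationship. The Python sketch below reuses the examples from this slide. The triple layout and the function name `ancestors` are illustrative choices, not a standard format, and the sketch assumes each term has at most one parent per relation.

```python
# Relationships stored as (subject, relation, object) triples.
# A taxonomy uses only "is_a"; an ontology may mix several relations.
TRIPLES = [
    ("primate", "is_a", "mammal"),
    ("mammal", "is_a", "vertebrate"),
    ("vertebrate", "is_a", "animal"),
    ("toe", "part_of", "foot"),
    ("foot", "part_of", "leg"),
]


def ancestors(term, relation, triples=TRIPLES):
    """Follow one relation type upward from `term` and return the chain.

    Assumes each term has at most one parent for the given relation.
    """
    parents = {s: o for s, r, o in triples if r == relation}
    chain = []
    while term in parents:
        term = parents[term]
        chain.append(term)
    return chain


print(ancestors("primate", "is_a"))  # ['mammal', 'vertebrate', 'animal']
print(ancestors("toe", "part_of"))   # ['foot', 'leg']
```

Asking for the "is_a" ancestors of a toe returns an empty list, because a toe is not a kind of foot: the relation label matters.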

Gene Ontology

This is probably the most important ontology for molecular biologists

GFF

Pipelines

Pipelines: putting all together

When we design molecular biology experiments, or when we analyze their results, we need to chain several tools together

Today we are going to see an example using the NCBI website

How many genes are in each organism?

Filter results: only Legumes

Most of the time it is a good idea to check the Taxonomy database

Each sequence on GenBank is tagged with a taxon id

Using taxid is more precise than using common names

For example, a protein from human can be labeled “95% similar to mouse”

Is that a human or a mouse protein?
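
A search restricted by taxon id avoids this ambiguity, because it matches the organism tag of the record and not the words in its description. The sketch below builds such a query term as a string. The helper names `organism_filter` and `build_term` are hypothetical. The `txid…[Organism:exp]` syntax is the one the NCBI search box accepts, where `exp` also includes descendant taxa and `noexp` does not.

```python
def organism_filter(taxid, include_descendants=True):
    """Return an Entrez term restricting a search to one taxon id."""
    field = "Organism:exp" if include_descendants else "Organism:noexp"
    return f"txid{taxid}[{field}]"


def build_term(keywords, taxid):
    """Combine a free-text query with a taxon id restriction."""
    return f"({keywords}) AND {organism_filter(taxid)}"


# 9606 is the NCBI taxon id of Homo sapiens, 10090 of Mus musculus.
print(build_term("insulin", 9606))
# (insulin) AND txid9606[Organism:exp]
```

With this term, a human protein described as "similar to mouse" is still found under 9606 and not under 10090.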

Downloading

For your convenience you can download the sequences

  • Decide Format
  • Decide Content

In this case we only need accession ids

Save your search strategy

It is essential that your protocol can be replicated

It is a very good idea to save the search strategy in a file

It is also wise to save the output in a text file

Separate by tab or by comma
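
As one possible way to do this, the sketch below writes the search strategy to a JSON file and the results to a tab-separated text file. The function name `save_search`, the column names, and the keys of the strategy dictionary are assumptions made for illustration. The accession values are placeholders, not real search results.

```python
import csv
import json


def save_search(strategy, records, strategy_path, output_path, delimiter="\t"):
    """Write the search strategy as JSON and the results as delimited text."""
    with open(strategy_path, "w", encoding="utf-8") as handle:
        json.dump(strategy, handle, indent=2)
    with open(output_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        writer.writerow(["accession", "taxid"])  # header row
        writer.writerows(records)


# Placeholder values for illustration only, not real search results.
strategy = {
    "database": "protein",
    "term": "(insulin) AND txid9606[Organism:exp]",
    "date": "2021-10-07",
}
records = [("ACC_0001", 9606), ("ACC_0002", 9606)]
```

Calling `save_search(strategy, records, "strategy.json", "results.tsv")` would write both files. Passing `delimiter=","` would produce a comma-separated file instead.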

It is boring to do it one by one

And it takes a lot of time

It is easy to make mistakes

It is hard to replicate

Can we do it automatically?

E-tools: Entrez Pipelines

ESearch -> ESummary
ESearch -> EFetch
EPost -> ESummary
EPost -> EFetch
ESearch -> ELink
EPost -> ELink
EPost -> ESearch
ELink -> ESearch
ESearch -> ELink -> ESummary
ESearch -> ELink -> EFetch
EPost -> ESearch -> ESummary
EPost -> ESearch -> EFetch
EPost -> ELink -> ESearch -> ESummary
EPost -> ELink -> ESearch -> EFetch
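
As an illustration of the simplest chain, ESearch -> EFetch, the sketch below builds the two request URLs and shows how the second step reuses the result of the first through the history server (`usehistory=y`, `QueryKey`, `WebEnv`). It does not contact NCBI: the reply is a hand-written sample in the shape ESearch returns, with made-up values. The function names are hypothetical. A real script would send the URLs with something like `urllib.request.urlopen` and respect NCBI's request rate limits.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db, term):
    """URL for ESearch; usehistory=y keeps the result set on the server."""
    return BASE + "esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "usehistory": "y"})


def parse_esearch(xml_text):
    """Extract the hit count and the history handles from an ESearch reply."""
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext("Count")),
        "query_key": root.findtext("QueryKey"),
        "webenv": root.findtext("WebEnv"),
    }


def efetch_url(db, history, rettype="acc", retmode="text"):
    """URL for EFetch that reads the ids stored by the previous ESearch."""
    return BASE + "efetch.fcgi?" + urlencode({
        "db": db,
        "query_key": history["query_key"],
        "WebEnv": history["webenv"],
        "rettype": rettype,   # "acc" asks for accession numbers only
        "retmode": retmode,
    })


# Hand-written reply in the shape ESearch returns; the values are made up.
SAMPLE_REPLY = """<eSearchResult>
  <Count>2</Count><RetMax>2</RetMax><RetStart>0</RetStart>
  <QueryKey>1</QueryKey><WebEnv>EXAMPLE_WEBENV</WebEnv>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""

history = parse_esearch(SAMPLE_REPLY)
print(esearch_url("protein", "txid9606[Organism:exp]"))
print(efetch_url("protein", history))
```

Passing the history handles instead of a list of ids keeps the URLs short even when the search returns thousands of records.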

Map of E-tools

Use your favorite language

There are Entrez libraries for most languages

For example in R it is called rentrez

There is a command line version, and versions for all major computer languages

Videos

  • Tutorials: General NCBI: Download a custom set of records OC74-DpkWjE 191 20100524
  • Tutorials: General NCBI: Retrieve Sequences for an Organism sK3ykyInU8o 96 20100524
  • Tutorials: General E-Utilities Introduction BCG-M5k-gvE 226 20120413
  • Tutorials: PubMed Use MeSH to Build a Better PubMed Query uyF8uQY9wys 183 20130214
  • Tutorials: PubMed PubMed Advanced Search Builder dncRQ1cobdc 147 20111212
  • Tutorials: PubMed Need the Full Text Article? b0Rk_zmMaWw 123 20130424
  • Tutorials: PubMed PubMed: The Filters Sidebar 696R9GbOyvA 122 20121210
  • Tutorials: PubMed PubMed Commons KcsaEPAaS4o 726 20150602
  • Tutorials: PubMed Pubmed for Scientists iTW9Gboters 2719 20151124
  • Tutorials: PubMed Finding Genes in PubMed WtDTOeI9wB8 710 20151207
  • Tutorials: PubMed How You and Your Journal Club Can Contribute Using PubMed Commons dpM1S4XLgak 768 20170713
  • Tutorials: PubMed PubMed Tools and ORCID ID’s for Authors IXzPOfN2WF4 1375 20171011
  • Tutorials: PubMed A New PubMed: Highlights for Information Professionals O0Dg8eGfeRg 2212 20191002
  • Tutorials: PubMed PubMed: Using the Advanced Search Builder IHhTDqiNQK8 192 20200727