Class 1: Why do we care about Bioinformatics?

Bioinformatics

Andrés Aravena

September 27, 2022

Welcome to “Bioinformatics”

Today’s ideas

  • What “Bioinformatics” is and is not
  • Why you should care
  • How to get bioinformatic data for free
  • What kind of data we can get
  • What is important in the data

Bioinformatics

what it is and what it isn’t

Molecular Biology 101

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, dev.args = list(bg = "transparent"),
                      fig.align = "center", dev = "png", cache = FALSE)
```

  • DNA
  • RNA
  • Proteins
  • Metabolism

What is Bioinformatics?

  • Genomics
    • sequences of DNA, RNA, AA
  • Transcriptomics
    • gene expression
  • Proteomics
    • 3D structure and interactions
  • Metabolomics
    • metabolites

What is Bioinformatics not?

  • Using computers in a hospital
  • Handling patient information
  • Laboratory Information Management
  • Microscope image analysis

Big picture

for this course

Genomics

  • DNA sequencing
  • Pairwise Alignment
  • Multiple Alignment
  • Genome Assembly
  • Primer design
  • Finding Binding Sites

Transcriptomics

Measuring gene expression

  • qPCR
  • Microarrays
  • RNAseq

Mostly about statistics

Proteomics

  • Find protein sequence
    • mass spectrometry
  • Find protein structures
    • X-ray diffraction analysis
    • Computational Biology prediction
  • Find protein-protein interactions

What we should do here

  • Role
  • Concepts
  • Statistics
  • Access
  • Tools
  • Pathways
  • Metagenomics
  • Scripting
  • Software
  • Computational environment

Sayres, et al. “Bioinformatics Core Competencies for Undergraduate Life Sciences Education.”
PLoS ONE 13, no. 6 (2018): 1–20. https://doi.org/10.1371/journal.pone.0196878.

Role

Understand the role of computation and data mining in hypothesis-driven processes within the life sciences

Concepts

Understand computational concepts used in bioinformatics

  • meaning of algorithm
  • bioinformatics file formats
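
FASTA is one of the most common bioinformatics file formats: a header line starting with `>`, followed by one or more lines of sequence. The following is a minimal sketch in base R, assuming the text starts with a header line. The function name `parse_fasta` and the two records are made up for illustration.

```r
# Parse FASTA-formatted lines into a named character vector (base R only).
# Assumes the first non-empty line is a header starting with ">".
parse_fasta <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  # Each header starts a new record; cumsum gives every line its record number
  record <- cumsum(is_header)
  # Join the sequence lines that belong to the same record
  sequences <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")
  headers <- sub("^>", "", lines[is_header])
  setNames(as.character(sequences), headers[as.integer(names(sequences))])
}

# Hypothetical records, not real database entries
example_fasta <- c(
  ">seq1 hypothetical example",
  "ATGGCGTACG",
  "TTAGC",
  ">seq2 another made-up record",
  "GGGCCCAT"
)

sequences <- parse_fasta(example_fasta)
nchar(sequences)  # length of each sequence
```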

Statistics

Know statistical concepts used in bioinformatics

  • E-value
  • z-scores
  • t test
  • type-1 and type-2 errors
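
Some of these concepts can be tried directly in R. This is a small sketch with made-up measurements (the numbers and the names `control` and `treated` are invented for illustration, not real data). It computes z-scores, runs a t test, and compares the p-value against a chosen type-1 error rate.

```r
# Made-up expression measurements, for illustration only
control <- c(10.1, 9.8, 10.4, 10.0, 9.9)
treated <- c(11.2, 11.5, 10.9, 11.8, 11.1)

# z-score: how many standard deviations each value lies from the sample mean
z_scores <- (control - mean(control)) / sd(control)

# Two-sample t test (R uses the Welch version by default)
result <- t.test(treated, control)
result$p.value

# alpha is the type-1 error rate we accept:
# the chance of reporting a difference when there is none
alpha <- 0.05
is_significant <- result$p.value < alpha
is_significant
```

A type-2 error is the opposite case: failing to report a difference that really exists.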

Access genomic data

Know how to access genomic data

  • NCBI nucleotide databases
  • EBI
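
Besides the web pages, NCBI data can be requested by URL through the Entrez E-utilities. The sketch below only builds an `efetch` URL for a nucleotide record in FASTA format. The helper name `efetch_fasta_url` is made up for this example, and the accession `NC_001416` is used only as a sample identifier. The download line is commented out because it needs an internet connection.

```r
# Build an NCBI E-utilities efetch URL for a nucleotide record in FASTA format.
# efetch_fasta_url is a hypothetical helper written for this example.
efetch_fasta_url <- function(accession, db = "nuccore") {
  paste0(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
    "?db=", db,
    "&id=", accession,
    "&rettype=fasta&retmode=text"
  )
}

url <- efetch_fasta_url("NC_001416")
url

# Uncomment to download the record (requires network access):
# fasta_lines <- readLines(url)
```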

Use genomic Tools

Be able to use bioinformatics tools to analyze genomic data

  • BLASTN
  • genome browser

Access expression

Know how to access gene expression data

  • UniGene
  • GEO
  • SRA

Tools expression

Be able to use bioinformatics tools to analyze gene expression data

  • GeneSifter
  • DAVID
  • ORF Finder

Access proteomic data

Know how to access proteomic data

  • NCBI protein databases

Tools proteomic

Be able to use bioinformatics tools to examine protein structure and function

  • BLASTP
  • Cn3D
  • PyMOL

Access metabolomic

Know how to access metabolomic and systems biology data

  • Human Metabolome Database

Pathways

Be able to use bioinformatics tools to examine the flow of molecules within pathways/networks

  • Gene Ontology
  • KEGG

Metagenomics

Be able to use bioinformatics tools to examine metagenomics data

  • MEGA
  • MUSCLE

Scripting

Know how to write short computer programs as part of the scientific discovery process

  • write a script to analyze sequence data
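
As an idea of what such a script can look like, here is a minimal sketch in base R. The function names `gc_content` and `reverse_complement` and the input sequence are assumptions made for this example. It assumes the input contains only the letters A, C, G and T.

```r
# Fraction of bases that are G or C
gc_content <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  sum(bases %in% c("G", "C")) / length(bases)
}

# Reverse complement: swap each base for its partner, then reverse the order
reverse_complement <- function(dna) {
  complemented <- chartr("ACGT", "TGCA", toupper(dna))
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

dna <- "ATGCGC"            # made-up sequence
gc_content(dna)            # 4 of the 6 bases are G or C
reverse_complement(dna)    # "GCGCAT"
```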

Software

Be able to use software packages to manipulate and analyze bioinformatics data

  • Geneious
  • Vector NTI Express
  • spreadsheets

Computational environment

Operate in a variety of computational environments to manipulate and analyze bioinformatics data

  • Mac OS, Windows
  • web- or cloud-based
  • Unix/Linux command line

What we really do here

We focus on how to understand results

  • Role: What is bioinformatics
  • Access: using NCBI, EBI
  • Concepts: file formats and more
  • Tools: understanding tools output
  • Statistics: E-values, type-1 and type-2 errors

More Concepts

  • Pairwise Alignment
    • Global
    • Semi-global
    • Local
  • Multiple Alignment
    • Cost
    • Heuristics
  • Trees
    • Taxonomy
    • Phylogenetic
    • Ontology

Why you should care

about bioinformatics

Technology changes fast

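```{r fig.width=4.5, fig.height=5.5}
library(readr)
library(ggplot2)

# Sequencing cost table; Date is parsed as abbreviated month and 2-digit year
sequencingcostdata <- read_delim("../../../static/sequencingcostdata.txt", "\t",
  escape_double = FALSE, trim_ws = TRUE,
  col_types = cols(Date = col_date(format = "%b-%y")))

# Cost per genome over time on a log scale, shaded down to USD 1000
# (ggplot() replaces qplot(), which is deprecated in recent ggplot2 versions)
ggplot(sequencingcostdata, aes(x = Date, y = `Cost per Genome`)) +
  geom_point(colour = "red") +
  geom_ribbon(aes(ymin = 1e3, ymax = `Cost per Genome`), fill = "red", alpha = 0.2) +
  scale_y_log10() +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "transparent", colour = NA))
```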

In 2001, the cost of sequencing the first human genome was about USD 10^8 (100 million)

Today you can have your own genome for 1000 USD

The problem is no longer how to do the experiment

Instead, it is how to make sense of the results

Manual jobs are now done by computers

Will a robot replace you?

Four Paradigms of Science

According to Microsoft

1. Empirical

(since prehistoric times)

  • observation of isolated facts
  • description of related facts
  • e.g. Botany, naming stars, Arab astronomers, Galileo, Tycho Brahe, Carl Linnaeus

2. Theoretical

(Renaissance)

  • Abstract models and theories
  • Usually expressed in mathematical formulas
  • Correct predictions validate the models
  • e.g. Mendel’s laws of inheritance, Darwin’s theory of natural selection, Kepler’s laws of planetary motion, Newton’s law of gravity

3. Simulation Based

(Mid 20th century)

  • Models that cannot be expressed in formulas
  • Formulas that cannot be solved
  • e.g. Protein structure prediction, three body problem, galaxy modeling
  • Computational Astronomy, Computational Biology

4. Data Based

(21st century)

  • Discovering patterns hidden in data
  • Huge volumes of data
  • Complex interactions
  • e.g. Bioinformatics, Astroinformatics, Data mining
  • Big Data, Machine Learning

We need data

International Nucleotide Sequence Database Collaboration

There are three large data repositories

  • National Center for Biotechnology Information, NCBI
    • National Library of Medicine
      • National Institutes of Health, USA
  • European Bioinformatics Institute, EMBL-EBI
    • European Molecular Biology Laboratory
  • DNA Data Bank of Japan (DDBJ)
    • National Institute of Genetics (NIG) Japan

They all have the same data

These three databases exchange all their sequence data,
but each may organize it in a different structure

All data is available for free

Data from research paid for with public money must be uploaded here

Good journals also require authors to upload their data