Class 12: Course Summary

Bioinformatics

Andrés Aravena

January 4, 2024

The big picture

Each day has 24 hours

We had better choose well what to do with our time

Our needs, according to A. Maslow

Levels of the pyramid

  • Physiological needs: air, food, water, sleep
  • Safety Needs: Security, Order, and Stability
  • Love and Belonging: family and friends
  • Esteem: feeling comfortable about success, being recognized
  • Cognitive: intellectual stimulation, exploration
  • Aesthetic: harmony, order and beauty
  • Self-realization: morality, creativity, problem solving, acceptance, spontaneity
  • Transcendence: connection to “oneself, to significant others, to human beings in general, to other species, to nature, and to the cosmos”

A job “just for safety” is boring

😒😞😔😟😕🙁😣😖😫😩😢😭😵

Let’s do Science

A scientist’s work is to understand Nature

We start by Observing Nature, usually measuring values.

These are exploratory experiments.

We study this in other courses.

The thing we study must be reproducible, and we need to see that repetition.

We can find them using plots, linear models, clustering, etc.

This is the most important part.

Good answers to bad questions are useless.

Good questions are good, even if we don’t have answers

We answer these questions using models and explanations

Valid models should make predictions that we can test in the lab…

These are validation experiments.

If the results do not match the prediction, we know that the explanation is wrong. Two steps back.

Now we publish our data and model, so other scientists can validate or reject it.

The final validation is to be published.

If the paper is accepted and published, our work becomes part of our shared human knowledge.

The goal of Science is to produce new Knowledge.

When we observe Nature we use our previous Knowledge

We look for new Patterns that raise new Questions.

“Noise becomes Signal”

Four Paradigms of Science

1 Empirical

  • Observation of isolated facts
  • Description of relationships
  • e.g. Botany, naming stars

2 Theoretical

  • Abstract models and theories
  • Usually expressed in mathematical formulas
  • Correct predictions validate the models
  • e.g. Mendel’s laws of inheritance, Newton’s law of Gravity

3 Simulation Based

  • Models that cannot be expressed in formulas
  • Formulas that cannot be solved
  • e.g. Protein structure prediction, Genetic Algorithms

4 Data Based

  • Discovering patterns hidden in data
  • Huge volumes of data
  • Complex interactions
  • e.g. Bioinformatics, Data mining

Summary of this course

Before midterm

  1. Why do we care about Bioinformatics?
    What is and what is not Bioinformatics
  2. Taxonomy. How to understand the universe.
  3. Comparing sequences. How many genetic codes?
  4. Global and Local Alignment.
    How to know if (parts of) two sequences are similar.
  5. Finding Local Alignments.
    Looking for local matches is different from global ones. We need to use scores. They make more biological sense.
  6. Understanding BLAST. It is not like Google. Results depend on the options you choose. What are the options?

After midterm

  1. Trees representing distance.
    Trees are used to represent phylogenetic relationships
  2. Multiple Sequence Alignment.
    What is conserved among several sequences? What are the polymorphisms?
  3. Phylogenetic Trees.
    Building a time machine, and failing.
  4. DNA Melting Temperature. Stability.
  5. Designing primers. Specificity, selectivity.
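
As a reminder of the melting temperature topic, here is a minimal sketch using the Wallace rule, Tm = 2(A+T) + 4(G+C) in °C. This is a common rule of thumb for short oligonucleotides and is an assumption here; it may not be the formula used in class. The function name `wallace_tm` and the example primer are made up for illustration.

```python
def wallace_tm(primer):
    """Estimate melting temperature (degrees C) with the Wallace rule.

    Each A or T contributes 2 degrees, each G or C contributes 4 degrees.
    The rule is only a rough guide, intended for short primers.
    """
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Hypothetical 12-nt primer: 6 A/T and 6 G/C
print(wallace_tm("ATGCATGCATGC"))
```

G and C count more because a G-C pair has three hydrogen bonds and an A-T pair has two, so GC-rich primers are more stable.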

Missing parts

How to find patterns without aligning

DNA Sequencing

How can we know the genome of an organism?

What is the quality of a DNA read? What does quality 30 mean?

FASTQ file format for sequences with quality scores
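
A short sketch of how quality scores work may help here. It uses the Phred definition Q = -10·log10(p), where p is the probability that a base call is wrong, and assumes the quality line uses the common ASCII offset of 33. Older FASTQ variants used a different offset, so treat `offset=33` as an assumption. The function names and the example quality string are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality corresponding to an error probability p."""
    return -10 * math.log10(p)


def decode_quality_line(line, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    Assumes each character encodes (score + offset) as its ASCII code.
    """
    return [ord(char) - offset for char in line]


# Quality 30 means one expected error in 1000 base calls
print(phred_to_error_prob(30))

# Made-up quality string: '?' is 30, '5' is 20, 'I' is 40
print(decode_quality_line("?5I"))
```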

Mapping Reads to Reference

Reads are mapped to the genome using one of many tools (bowtie, bwa, hisat)

All these programs give their results in SAM/BAM format. What is that format?

What are SAM files? What are they used for?

What is the difference between SAM and BAM files?

Exercises

PCR test for COVID

Tested on a sample of patients, as follows

        COVID+  COVID-  Total
PCR+       135       2    137
PCR-         3     216    219
Total      138     218    356

How many false positives do we have? How many false negatives?
What is the sensitivity of this test? What is the specificity?
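
One way to check your answer is the sketch below. It assumes the table is read with rows as the PCR result and columns as the true COVID status, so PCR+/COVID- cases are false positives and PCR-/COVID+ cases are false negatives. The variable and function names are illustrative.

```python
# Counts taken from the table (rows: PCR result, columns: true status)
true_pos = 135   # PCR+ and COVID+
false_pos = 2    # PCR+ but COVID-
false_neg = 3    # PCR- but COVID+
true_neg = 216   # PCR- and COVID-


def sensitivity(tp, fn):
    """Fraction of sick people that the test detects."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of healthy people that the test clears."""
    return tn / (tn + fp)


sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)
print(f"sensitivity = {sens:.3f}")
print(f"specificity = {spec:.3f}")
```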

Harder question

If you apply the previous test on a random person in the street (not showing any symptoms), and the test is positive, what is the probability that the person really has COVID?

Use the sensitivity and specificity values from the previous question, and assume that the prevalence of COVID is 1%.
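
A sketch of the calculation with Bayes' theorem follows. It recomputes sensitivity and specificity from the table counts (135 true positives, 3 false negatives, 216 true negatives, 2 false positives) and uses the 1% prevalence given above. The function name `positive_predictive_value` is only a label chosen for this example.

```python
def positive_predictive_value(sens, spec, prevalence):
    """P(sick | test positive), by Bayes' theorem."""
    # Sick people who test positive
    true_pos_rate = sens * prevalence
    # Healthy people who test positive by mistake
    false_pos_rate = (1 - spec) * (1 - prevalence)
    return true_pos_rate / (true_pos_rate + false_pos_rate)


sens = 135 / (135 + 3)   # sensitivity from the table
spec = 216 / (216 + 2)   # specificity from the table
ppv = positive_predictive_value(sens, spec, prevalence=0.01)
print(f"P(COVID | PCR+) = {ppv:.2f}")
```

Even with a test this accurate, the low prevalence means that false positives from the many healthy people are almost as numerous as true positives from the few sick ones.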

Trees

Please write the NEWICK code for this tree

Draw this tree

(((F:73,A:41):19,(C:30,D:55):82):14,(B:78,E:90):48);
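
To check a hand-drawn tree, a minimal sketch of a Newick reader is given below. It is not a full implementation of the format: it assumes plain names without quotes or comments, and the functions `parse_newick`, `print_tree` and `leaf_depths` are written for this example only. It prints the tree as an indented outline and computes the distance from the root to each leaf.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dictionaries.

    Each node is {'name': str, 'length': float, 'children': list}.
    Quoted names and comments are not supported in this sketch.
    """
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            pos += 1  # skip ")"
        # Optional node name
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ":"
        length = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()


def print_tree(node, depth=0):
    """Print the tree as an indented outline; '*' marks unnamed nodes."""
    label = node["name"] or "*"
    print("  " * depth + f"{label}:{node['length']:g}")
    for child in node["children"]:
        print_tree(child, depth + 1)


def leaf_depths(node, dist=0.0):
    """Return {leaf name: total branch length from the root}."""
    dist += node["length"]
    if not node["children"]:
        return {node["name"]: dist}
    result = {}
    for child in node["children"]:
        result.update(leaf_depths(child, dist))
    return result


tree = parse_newick("(((F:73,A:41):19,(C:30,D:55):82):14,(B:78,E:90):48);")
print_tree(tree)
print(leaf_depths(tree))
```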

Algorithm vs Heuristic

What is the difference?

Why do we need both?

Examples of each one