Class 23: Things that you must know

Computing for Molecular Biology 2

Andrés Aravena, PhD

4 June 2021

The Spanish word for “random” is “azar”

Randomness is important to every culture

Things you have to know

We have three levels of knowledge

  • Knowing that …
  • Knowing how to …
  • Understanding

We need the three levels

You must know that …

  • An experiment is a random process
    • We don’t know the result until we do the experiment
  • An experiment produces a single outcome
    • Cell survival, concentration of DNA, temperature
    • The outcome may be logical, factor, numeric or character

You must know that …

  • Replicating the experiment produces a sample

    • A sample is a vector
  • Experiments produce samples

  • We care about populations

You must know that …

Populations are BIG.

  • Like “all people on the planet” or “all experiments in all parallel universes”

  • In the class we use known populations

  • In real life, populations are (at least partially) unknown

You must understand …

  • Experiments give samples, but we care about populations

  • What happens in the sample depends on the population

You must understand …

  • By looking at the samples we can learn about the population

  • Populations are big

    • We assume they have infinite size to make calculations easier

You must know how to

  • Simulate experiments using the sample() function

  • Prepare the outcomes vector

  • Use sample(outcomes, size=n) to get n random elements from outcomes
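A minimal sketch of these steps, assuming a hypothetical experiment whose outcome is one DNA base. The names `outcomes`, `n` and `bases` are illustrative. By default sample() draws without replacement, so here n cannot be bigger than length(outcomes).

```r
# Hypothetical outcomes vector: the four DNA bases
outcomes <- c("A", "C", "G", "T")

# Number of random elements we want
n <- 3

# Take n random elements from outcomes (default: without replacement)
bases <- sample(outcomes, size = n)
print(bases)
```

Running it again gives a different vector each time, because the experiment is a random process.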

You must know that …

  • Most times (but not always) we use replace=TRUE

  • This allows size bigger than length(outcomes)

  • More important: probabilities do not change
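A small sketch of the difference, again assuming a hypothetical outcomes vector with the four DNA bases. With replace=TRUE we can ask for 20 elements from a vector of length 4. Without it the same request is an error, which we catch here with try() only to show it.

```r
outcomes <- c("A", "C", "G", "T")

# With replacement: size can be bigger than length(outcomes)
dna <- sample(outcomes, size = 20, replace = TRUE)
print(dna)

# Without replacement the same request fails
result <- try(sample(outcomes, size = 20), silent = TRUE)
failed <- inherits(result, "try-error")
print(failed)
```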

You must know that …

  • When the population is small, sampling will change the population
    • We do not do this in this course
  • When the population is big, sampling has a very small effect
    • The difference between 1E8 and 1E8 - 1 is not important

You must know that …

  • When populations have infinite size, taking a sample has no effect

  • When we replace the sample, there is no effect on the population

  • Sampling with replacement is the same as having an infinite population

You must understand …

  • Samples are never the same, but they are similar

  • Bigger sample sizes will produce more similar results
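One way to see this is to replicate the same experiment many times with a small and a big sample size. This is only a sketch: the survive/die outcomes, the sizes 10 and 1000, and the seed are assumptions chosen for illustration.

```r
set.seed(1)  # fixed seed only to make the sketch reproducible

# Hypothetical experiment: a cell survives (TRUE) or not (FALSE)
outcomes <- c(TRUE, FALSE)

# Proportion of survivors in each of 1000 replicated samples
prop_small <- replicate(1000, mean(sample(outcomes, size = 10, replace = TRUE)))
prop_big   <- replicate(1000, mean(sample(outcomes, size = 1000, replace = TRUE)))

# The spread between samples is smaller when the samples are bigger
print(sd(prop_small))
print(sd(prop_big))
```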

You must understand …

  • Samples tell us something about the population

You must know that …

  • The probability of an outcome is the proportion of that outcome in the population

  • In real life we usually do not know the probabilities, and we want to find them

You must know that …

  • In some cases we do know the probability of each outcome

  • Then we can simulate the experiment

You must know that …

  • When we use sample, each outcome can have a different probability
  • The probability distribution is a vector with the same length as outcomes
  • We can use the option prob= to change the probability distribution
  • If we do not use prob=, then all outcomes have the same probability
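A short sketch of the prob= option, assuming a hypothetical GC-rich genome. The vector `p` is made up for illustration and has one probability for each element of `outcomes`.

```r
set.seed(4)  # fixed seed only to make the sketch reproducible

outcomes <- c("A", "C", "G", "T")

# Hypothetical probability distribution, same length as outcomes
p <- c(0.1, 0.4, 0.4, 0.1)

# Simulate 10000 bases with these probabilities
dna <- sample(outcomes, size = 10000, replace = TRUE, prob = p)

# Empirical frequencies should be close to p
freq <- table(factor(dna, levels = outcomes)) / length(dna)
print(freq)
```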

You must know how to …

  • Decompose a complex random process into smaller parts
  • Simulate a complex random system and find the empirical frequencies
  • Draw the results in a bar plot
  • Use the option prob=p to change how sample() works
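These steps can be sketched with a toy system: the sum of two dice. The function name `roll_two_dice` and the number of simulations are assumptions for illustration. The complex process is decomposed into two simple ones (one sample() call draws both dice), then replicated to get empirical frequencies.

```r
set.seed(5)  # fixed seed only to make the sketch reproducible

# One experiment: roll two dice and add them
roll_two_dice <- function() {
  dice <- sample(1:6, size = 2, replace = TRUE)
  sum(dice)
}

# Replicate the experiment many times
results <- replicate(10000, roll_two_dice())

# Empirical frequency of each result
freq <- table(results) / length(results)
print(freq)

# Draw the frequencies in a bar plot
barplot(freq, xlab = "Sum of two dice", ylab = "Empirical frequency")
```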

You must know that …

  • These simulations are called “Monte-Carlo Methods”

  • This allows us to explore cases that have too many combinations

You must know that …

  • We cannot see all possible genes of length 1000bp

  • There are \(4^{1000} = 2^{2000} = 2^{10\times 200} ≈ 10^{3\times 200} = 10^{600}\) combinations

  • The age of the universe is ≈ \(4.32\times 10^{17}\) seconds
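The arithmetic can be checked in R using logarithms, because the number itself is too big for ordinary numeric values. The variable names and the rate of one sequence tested per second are assumptions for illustration. The exact exponent is about 602; the round figure of 600 comes from approximating \(2^{10}\) by \(10^3\).

```r
# Number of decimal digits of 4^1000, computed with logarithms
log10_combinations <- 1000 * log10(4)
print(log10_combinations)   # about 602

# The number itself overflows double precision
print(4^1000)               # Inf

# Age of the universe in seconds, as given on the slide
age_universe <- 4.32e17

# Testing one sequence per second (an assumed rate), how many
# "ages of the universe" would we need? Answer as a power of 10
print(log10_combinations - log10(age_universe))
```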

You must understand …

  • Testing all possible cases is impossible

  • Random sampling allows us to get an idea of all possible cases

  • More simulations give better approximations, but take longer

  • This is one of the most common uses of computers in Science

Events: You must know that …

  • An event is any logical question about the experiment outcome
    • such as “two people have the same birthday”
  • In R we can use functions, taking an outcome and returning TRUE or FALSE

Events: You must know that …

Outcomes and events are different

  • An event can be true for several different outcomes

  • An experiment produces only one outcome, but several events can be true for that outcome

Events: You must know how to …

Write a function to represent an event

  • Takes a sample (one or more outcomes) as input
  • Returns TRUE or FALSE depending on the event rule
  • For example:
    • Two people having the same birthday
    • A student passes this course
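A possible sketch of the birthday event. The function name `same_birthday`, the group size of 23 and the use of 365 equally likely days are assumptions for illustration. The function takes a sample and returns TRUE or FALSE.

```r
set.seed(6)  # fixed seed only to make the sketch reproducible

# Event: at least two people in the sample share a birthday
same_birthday <- function(birthdays) {
  any(duplicated(birthdays))
}

# One experiment: birthdays of 23 people, as day numbers
birthdays <- sample(1:365, size = 23, replace = TRUE)
print(same_birthday(birthdays))

# Empirical frequency of the event over many simulated groups
event_freq <- mean(replicate(10000,
  same_birthday(sample(1:365, size = 23, replace = TRUE))))
print(event_freq)
```

With 23 people the empirical frequency should be close to one half.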

You must know that …

  • A random variable is any numeric value that depends on the experiment outcome
    • such as “the number of people with epilepsy in our course”
  • In R we can use a numeric outcomes vector

You must know that …

  • In this case there is a
    • population average,
    • population variance, and
    • population standard deviation
  • In general we do not know the population average, and we want to know it
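When the population is known, as in class, these three values can be computed directly. The sketch below assumes a hypothetical population of 1000 people where 1 means "has the trait". Note that R's var() and sd() divide by n-1; here we divide by N, which for big populations gives almost the same result.

```r
# Hypothetical known population: 100 people with the trait, 900 without
population <- c(rep(0, 900), rep(1, 100))

# Population average
pop_mean <- mean(population)

# Population variance: average squared distance to the average
pop_var <- mean((population - pop_mean)^2)

# Population standard deviation
pop_sd <- sqrt(pop_var)

print(c(mean = pop_mean, var = pop_var, sd = pop_sd))   # 0.1, 0.09, 0.3
```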

You must know that …

The population standard deviation measures the population width

Chebyshev's theorem says \[ℙ(|x_i-\bar{\mathbf x}|≥ k⋅\text{sd}(\mathbf x))≤ 1/k^2\] It can also be written as \[ℙ(|x_i-\bar{\mathbf x}|≤ k⋅\text{sd}(\mathbf x))≥ 1-1/k^2\]

Chebyshev: You must know how to …

Find the population width for different values of \(k\)

At least 75% of the population is within 2 standard deviations of the average \[ℙ(|x_i-\bar{\mathbf{x}}|≤ 2⋅\text{sd}(\mathbf x))≥ 1-1/2^2\] \[ℙ(\bar{\mathbf{x}} -2⋅\text{sd}(\mathbf x)≤ x_i ≤ \bar{\mathbf{x}} +2⋅\text{sd}(\mathbf x)) ≥ 0.75\]

Chebyshev: You must know how to …

At least 88.9% of the population is within 3 standard deviations of the average \[ℙ(\bar{\mathbf{x}} -3⋅\text{sd}(\mathbf x)≤ x_i ≤ \bar{\mathbf{x}} +3⋅\text{sd}(\mathbf x)) ≥ 0.889\]
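The bound can be checked on a known population. This sketch assumes a hypothetical skewed population made with rexp(), chosen because it is clearly not Normal. The helper `inside` is an illustrative name for the fraction of the population within k standard deviations of the average.

```r
set.seed(2)  # fixed seed only to make the sketch reproducible

# Hypothetical known population, skewed on purpose
population <- rexp(100000, rate = 1)

pop_mean <- mean(population)
pop_sd   <- sd(population)

# Fraction of the population within k standard deviations of the average
inside <- function(k) {
  mean(abs(population - pop_mean) <= k * pop_sd)
}

# Observed fraction versus the Chebyshev lower bound 1 - 1/k^2
print(c(observed = inside(2), bound = 1 - 1/2^2))
print(c(observed = inside(3), bound = 1 - 1/3^2))
```

The observed fractions are never below the bound, and are usually well above it.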

Exercise

Which value of \(k\) will give you an interval containing at least 99% of the population?

(this 99% is called the confidence level)

You must understand …

Everything we measure will be in an interval

The interval depends on the population standard deviation and the confidence level

You must understand …

Chebyshev's theorem is always true, but in some cases it is pessimistic

In some cases we can have better confidence levels

You must know that …

  • You can take the average of a sample to estimate the population average

  • The sample average is a random variable. It changes with every experiment

You must know that …

  • When the sample is bigger, the sample average will be closer to the population average
    • This is called the Law of Large Numbers
  • Moreover, the sample average of a big sample will approximately follow a Normal distribution
    • This is called the Central Limit Theorem
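Both results can be explored by simulation. The sketch below assumes a hypothetical population of die rolls, whose average is 3.5. The sample sizes and the number of replicates are arbitrary choices for illustration.

```r
set.seed(3)  # fixed seed only to make the sketch reproducible

outcomes <- 1:6
pop_mean <- mean(outcomes)   # 3.5

# Law of Large Numbers: a bigger sample gives an average closer to 3.5
avg_small <- mean(sample(outcomes, size = 10, replace = TRUE))
avg_big   <- mean(sample(outcomes, size = 100000, replace = TRUE))
print(c(small = avg_small, big = avg_big, population = pop_mean))

# Central Limit Theorem: averages of many samples of size 100
averages <- replicate(5000, mean(sample(outcomes, size = 100, replace = TRUE)))

# The histogram should look approximately bell-shaped
hist(averages, main = "Sample averages", xlab = "Average of 100 rolls")
```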

Normal distribution

Here outcomes are real numbers

Any real number is possible

Probability of any \(x\) is zero (!)

We look for probabilities of intervals

Probabilities of Normal Distribution

≈95% of a Normal population is between \(\bar{\mathbf x}-2⋅\text{sd}(\mathbf x)\) and \(\bar{\mathbf x}+2⋅\text{sd}(\mathbf x)\)
≈99.7% of a Normal population is between \(\bar{\mathbf x}-3⋅\text{sd}(\mathbf x)\) and \(\bar{\mathbf x}+3⋅\text{sd}(\mathbf x)\)
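The share of a Normal population within k standard deviations of the average can be computed with pnorm(). The helper name `within_k` is an assumption for illustration. It uses the standard Normal (mean 0, sd 1), which gives the same proportions as any other Normal.

```r
# Probability that a Normal value is within k standard deviations
# of the average
within_k <- function(k) {
  pnorm(k) - pnorm(-k)
}

print(within_k(2))   # about 0.9545
print(within_k(3))   # about 0.9973
```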

You must understand that …

  • The Chebyshev rule is always valid, but pessimistic
    • confidence intervals are big
  • If the probability distribution is Normal, we can have better confidence intervals

You must understand that …

  • Not all probability distributions are Normal
  • When the random process is a sum of many parts, then we may have a Normal distribution
    • That happens with experimental measurements
    • Also happens in many biological processes

Finding Normal confidence interval

If we have 95% of the population in the center, then we have 2.5% to the left and 2.5% to the right

We can find the \(k\) value using R

qnorm(0.025)
[1] -1.959964
qnorm(0.975)
[1] 1.959964

Finding Normal confidence interval (in general)

If we have \(1-\alpha\) of the population in the center, then we have \(\alpha/2\) to the left and \(\alpha/2\) to the right

qnorm(alpha/2)
qnorm(1-alpha/2)

Now you can find any interval
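The general recipe can be wrapped in a small function. This is only a sketch: the name `normal_interval` and its arguments are assumptions for illustration, and it is valid only when the population is Normal with known average `m` and standard deviation `s`.

```r
# Interval containing a given fraction (confidence) of a Normal population
normal_interval <- function(m, s, confidence = 0.95) {
  alpha <- 1 - confidence
  k <- qnorm(1 - alpha / 2)
  c(lower = m - k * s, upper = m + k * s)
}

# Standard Normal, 95% in the center
print(normal_interval(0, 1, confidence = 0.95))

# Same population, 99% in the center: a wider interval
print(normal_interval(0, 1, confidence = 0.99))
```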

The problem with confidence

You may have noticed that we never get 100% confidence

That is a fact of life. We have to accept it

To have very high confidence, we need wide intervals

But wide intervals are less useful
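The trade-off can be seen by computing the Normal \(k\) value for several confidence levels. The levels chosen here are arbitrary examples. Asking for exactly 100% confidence gives an infinite interval.

```r
# Some example confidence levels
confidence <- c(0.90, 0.95, 0.99, 0.999)

# Half-width of the interval, in standard deviations
k <- qnorm(1 - (1 - confidence) / 2)
print(data.frame(confidence, k))

# 100% confidence needs an infinitely wide interval
print(qnorm(1))   # Inf
```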