Class 15: Probabilities

Computing for Molecular Biology 2

Andrés Aravena, PhD

30 April 2021

Not about games and bets

People think that probabilities are about games

Instead they are really tools for thinking

Thinking about decisions when we have incomplete information

Thinking about the future

About the meaning of our experiment results

Do you need to understand probabilities?

Do you travel?

  • Which is the safest way to travel?
  • Which is the fastest way to travel?
  • How much do you want to pay for safety?
  • How much do you want to pay for speed?

Do you buy insurance?

Travel insurance, health insurance, sigorta (Turkish for "insurance")

  • How much do you pay for it?
  • How much will you get paid?
  • Is it worth it?

Why do you study?

  • You can work now and earn money today
  • What if you fail this course?
  • What if you do not find a job?

Will you do experiments?

  • How will you understand the results?
  • Are the results just “random noise”?
  • What are your expected results?
    • Allele frequency
    • Success rate of a treatment

All these questions are about probabilities

To understand the idea we will use games

  • Cards
  • Coins
  • Dice (one die, many dice)

Maybe other toys that are easy to understand

These are just to have easy examples

What can happen: outcomes

Each device has a set of possible outcomes:

For example a die has the following outcomes

⚀ ⚁ ⚂ ⚃ ⚄ ⚅

Cards

🂡 🂢 🂣 🂤 🂥 🂦 🂧 🂨 🂩 🂪 🂫 🂭 🂮 🂱 🂲 🂳 🂴 🂵 🂶 🂷 🂸 🂹 🂺 🂻 🂽 🂾 🃁 🃂 🃃 🃄 🃅 🃆 🃇 🃈 🃉 🃊 🃋 🃍 🃎 🃑 🃒 🃓 🃔 🃕 🃖 🃗 🃘 🃙 🃚 🃛 🃝 🃞 🃟

Also Cards

♠︎♣︎♡♢

Four symbols can be used to represent DNA

A, C, G, T

Coins

Heads, Tails, also written as H, T

Doing the experiment is easy

just throw the dice

Simulating the experiment in R

We know how to represent outcomes

  • Coin: c("H","T")
  • Dice: 1:6
  • Cards/DNA: c("a","c","g","t")
  • Capital Letters: LETTERS
  • Small Letters: letters
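
As a minimal sketch, the sets listed above can be stored in variables and inspected. The names coin, die and dna are chosen here for illustration only.

```r
# Sets of possible outcomes, stored as plain R vectors.
# The names coin, die and dna are illustrative, not required by R.
coin <- c("H", "T")
die  <- 1:6
dna  <- c("a", "c", "g", "t")

# length() tells us how many different outcomes each device has
length(coin)      # 2
length(die)       # 6
length(dna)       # 4
length(LETTERS)   # 26
```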

Please take a sample()

Try this

sample(LETTERS)
 [1] "E" "I" "B" "Q" "A" "L" "H" "V" "K" "Z" "T"
[12] "U" "J" "D" "N" "P" "O" "G" "C" "F" "M" "X"
[23] "S" "W" "Y" "R"
sample(LETTERS)
 [1] "G" "I" "K" "A" "X" "L" "D" "Q" "O" "S" "N"
[12] "P" "U" "R" "Y" "V" "E" "B" "H" "C" "M" "F"
[23] "J" "Z" "T" "W"

sample() is shuffling

The output has the same elements as the input but in a different order

Each element appears only once

The order changes every time
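
These three properties can be checked directly. The sketch below uses the base R functions setequal() and anyDuplicated(); the variable names are illustrative.

```r
# Shuffle the letters twice
first  <- sample(LETTERS)
second <- sample(LETTERS)

# Same elements as the input: compared as sets, the shuffle equals LETTERS
setequal(first, LETTERS)    # TRUE

# Each element appears only once: anyDuplicated() returns 0 when
# there are no repeated values
anyDuplicated(first)        # 0

# The order changes every time: the two shuffles almost surely differ
identical(first, second)
```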

Set of possible outcomes

To use sample() we must give it a set of possible outcomes

(In math this is called Ω, for short)

The result of sample() is called an outcome or realization

Choosing the sample size

Try this

sample(LETTERS, size=10)
 [1] "Q" "M" "V" "L" "O" "F" "W" "E" "Y" "J"

We get 10 letters

Some, but not all possible outcomes

Each outcome appears only once

Sampling many times

Try this

sample(c("A","C","G","T"), size=10)
Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'

Problem: We run out of outcomes. What can we do?

Sampling with replacement

Try this

sample(c("A", "C", "G", "T"), size=10, replace=TRUE)
 [1] "T" "A" "T" "C" "C" "C" "T" "G" "G" "G"

Each element can appear several times

Shuffle, take one, put it back in the set

Most of the time we will use sample() with replace=TRUE
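
The same idea works for the other devices. A small sketch, with illustrative variable names, that throws one die ten times and flips a coin twenty times:

```r
# Ten throws of one die: the same face can come up more than once
throws <- sample(1:6, size = 10, replace = TRUE)
throws

# Twenty coin flips: only two outcomes, so repeats are unavoidable
flips <- sample(c("H", "T"), size = 20, replace = TRUE)
flips
```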

What if G+C > A+T?

Different proportions of each outcome

If there are more G and C than A and T we can try

sample(c("A", "C", "C", "G", "G", "T"), size=10, replace=TRUE)
 [1] "C" "A" "G" "G" "G" "C" "T" "A" "T" "G"

but this becomes hard if the proportions are not a nice fraction
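
When the counts are small whole numbers, the repeated set does not have to be typed by hand. The sketch below builds it with rep(); the names bases, counts and urn are illustrative.

```r
# Build the set "A","C","C","G","G","T" from counts
bases  <- c("A", "C", "G", "T")
counts <- c(1, 2, 2, 1)              # how many copies of each base
urn    <- rep(bases, times = counts)
urn                                  # "A" "C" "C" "G" "G" "T"

# Sample from the enlarged set, as before
sample(urn, size = 10, replace = TRUE)
```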

Choosing the proportions

If there are more G and C than A and T we can try

sample(c("A", "C", "G", "T"), prob=c(1, 2, 2, 1)/6, 
        size=10, replace=TRUE)
 [1] "T" "C" "T" "C" "C" "C" "G" "T" "C" "T"

The input prob= must be a vector of the same size as the set of outcomes

The proportions should sum to 1
(if they do not, sample() rescales them for us)
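
A short sketch of both options: dividing the weights by their sum ourselves, or passing the raw weights and letting sample() rescale them. The names bases, weights, probs and draws are illustrative.

```r
bases   <- c("A", "C", "G", "T")
weights <- c(1, 2, 2, 1)

# Option 1: turn the weights into proportions that sum to 1
probs <- weights / sum(weights)
probs
sum(probs)

# Option 2: pass the weights directly; sample() rescales them
draws <- sample(bases, size = 6000, replace = TRUE, prob = weights)

# The observed proportions should be close to probs
table(draws) / length(draws)
```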

Probabilities are the proportions

Each experiment or random system may have some preferred outcomes

All outcomes are possible, but some are more probable than others

We want to know the probabilities of each outcome

Sampling with and without replacement

We have three cases

  • The set of possible outcomes is small and we do not replace
    • each outcome changes the proportions a lot
  • The set of possible outcomes is large and we do not replace
    • each outcome changes the proportions very little
  • The set of possible outcomes has any size and we replace
    • outcomes do not change the proportions

Sampling with replacement represents a very large population

and that is why we use replace=TRUE
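
The three cases can be compared with a little arithmetic. This sketch assumes a population with the same number of copies of each of the four bases, and asks what proportion of G is left after one G is taken out. The function name prop_g_after_draw is made up for this example.

```r
# Proportion of "G" left after one "G" is removed without replacement.
# n_each is the number of copies of each of the four bases.
prop_g_after_draw <- function(n_each) {
  (n_each - 1) / (4 * n_each - 1)
}

prop_g_after_draw(1)      # 0: small set, the proportion changes a lot
prop_g_after_draw(1000)   # about 0.2498: large set, almost no change

# With replacement the "G" goes back, so the proportion stays at 1/4
```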

Is there a pattern in the result?

Each result is different

table(sample(c("A","C","G","T"), size=40, replace=TRUE))

 A  C  G  T 
 9  7 12 12 
table(sample(c("A","C","G","T"), size=40, replace=TRUE))

 A  C  G  T 
 8  5 13 14 
table(sample(c("A","C","G","T"), size=40, replace=TRUE))

 A  C  G  T 
 8 11 11 10 

If size increases, frequencies increase

table(sample(c("A","C","G","T"), size=400, replace=TRUE))

  A   C   G   T 
 91 103  93 113 
table(sample(c("A","C","G","T"), size=400, replace=TRUE))

  A   C   G   T 
107  93 109  91 
table(sample(c("A","C","G","T"), size=400, replace=TRUE))

  A   C   G   T 
 98  94  98 110 

Larger size, frequencies change less

table(sample(c("A","C","G","T"), size=4000, replace=TRUE))

   A    C    G    T 
1021  976  977 1026 
table(sample(c("A","C","G","T"), size=4000, replace=TRUE))

   A    C    G    T 
1020  984 1022  974 
table(sample(c("A","C","G","T"), size=4000, replace=TRUE))

   A    C    G    T 
 995 1052  986  967 

Each frequency is very close to 1/4 of size

table(sample(c("A","C","G","T"), size=40000, replace=TRUE))

    A     C     G     T 
 9991 10022 10004  9983 
table(sample(c("A","C","G","T"), size=40000, replace=TRUE))

    A     C     G     T 
10172  9980  9854  9994 
table(sample(c("A","C","G","T"), size=40000, replace=TRUE))

    A     C     G     T 
10074  9923 10012  9991 

Is there a pattern in the result?

table(sample(c("A","C","G","T"), size=400000, replace=TRUE))

     A      C      G      T 
 99750 100468 100031  99751 
table(sample(c("A","C","G","T"), size=400000, replace=TRUE))

     A      C      G      T 
100002  99875 100449  99674 
table(sample(c("A","C","G","T"), size=400000, replace=TRUE))

     A      C      G      T 
 99953 100445  99751  99851 

If size increases a lot, the relative frequencies get very close to 1/4 each

The sum of Relative Frequencies is 1

Absolute frequency is how many times we see each value
The sum of all absolute frequencies is the Total number of cases

Relative frequency is the proportion of each value in the total
The sum of all relative frequencies is always 1.

  • Relative frequency = Absolute frequency/Total number of cases
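
A minimal sketch of this formula in R, with illustrative variable names. The base R function prop.table() does the same division.

```r
draws <- sample(c("A", "C", "G", "T"), size = 400, replace = TRUE)

# Absolute frequency: how many times we see each value
abs_freq <- table(draws)
sum(abs_freq)              # 400, the total number of cases

# Relative frequency = absolute frequency / total number of cases
rel_freq <- abs_freq / sum(abs_freq)
rel_freq
sum(rel_freq)              # 1

# prop.table() gives the same relative frequencies
prop.table(abs_freq)
```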

Relative Frequencies in our example

table(sample(c("a", "c", "g", "t"), size=1000000, replace=TRUE))/1000000

       a        c        g        t 
0.249790 0.249719 0.250288 0.250203 

What is the final relative frequency?

What will be each relative frequency when size is BIG?

We will see that the relative frequency will converge to the probability

In complex systems we do not know each probability

But we can estimate it using the relative frequency

That is what we will do in this course
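
The convergence can be watched in a single run. This sketch repeats the experiment for growing sizes and prints how far the worst relative frequency is from 1/4. The sizes are the ones used in the tables above; the variable names are illustrative.

```r
bases <- c("A", "C", "G", "T")
sizes <- c(40, 400, 4000, 40000, 400000)

for (n in sizes) {
  # Relative frequencies for one experiment of size n
  rel_freq <- table(sample(bases, size = n, replace = TRUE)) / n
  # Largest distance between a relative frequency and 1/4
  worst <- max(abs(rel_freq - 1/4))
  cat("size =", format(n, scientific = FALSE),
      " largest distance from 1/4 =", round(worst, 4), "\n")
}
```

The printed distances change from run to run, but they tend to shrink as size grows.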