Class 14: Probabilities

April 13, 2018

Reminder of the previous class

Sampling with replacement

Try this

sample(c("H","T"), size=10, replace=TRUE)

 [1] "T" "T" "H" "T" "H" "H" "H" "H" "H" "H"

Each element can appear several times

Shuffle, take one, put it back in the set

Most of the time we will use sample() with replace=TRUE
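To see why replace=TRUE matters, here is a small sketch comparing both modes. The names coin, with_repl and without_repl are ours, chosen only for illustration.

```r
coin <- c("H", "T")

# With replacement we can draw more elements than the set contains
with_repl <- sample(coin, size=10, replace=TRUE)
length(with_repl)

# Without replacement, asking for 10 elements out of 2 is an error;
# tryCatch() captures the error message instead of stopping the script
without_repl <- tryCatch(
    sample(coin, size=10, replace=FALSE),
    error = function(e) conditionMessage(e)
)
without_repl
```

Without replacement each element can be taken only once, so the sample can never be larger than the set.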

Is there a pattern in the result?

table(sample(c("a","c","g","t"), size=40, replace=TRUE))

 a  c  g  t 
 9  9 12 10

table(sample(c("a","c","g","t"), size=40, replace=TRUE))

 a  c  g  t 
10 16  6  8

table(sample(c("a","c","g","t"), size=40, replace=TRUE))

 a  c  g  t 
 8 10 13  9

Each result is different

Is there a pattern in the result?

table(sample(c("a","c","g","t"), size=400, replace=TRUE))

  a   c   g   t 
102 103 106  89

table(sample(c("a","c","g","t"), size=400, replace=TRUE))

  a   c   g   t 
 89 103  99 109

table(sample(c("a","c","g","t"), size=400, replace=TRUE))

  a   c   g   t 
102 110  86 102

When size increases, the absolute frequency of each letter also increases

Is there a pattern in the result?

table(sample(c("a","c","g","t"), size=4000, replace=TRUE))

   a    c    g    t 
1054  972  994  980

table(sample(c("a","c","g","t"), size=4000, replace=TRUE))

   a    c    g    t 
1031 1076  974  919

table(sample(c("a","c","g","t"), size=4000, replace=TRUE))

   a    c    g    t 
1050 1012  994  944

When size increases, the proportions of each letter change less from one run to the next

Is there a pattern in the result?

table(sample(c("a","c","g","t"), size=40000, replace=TRUE))

    a     c     g     t 
10011 10009 10027  9953

table(sample(c("a","c","g","t"), size=40000, replace=TRUE))

    a     c     g     t 
 9983 10025  9878 10114

table(sample(c("a","c","g","t"), size=40000, replace=TRUE))

    a     c     g     t 
10012  9954 10092  9942

Each frequency is very close to 1/4 of size

Is there a pattern in the result?

table(sample(c("a","c","g","t"), size=400000, replace=TRUE))

     a      c      g      t 
 99546 100646  99717 100091

table(sample(c("a","c","g","t"), size=400000, replace=TRUE))

     a      c      g      t 
100154 100070  99835  99941

table(sample(c("a","c","g","t"), size=400000, replace=TRUE))

     a      c      g      t 
 99841  99757 100285 100117

If size increases a lot, the relative frequencies are very close to 1/4 each

The sum of Relative Frequencies is 1

Absolute frequency is how many times we see each value
The sum of all absolute frequencies is the Number of cases
Relative frequency is the proportion of each value in the total
The sum of all relative frequencies is always 1.
Relative frequency = Absolute frequency/Number of cases
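The definitions above can be checked with a few lines of R. This is only a sketch: the size 40 and the names n, nucleotides, abs_freq and rel_freq are our choices for illustration.

```r
# Draw 40 nucleotides at random (size chosen only for illustration)
n <- 40
nucleotides <- sample(c("a","c","g","t"), size=n, replace=TRUE)

abs_freq <- table(nucleotides)   # how many times we see each value
rel_freq <- abs_freq / n         # proportion of each value in the total

sum(abs_freq)   # the number of cases
sum(rel_freq)   # always 1
```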

Relative Frequencies in our example

table(sample(c("a", "c", "g", "t"), size=1000000, replace=TRUE))/1000000

       a        c        g        t 
0.250086 0.249995 0.250020 0.249899

What is the final relative frequency?

What will each relative frequency be when size is BIG?

In this case we can find it by thinking

There are 4 possible outcomes, and they are symmetrical
There is no reason to prefer one outcome to the other
Therefore all relative frequencies must be equal
Therefore each one must be 1/4

This ideal relative frequency is called Probability

Probabilities

Each device or random system may have some preferred outcomes

All outcomes are possible, but some can be more probable than others

In general we do not know each probability

But we can estimate it using the relative frequency

That is what we will do in this course

Today we will see more definitions

Be sure to understand them

If you see text in italics,
then write it down and Google it

Population

Population is a very very big set of things

All people in Turkey
All humans living today
All humans that ever lived
All living organisms in the past and in the future
All the infinite possible results of an experiment

You can assume that population size is infinite

Repeating an experiment generates a population

We can do and redo an experiment forever

(we just need a lot of money and grandchildren)

We can throw a die 🎲 forever

All the results are a population

The things in the population are outcomes

For example

⚀ ⚁ ⚂ ⚃ ⚄ ⚅

A, C, G, T

Different outcomes may have different proportions

For example

A, C, C, G, G, T

A, A, C, C, C, G, G, G, T

These proportions are called probabilities

For A, C, C, G, G, T the proportions are

P(A)=1/6, P(C)=1/3, P(G)=1/3, P(T)=1/6

What about A, A, C, C, C, G, G, G, T?
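One way to answer this is to count. In the sketch below the population is written as a vector (the names population and p are ours), and each proportion is a count divided by the total.

```r
# The population from the question above, written as a vector
population <- c("A","A","C","C","C","G","G","G","T")

# Proportion of each outcome = count / total number of elements
p <- table(population) / length(population)
p
```

There are 9 elements, so P(A)=2/9, P(C)=3/9, P(G)=3/9, P(T)=1/9.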

The proportion of each outcome in the population is its probability

But we cannot see all the population

Normally it is easy to know the possible outcomes

Normally it is hard to know the probabilities

Knowing the probabilities is knowing the population

Probabilities describe what we know about the population

Probabilities describe our knowledge of the population

We want to know the population

If we know the probabilities, we know something about

All people in Turkey
All humans living today
All humans that ever lived
All living organisms in the past and in the future

This is why we do Science

A sample is not the population

If we do a single experiment, we learn about that experiment

But we do not learn the truth about the population

We need several experiments to learn about the population

For example someone can say

“My grandpa smoked and lived 102 years”

Does that mean that smoking is healthy?

Why medicine is not science

Medicine cares about people. Science cares about knowledge

Each patient is an individual case

Scientific knowledge is useful for medicine

But medicine is about healing each one

Science is about everybody, not each one

Science and Medicine are opposite sides of the same coin

How do we find the probabilities

There are two different ways of figuring out probabilities

We can use the brain, pen and paper
- That is, we can do math
We can use our hands, computers and tools
- That is, we can do experiments

Here we will (mostly) use the second way

Experiments produce Samples

Each experiment gives us some outcomes

They are random but connected to the population

A Sample is a small part of the population

It has a finite number of elements
It changes every time
- if we take two samples they will be different

Samples give us Empirical Frequencies

Some people even say “empirical probabilities”

table(sample(c("a","c","g","t"), size=40, replace=TRUE))/40

    a     c     g     t 
0.250 0.225 0.250 0.275

table(sample(c("a","c","g","t"), size=4000, replace=TRUE))/4000

      a       c       g       t 
0.25425 0.24550 0.25000 0.25025

table(sample(c("a","c","g","t"), size=400000, replace=TRUE))/400000

        a         c         g         t 
0.2502300 0.2501450 0.2496825 0.2499425

Empirical Frequencies are close to Probabilities

The results of our experiments give us empirical frequencies

They are close to 1/4, the theoretical probabilities

When size is bigger, the empirical frequencies are closer and closer to the real probabilities

We know for sure that when size grows we will get the probabilities

But size has to be really big

We are absolutely sure about this

How can we be really sure that when size grows we will get the probabilities?

How do we know?

We know because people have proven a Theorem

It is called the Law of Large Numbers
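We can watch the Law of Large Numbers at work. The sketch below measures the largest distance between an empirical frequency and 1/4 for three sizes. The function max_error and the seed are our assumptions, not part of the course code. A single run is still random, so the errors are not guaranteed to shrink at every step, but for big sizes they become very small.

```r
set.seed(14)   # arbitrary seed, only to make the run repeatable

# Largest distance between an empirical frequency and 1/4
max_error <- function(size) {
    nucleotides <- sample(c("a","c","g","t"), size=size, replace=TRUE)
    # factor() keeps all four letters even if one never appears
    counts <- table(factor(nucleotides, levels=c("a","c","g","t")))
    max(abs(counts/size - 1/4))
}

sizes <- c(40, 4000, 400000)
errors <- sapply(sizes, max_error)
errors
```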

Theorems are Eternal Truth

Mathematics is not really about numbers

Mathematics is about theorems

Finding the logical consequences of what we know

But it is all in our mind

Experiments give us Nature without Truth

Math gives us Truth without Nature

Science gives us Truth about Nature

Samples are connected to populations

The main consequence of the Law of Large Numbers is

Samples tell us something about populations

Therefore we can learn about populations if we do experiments

In our course experiment means sample(x, size, replace=TRUE)

In most cases the outcomes can be anything

Nucleotides

sample(c("a","c","g","t"), size, replace=TRUE)

Amino-acids:

sample(seqinr::a(), size, replace=TRUE)

Alleles:

sample(c("AA","Aa","aa"), size, replace=TRUE)

Sometimes the outcomes are numbers

In that case instead of writing

sample(1:100, size, replace=TRUE)

we can write

sample.int(100, size, replace=TRUE)

(replace 100 by any natural number)
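To check that both ways give the same kind of result, we can fix the random seed before each call. This is a small sketch: the seed value and the names x and y are arbitrary choices of ours.

```r
# Same seed before each call, so both start from the same random state
set.seed(1)
x <- sample(1:100, size=5, replace=TRUE)

set.seed(1)
y <- sample.int(100, size=5, replace=TRUE)

# Compare the two results element by element
identical(x, y)
```

With the same seed both calls should return exactly the same numbers, so the choice between them is a matter of convenience.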

Let’s experiment with two dice 🎲 🎲

We throw two dice. What is the sum 🎲 +🎲 ?

dice <- function() {
    return(sample.int(6, size=1, replace=TRUE))
}
dice() + dice()

[1] 7

Try it. What is your result?

We need to do more experiments

One experiment is meaningless

We need to replicate the experiment

We can use replicate(n, expression)

replicate(15, dice() + dice())

 [1]  5 12  7  9  5  4  7  4  6  4  3  7  8  3  5

Let’s do 200 experiments

replicate(200, dice() + dice())

  [1]  7  5  6  4  8  3  7 11  9  8  8  9  7  4  3  8  6  7  5  9  8  7  4  6  6
 [26]  9  3  3  7  4  5 10  5 12  2  9  7 10  3  8  7  9  8  6  4  9  5  8  4  8
 [51]  5 10  6  9  8  8  3  9  9  8  4 11  8 10  6  8  5  5  9  6  6  5  5  7 10
 [76]  6  9  8  3  7 10  3  9 11  8  6  2  4 10  4  7  7  8  6 10  7  9  7  8  6
[101] 10  6  7  9  5 10  8  3  6  7  9  6  4 11  8  3  7  6 11  5  8  4  7  7 11
[126]  2  7  8  3  5  7  8  7  8 10  4  2  6  4  4  6  2  5 11  6 10  6  9  7  9
[151]  5  7  8  8  9  3  9  9  8  7  7  8  3  8  5 11  8  7  4  3  4  7  7 10 10
[176]  9  2  9 11  7  9  6  8  3  5  7  7  3 10  6  7  6  6  8  8  7  3  9  7  4

We can calculate the frequency

table(replicate(200, dice() + dice()))

 2  3  4  5  6  7  8  9 10 11 12 
 5 10 22 25 28 30 18 22 21 11  8

table(replicate(200, dice() + dice()))/200

    2     3     4     5     6     7     8     9    10    11    12 
0.045 0.055 0.115 0.115 0.120 0.140 0.135 0.105 0.080 0.045 0.045

We can even make a plot

barplot(table(replicate(200, dice() + dice()))/200)

What is the probability of `dice() + dice()=7`?

Our approximate answer is

prob <- table(replicate(200, dice() + dice()))/200
prob["7"]

   7 
0.22
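For two dice we can also find this probability by counting, which is the "math" way of figuring out probabilities. The sketch below assumes two fair, independent dice, so all 36 combinations are equally likely. The names sums and exact are ours.

```r
# All 36 equally likely combinations of two dice, as a 6 x 6 table of sums
sums <- outer(1:6, 1:6, "+")

# Proportion of each sum among the 36 combinations
exact <- table(sums) / 36
exact["7"]   # 6/36, about 0.167
```

The value 0.22 from 200 experiments is an estimate of this 1/6. With more experiments the estimate should get closer.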

Notice that `prob["7"]` is not `prob[7]`

Remember that we can use text and numbers as indices.

Here prob is

prob

    2     3     4     5     6     7     8     9    10    11    12 
0.020 0.055 0.080 0.095 0.105 0.220 0.180 0.095 0.060 0.070 0.020

What is prob[12]? Be careful
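The difference between text and number indices can be seen in the sketch below. To keep it repeatable we assume fair dice and use the exact distribution instead of a simulated prob, so the table always has the 11 names "2" to "12".

```r
# Exact distribution of the sum of two fair dice, used here so the
# table always has the 11 names "2" to "12" (a simulated table may not)
prob <- table(outer(1:6, 1:6, "+")) / 36

prob["7"]         # element whose name is "7": probability of sum 7
prob[7]           # 7th element, whose name is "8"
names(prob)[7]
prob["12"]        # element whose name is "12"
prob[12]          # there is no 12th element, so the result is NA
```

The first element corresponds to the sum 2, not 1, so positions and names do not match.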

In summary

Repeating an experiment generates a population
Probabilities describe our knowledge about the population
Experiments produce Samples
Samples give us Empirical Frequencies
Empirical Frequencies are close to Probabilities
We can use the computer to get Empirical Frequencies
