April 13, 2018

## Sampling with replacement

Try this

sample(c("H","T"), size=10, replace=TRUE)
  "T" "T" "H" "T" "H" "H" "H" "H" "H" "H"

Each element can appear several times

Shuffle, take one, put it back in the set

Most of the time we will use sample() with replace=TRUE
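To see why replace=TRUE matters, here is a small sketch. The names coin, with_repl and no_repl are made up for this example. The set has only two elements, so ten draws are only possible if each element goes back after it is drawn.

```r
coin <- c("H", "T")

# With replacement: each draw puts the element back, so 10 draws are fine
with_repl <- sample(coin, size = 10, replace = TRUE)
length(with_repl)

# Without replacement: the set has only 2 elements, so asking for 10 fails.
# tryCatch() turns the error into a message instead of stopping the script.
no_repl <- tryCatch(
  sample(coin, size = 10, replace = FALSE),
  error = function(e) "error: the set is too small"
)
no_repl
```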

## Is there a pattern in the result?

table(sample(c("a","c","g","t"), size=40, replace=TRUE))
 a  c  g  t
 9  9 12 10
table(sample(c("a","c","g","t"), size=40, replace=TRUE))
 a  c  g  t
10 16  6  8 
table(sample(c("a","c","g","t"), size=40, replace=TRUE))
 a  c  g  t
 8 10 13  9

Each result is different

## Is there a pattern in the result?

table(sample(c("a","c","g","t"), size=400, replace=TRUE))
  a   c   g   t
102 103 106  89 
table(sample(c("a","c","g","t"), size=400, replace=TRUE))
  a   c   g   t
 89 103  99 109
table(sample(c("a","c","g","t"), size=400, replace=TRUE))
  a   c   g   t
102 110  86 102 

When size increases, the frequency of each letter also increases

## Is there a pattern in the result?

table(sample(c("a","c","g","t"), size=4000, replace=TRUE))
   a    c    g    t
1054  972  994  980 
table(sample(c("a","c","g","t"), size=4000, replace=TRUE))
   a    c    g    t
1031 1076  974  919 
table(sample(c("a","c","g","t"), size=4000, replace=TRUE))
   a    c    g    t
1050 1012  994  944 

When size increases, the frequencies change less

## Is there a pattern in the result?

table(sample(c("a","c","g","t"), size=40000, replace=TRUE))
    a     c     g     t
10011 10009 10027  9953 
table(sample(c("a","c","g","t"), size=40000, replace=TRUE))
    a     c     g     t
 9983 10025  9878 10114
table(sample(c("a","c","g","t"), size=40000, replace=TRUE))
    a     c     g     t
10012  9954 10092  9942 

Each frequency is very close to 1/4 of size

## Is there a pattern in the result?

table(sample(c("a","c","g","t"), size=400000, replace=TRUE))
     a      c      g      t
 99546 100646  99717 100091
table(sample(c("a","c","g","t"), size=400000, replace=TRUE))
     a      c      g      t
100154 100070  99835  99941 
table(sample(c("a","c","g","t"), size=400000, replace=TRUE))
     a      c      g      t
 99841  99757 100285 100117

If size increases a lot, the relative frequencies get very close to 1/4 each

## The sum of Relative Frequencies is 1

• Absolute frequency is how many times we see each value
• The sum of all absolute frequencies is the Number of cases

• Relative frequency is the proportion of each value in the total
• The sum of all relative frequencies is always 1.

• Relative frequency = Absolute frequency/Number of cases
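A minimal sketch of these definitions in R. The variable names n, draws, abs_freq and rel_freq are chosen only for this example.

```r
n <- 1000
draws <- sample(c("a", "c", "g", "t"), size = n, replace = TRUE)

abs_freq <- table(draws)      # how many times we see each letter
rel_freq <- abs_freq / n      # proportion of each letter in the total

sum(abs_freq)                 # the number of cases, n
sum(rel_freq)                 # 1, up to rounding error
```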

## Relative Frequencies in our example

table(sample(c("a", "c", "g", "t"), size=1000000, replace=TRUE))/1000000
       a        c        g        t
0.250086 0.249995 0.250020 0.249899 

## What is the final relative frequency?

What will each relative frequency be when size is BIG?

In this case we can find it by thinking

• There are 4 possible outcomes, and they are symmetrical
• There is no reason to prefer one outcome to the other
• Therefore all relative frequencies must be equal
• Therefore each one must be 1/4

This ideal relative frequency is called Probability

## Probabilities

Each device or random system may have some preferred outcomes

All outcomes are possible, but some can be more probable than others

In general we do not know each probability

But we can estimate it using the relative frequency

That is what we will do in this course
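As an illustration, we can simulate a system with a preferred outcome and then estimate its probabilities. The unfair coin below and its 70% / 30% values are invented for this sketch; sample() accepts them through its prob argument.

```r
# Assumption: a made-up unfair coin that shows "H" 70% of the time
true_prob <- c(H = 0.7, T = 0.3)

n <- 100000
throws <- sample(names(true_prob), size = n, replace = TRUE, prob = true_prob)

# Relative frequencies estimate the probabilities we pretend not to know
estimate <- table(throws) / n
estimate
```

In a real experiment we would not have true_prob; we would only have the throws and the estimate.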

## Population

A population is a very, very big set of things

• All people in Turkey
• All humans living today
• All humans that ever lived
• All living organisms in the past and in the future
• All the infinite possible results of an experiment

You can assume that population size is infinite

## Repeating an experiment generates a population

We can do and redo an experiment forever

(we just need a lot of money and grandchildren)

We can throw a die 🎲 forever

All the results are a population

For example

⚀ ⚁ ⚂ ⚃ ⚄ ⚅

A, C, G, T

## Different outcomes may have different proportions

For example

A, C, C, G, G, T

A, A, C, C, C, G, G, G, T

## These proportions are called probabilities

For A, C, C, G, G, T the proportions are

P(A)=1/6, P(C)=1/3, P(G)=1/3, P(T)=1/6

What about A, A, C, C, C, G, G, G, T?
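One way to check proportions like these is to count each letter and divide by the total. This is only a sketch; pop1 and pop2 are names invented here for the two example populations.

```r
pop1 <- c("A", "C", "C", "G", "G", "T")
pop2 <- c("A", "A", "C", "C", "C", "G", "G", "G", "T")

# Proportion of each letter = count / total
table(pop1) / length(pop1)   # 1/6, 1/3, 1/3, 1/6
table(pop2) / length(pop2)

# Sampling from the vector respects these proportions
table(sample(pop2, size = 9000, replace = TRUE)) / 9000
```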

## But we cannot see all the population

Normally it is easy to know the possible outcomes

Normally it is hard to know the probabilities

Knowing the probabilities is knowing the population

Probabilities describe what we know about the population

## We want to know the population

If we know the probabilities, we know something about

• All people in Turkey
• All humans living today
• All humans that ever lived
• All living organisms in the past and in the future

This is why we do Science

## A sample is not the population

If we do a single experiment, we learn about the experiment

But we do not learn the truth about the population

We need several experiments to learn about the population

For example someone can say

“My grandpa smoked and lived 102 years”

Does that mean that smoking is healthy?

## Why medicine is not science

Each patient is an individual case

Scientific knowledge is useful for medicine

But medicine is about healing each one

Science is about everybody, not each one

## How do we find the probabilities

There are two different ways of figuring out probabilities

• We can use the brain, pen and paper
  • That is, we can do math
• We can use our hands, computers and tools
  • That is, we can do experiments

Here we will (mostly) use the second way

## Experiments produce Samples

Each experiment gives us some outcomes

They are random but connected to the population

A Sample is a small part of the population

• finite number
• changes every time
• if we take two samples they will be different
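A quick sketch of the last point: two samples taken the same way from the same population. The names letters4, sample_1 and sample_2 are made up for this example.

```r
letters4 <- c("a", "c", "g", "t")
sample_1 <- sample(letters4, size = 40, replace = TRUE)
sample_2 <- sample(letters4, size = 40, replace = TRUE)

table(sample_1)
table(sample_2)

# TRUE only if both samples are the same, letter by letter (very unlikely)
identical(sample_1, sample_2)
```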

## Samples give us Empirical Frequencies

Some people even say “empirical probabilities”

table(sample(c("a","c","g","t"), size=40, replace=TRUE))/40
    a     c     g     t
0.250 0.225 0.250 0.275 
table(sample(c("a","c","g","t"), size=4000, replace=TRUE))/4000
      a       c       g       t
0.25425 0.24550 0.25000 0.25025 
table(sample(c("a","c","g","t"), size=400000, replace=TRUE))/400000
        a         c         g         t
0.2502300 0.2501450 0.2496825 0.2499425 

## Empirical Frequencies are close to Probabilities

The results of our experiments give us empirical frequencies

They are close to 1/4, the theoretical probabilities

When size is bigger, the empirical frequencies are closer and closer to the real probabilities

We know for sure that when size grows we will get the probabilities

But size has to be really big

How can we be really sure that when size grows we will get the probabilities?

How do we know?

We know because people have proven a theorem

It is called the Law of Large Numbers
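We can watch this happen with a small experiment. The sketch below measures, for several sizes, the largest distance between an empirical frequency and 1/4. The helper max_error() is a name invented here, not a standard R function.

```r
sizes <- c(40, 400, 4000, 40000, 400000)

# Largest distance between an empirical frequency and 1/4, for one sample
max_error <- function(size) {
  freq <- table(sample(c("a", "c", "g", "t"), size = size, replace = TRUE)) / size
  max(abs(freq - 1/4))
}

errors <- sapply(sizes, max_error)
data.frame(size = sizes, max_error = errors)
```

The error tends to shrink as size grows, although it does not have to shrink at every single step, because each sample is random.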

## Theorems are Eternal Truth

Mathematics is not really about numbers

It is about finding the logical consequences of what we know

But it is all in our mind

Experiments give us Nature without Truth

Math gives us Truth without Nature

Science gives us Truth about Nature

## Samples are connected to populations

The main consequence of the Law of Large Numbers is

Samples tell us something about populations

Therefore we can learn about populations if we do experiments

In our course, an experiment means sample(x, size, replace=TRUE)

## In most cases the outcomes can be anything

• Nucleotides:
sample(c("a","c","g","t"), size, replace=TRUE)
• Amino-acids:
sample(seqinr::a(), size, replace=TRUE)
• Alleles:
sample(c("AA","Aa","aa"), size, replace=TRUE)

## Sometimes the outcomes are numbers

In that case instead of writing

sample(1:100, size, replace=TRUE)

we can write

sample.int(100, size, replace=TRUE)

(replace 100 by any natural number)
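A small check that both forms give the same result. The seed value is arbitrary; it is set only so that both calls use the same random numbers.

```r
set.seed(2018)   # assumption: any seed works, it only makes both calls repeatable
a <- sample(1:100, size = 5, replace = TRUE)

set.seed(2018)
b <- sample.int(100, size = 5, replace = TRUE)

identical(a, b)   # TRUE: same random numbers, same result
```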

## Let’s experiment with two dice 🎲 🎲

We throw two dice. What is the sum 🎲 +🎲 ?

dice <- function() {
  return(sample.int(6, size=1, replace=TRUE))
}
dice() + dice()
 7

Try it. What is your result?

## We need to do more experiments

One experiment is meaningless

We need to replicate the experiment

We can use replicate(n, expression)

replicate(15, dice() + dice())
   5 12  7  9  5  4  7  4  6  4  3  7  8  3  5
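As an alternative sketch, assuming we only care about the sums, we can throw all the first dice at once and all the second dice at once, and then add them. The names n, first, second and sums are made up for this example.

```r
n <- 15
first  <- sample.int(6, size = n, replace = TRUE)   # n throws of the first die
second <- sample.int(6, size = n, replace = TRUE)   # n throws of the second die

sums <- first + second   # element by element: one sum per experiment
sums
```

Both approaches describe the same experiment; replicate() is closer to how we think about it, and the vector version is usually faster for large n.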

## Let’s do 200 experiments

replicate(200, dice() + dice())
    7  5  6  4  8  3  7 11  9  8  8  9  7  4  3  8  6  7  5  9  8  7  4  6  6
  9  3  3  7  4  5 10  5 12  2  9  7 10  3  8  7  9  8  6  4  9  5  8  4  8
  5 10  6  9  8  8  3  9  9  8  4 11  8 10  6  8  5  5  9  6  6  5  5  7 10
  6  9  8  3  7 10  3  9 11  8  6  2  4 10  4  7  7  8  6 10  7  9  7  8  6
 10  6  7  9  5 10  8  3  6  7  9  6  4 11  8  3  7  6 11  5  8  4  7  7 11
  2  7  8  3  5  7  8  7  8 10  4  2  6  4  4  6  2  5 11  6 10  6  9  7  9
  5  7  8  8  9  3  9  9  8  7  7  8  3  8  5 11  8  7  4  3  4  7  7 10 10
  9  2  9 11  7  9  6  8  3  5  7  7  3 10  6  7  6  6  8  8  7  3  9  7  4

## We can calculate the frequency

table(replicate(200, dice() + dice()))
 2  3  4  5  6  7  8  9 10 11 12
 5 10 22 25 28 30 18 22 21 11  8
table(replicate(200, dice() + dice()))/200
    2     3     4     5     6     7     8     9    10    11    12
0.045 0.055 0.115 0.115 0.120 0.140 0.135 0.105 0.080 0.045 0.045 

## We can even make a plot

barplot(table(replicate(200, dice() + dice()))/200)

## What is the probability of dice() + dice()=7

prob <- table(replicate(200, dice() + dice()))/200
prob["7"]
   7
0.22 
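For two fair dice we can also find this probability by counting, and compare it with a bigger experiment. This is a sketch under the assumption that both dice are fair, so the 36 combinations are equally likely; the names all_sums, exact and throws are invented here.

```r
# All 36 equally likely combinations of two fair dice
all_sums <- outer(1:6, 1:6, "+")
exact <- table(all_sums) / 36
exact["7"]          # 6/36 = 1/6, about 0.167

# Empirical frequency from many more throws than 200
n <- 100000
throws <- sample.int(6, n, replace = TRUE) + sample.int(6, n, replace = TRUE)
mean(throws == 7)
```

With only 200 experiments the empirical frequency can be noticeably far from 1/6; with more experiments it gets closer.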

## Notice that prob["7"] is not prob[7]

Remember that we can use text and numbers as indices.

Here prob is

prob
    2     3     4     5     6     7     8     9    10    11    12
0.020 0.055 0.080 0.095 0.105 0.220 0.180 0.095 0.060 0.070 0.020 

What is prob[7]? Be careful
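Indexing by name and indexing by position are different things. The sketch below rebuilds the prob values shown above by hand, as a plain named vector, to make the difference visible.

```r
# The prob values shown above, typed in by hand with their names
prob <- c("2" = 0.020, "3" = 0.055, "4" = 0.080, "5" = 0.095, "6" = 0.105,
          "7" = 0.220, "8" = 0.180, "9" = 0.095, "10" = 0.060, "11" = 0.070,
          "12" = 0.020)

prob["7"]   # the element NAMED "7": 0.220
prob[7]     # the element in POSITION 7, whose name is "8": 0.180
```

The names start at "2", so the position of an element is always one less than the sum it represents.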

## In summary

• Repeating an experiment generates a population
• Probabilities describe our knowledge about the population
• Experiments produce Samples
• Samples give us Empirical Frequencies
• Empirical Frequencies are close to Probabilities
• We can use the computer to get Empirical Frequencies