April 13, 2018

Try this

sample(c("H","T"), size=10, replace=TRUE)

[1] "T" "T" "H" "T" "H" "H" "H" "H" "H" "H"

Each element can appear several times

Shuffle, take one, put it back in the set

Most of the time we will use `sample()` with `replace=TRUE`
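A minimal sketch of what `replace=TRUE` changes, using the same coin as before. The variable names are made up for this illustration.

```r
coin <- c("H", "T")

# With replacement: each draw is put back, so values can repeat
# and size can be larger than the number of elements
with_replacement <- sample(coin, size=10, replace=TRUE)
length(with_replacement)

# Without replacement (the default): each element is used at most once
without_replacement <- sample(coin, size=2)
sort(without_replacement)

# Asking for more draws than elements without replacement is an error
result <- tryCatch(sample(coin, size=10), error=function(e) "error")
result
```

This is why the coin example needs `replace=TRUE`: ten draws from a set of two elements are only possible if each element can appear several times.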

table(sample(c("a","c","g","t"), size=40, replace=TRUE))

 a  c  g  t
 9  9 12 10

table(sample(c("a","c","g","t"), size=40, replace=TRUE))

 a  c  g  t
10 16  6  8

table(sample(c("a","c","g","t"), size=40, replace=TRUE))

 a  c  g  t
 8 10 13  9

Each result is different

table(sample(c("a","c","g","t"), size=400, replace=TRUE))

  a   c   g   t
102 103 106  89

table(sample(c("a","c","g","t"), size=400, replace=TRUE))

  a   c   g   t
 89 103  99 109

table(sample(c("a","c","g","t"), size=400, replace=TRUE))

  a   c   g   t
102 110  86 102

When `size` increases, the frequency of each letter also increases

table(sample(c("a","c","g","t"), size=4000, replace=TRUE))

   a    c    g    t
1054  972  994  980

table(sample(c("a","c","g","t"), size=4000, replace=TRUE))

   a    c    g    t
1031 1076  974  919

table(sample(c("a","c","g","t"), size=4000, replace=TRUE))

   a    c    g    t
1050 1012  994  944

When `size` increases, the frequencies change less

table(sample(c("a","c","g","t"), size=40000, replace=TRUE))

    a     c     g     t
10011 10009 10027  9953

table(sample(c("a","c","g","t"), size=40000, replace=TRUE))

    a     c     g     t
 9983 10025  9878 10114

table(sample(c("a","c","g","t"), size=40000, replace=TRUE))

    a     c     g     t
10012  9954 10092  9942

Each frequency is very close to 1/4 of `size`

table(sample(c("a","c","g","t"), size=400000, replace=TRUE))

     a      c      g      t
 99546 100646  99717 100091

table(sample(c("a","c","g","t"), size=400000, replace=TRUE))

     a      c      g      t
100154 100070  99835  99941

table(sample(c("a","c","g","t"), size=400000, replace=TRUE))

     a      c      g      t
 99841  99757 100285 100117

If `size` increases a lot, the *relative frequencies* get very close to 1/4 each

- **Absolute frequency** is **how many** times we see each value
- The sum of all **absolute frequencies** is the **Number of cases**
- **Relative frequency** is **the proportion** of each value in the total
- The sum of all **relative frequencies** is always 1.
- *Relative frequency = Absolute frequency/Number of cases*

table(sample(c("a", "c", "g", "t"), size=1000000, replace=TRUE))/1000000

       a        c        g        t
0.250086 0.249995 0.250020 0.249899
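The relation between absolute and relative frequencies can also be checked step by step. This is only a sketch; the names `absolute`, `n_cases` and `relative` are chosen for this example.

```r
x <- sample(c("a", "c", "g", "t"), size=1000, replace=TRUE)

absolute <- table(x)            # how many times we see each value
n_cases <- length(x)            # number of cases
relative <- absolute / n_cases  # proportion of each value in the total

sum(absolute)   # equals n_cases
sum(relative)   # equals 1 (up to rounding)
```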

What will each relative frequency be when `size` is BIG?

In this case we can find it by **thinking**

- There are 4 possible outcomes, and they are *symmetrical*
- There is no reason to prefer one outcome to the other
- Therefore all relative frequencies must be equal
- Therefore each one must be 1/4

This *ideal* relative frequency is called **Probability**

Each *device* or *random system* may have some preferred outcomes

All *outcomes* are possible, but some can be more *probable* than others

In general we do not know each probability

But we can *estimate* it using the relative frequency
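As an illustration, here is a sketch of a device with a preferred outcome. The biased coin and its probability 0.7 are assumptions invented for this example; they use the `prob` argument of `sample()`.

```r
# Hypothetical biased coin: "H" has probability 0.7 (an assumed value)
biased_coin <- function(size) {
    sample(c("H", "T"), size=size, replace=TRUE, prob=c(0.7, 0.3))
}

# We pretend not to know 0.7 and estimate it from the relative frequency
throws <- biased_coin(10000)
estimate <- table(throws) / length(throws)
estimate
```

The relative frequency of "H" should land near 0.7, but it will be a little different every time.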

**That is what we will do in this course**


*Population* is a very very big set of **things**

- All people in Turkey
- All humans living today
- All humans that ever lived
- All living organisms in the past and in the future
- All the infinite possible results of an experiment

You can assume that *population size* is *infinite*

We can do and redo an experiment for ever

(we just need a lot of money and grandchildren)

We can throw a dice 🎲 forever

All the results are a *population*

For example

⚀ ⚁ ⚂ ⚃ ⚄ ⚅

A, C, G, T

For example

A, C, C, G, G, T

A, A, C, C, C, G, G, G, T

For A, C, C, G, G, T the proportions are

P(A)=1/6, P(C)=1/3, P(G)=1/3, P(T)=1/6

What about A, A, C, C, C, G, G, G, T?
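One way to check your answer is to let R count. This is a small sketch where the population is typed by hand and `proportions` is a name chosen for the example.

```r
population <- c("A", "A", "C", "C", "C", "G", "G", "G", "T")

# Proportion of each letter = count / total number of elements
proportions <- table(population) / length(population)
proportions
```

There are 9 elements, so the proportions are P(A)=2/9, P(C)=3/9, P(G)=3/9, P(T)=1/9.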

Normally it is easy to know the possible *outcomes*

Normally it is hard to know the *probabilities*

Knowing the probabilities is knowing the population

*Probabilities* describe what we **know** about the population

If we know the probabilities, we know something about

- All people in Turkey
- All humans living today
- All humans that ever lived
- All living organisms in the past and in the future

This is why we do **Science**

If we make a single experiment, we learn about the experiment

But we do not learn **the truth** about the population

We need several experiments to learn about the population

For example someone can say

“My grandpa smoked and lived 102 years”

Does that mean that smoking is healthy?

Medicine cares about people. Science cares about knowledge

Each patient is an individual case

Scientific knowledge is useful for medicine

But medicine is about healing **each one**

Science is about **everybody**, not **each one**

There are **two** different ways of figuring out *probabilities*

- We can use the brain, pen and paper
  - That is, we can do *math*
- We can use our hands, computers and tools
  - That is, we can do *experiments*

Here we will (mostly) use the second way

Each experiment gives us some *outcomes*

They are *random* but connected to the population

A *Sample* is a small part of the population

- finite number
- changes every time
- if we take two samples they will be different

Some people even say “empirical probabilities”

table(sample(c("a","c","g","t"), size=40, replace=TRUE))/40

    a     c     g     t
0.250 0.225 0.250 0.275

table(sample(c("a","c","g","t"), size=4000, replace=TRUE))/4000

      a       c       g       t
0.25425 0.24550 0.25000 0.25025

table(sample(c("a","c","g","t"), size=400000, replace=TRUE))/400000

        a         c         g         t
0.2502300 0.2501450 0.2496825 0.2499425

The results of our experiments give us *empirical frequencies*

They are close to 1/4, the theoretical probabilities

When `size` is bigger, the empirical frequencies are closer and closer to the real probabilities

We know **for sure** that when `size` grows, the empirical frequencies approach the probabilities

But `size` has to be **really big**

How can we be really sure that when `size` grows we will get the probabilities?

How do we know?

We know because people have proven a *Theorem*

It is called the *Law of Large Numbers*
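The theorem itself is proven with math, not with code, but we can watch its effect. The sketch below measures how far the empirical frequencies are from 1/4 for three sample sizes. The helper `max_deviation()` is hypothetical, written only for this illustration.

```r
sizes <- c(40, 4000, 400000)

# Largest distance between an empirical frequency and 1/4, for one sample
max_deviation <- function(size) {
    letters4 <- c("a", "c", "g", "t")
    x <- sample(letters4, size=size, replace=TRUE)
    # factor() keeps all four letters even if one never appears
    freq <- table(factor(x, levels=letters4)) / size
    max(abs(freq - 1/4))
}

deviations <- sapply(sizes, max_deviation)
names(deviations) <- sizes
deviations
```

Each run gives different numbers, but the deviation for the largest `size` should be much smaller than for the smallest one.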

Mathematics is not really about numbers

Mathematics is about theorems

Finding the logical consequences of what we know

But it is all in our mind

Experiments give us Nature without Truth

Math gives us Truth without Nature

Science gives us Truth about Nature

The main consequence of the *Law of Large Numbers* is

Samples tell us something about populations

Therefore we can learn about *populations* if we do **experiments**

In our course **experiment** means `sample(x, size, replace=TRUE)`

- Nucleotides:

sample(c("a","c","g","t"), size, replace=TRUE)

- Amino-acids:

sample(seqinr::a(), size, replace=TRUE)

- Alleles:

sample(c("AA","Aa","aa"), size, replace=TRUE)

When the outcomes are the numbers from 1 to 100, instead of writing

sample(1:100, size, replace=TRUE)

we can write

sample.int(100, size, replace=TRUE)

(replace 100 by any natural number)
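A quick way to see that the two forms do the same thing is to start both from the same random state. The seed value here is arbitrary and `set.seed()` is used only for this comparison.

```r
set.seed(1)   # arbitrary seed, only so both calls start from the same state
a <- sample(1:100, size=5, replace=TRUE)

set.seed(1)   # reset to the same state before the second call
b <- sample.int(100, size=5, replace=TRUE)

identical(a, b)
```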

We throw two dice. What is the sum 🎲 +🎲 ?

dice <- function() {
    return(sample.int(6, size=1, replace=TRUE))
}
dice() + dice()

[1] 7

Try it. What is your result?

One experiment is meaningless

We need to *replicate* the experiment

We can use `replicate(n, expression)`

replicate(15, dice() + dice())

[1] 5 12 7 9 5 4 7 4 6 4 3 7 8 3 5
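`replicate()` runs the expression again on every repetition, which is what makes each result a new experiment. The sketch below contrasts it with `rep()`, which only copies one result. The variable names are chosen for this example.

```r
dice <- function() {
    return(sample.int(6, size=1, replace=TRUE))
}

# replicate() evaluates the expression 15 times: 15 separate experiments
many_throws <- replicate(15, dice() + dice())

# rep() evaluates the expression once and copies that single value 15 times
one_throw_copied <- rep(dice() + dice(), 15)

length(unique(one_throw_copied))   # always 1
```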

replicate(200, dice() + dice())

[1] 7 5 6 4 8 3 7 11 9 8 8 9 7 4 3 8 6 7 5 9 8 7 4 6 6
[26] 9 3 3 7 4 5 10 5 12 2 9 7 10 3 8 7 9 8 6 4 9 5 8 4 8
[51] 5 10 6 9 8 8 3 9 9 8 4 11 8 10 6 8 5 5 9 6 6 5 5 7 10
[76] 6 9 8 3 7 10 3 9 11 8 6 2 4 10 4 7 7 8 6 10 7 9 7 8 6
[101] 10 6 7 9 5 10 8 3 6 7 9 6 4 11 8 3 7 6 11 5 8 4 7 7 11
[126] 2 7 8 3 5 7 8 7 8 10 4 2 6 4 4 6 2 5 11 6 10 6 9 7 9
[151] 5 7 8 8 9 3 9 9 8 7 7 8 3 8 5 11 8 7 4 3 4 7 7 10 10
[176] 9 2 9 11 7 9 6 8 3 5 7 7 3 10 6 7 6 6 8 8 7 3 9 7 4

table(replicate(200, dice() + dice()))

 2  3  4  5  6  7  8  9 10 11 12
 5 10 22 25 28 30 18 22 21 11  8

table(replicate(200, dice() + dice()))/200

    2     3     4     5     6     7     8     9    10    11    12
0.045 0.055 0.115 0.115 0.120 0.140 0.135 0.105 0.080 0.045 0.045

barplot(table(replicate(200, dice() + dice()))/200)

What is the probability that `dice() + dice()` is 7?

Our **approximate** answer is

prob <- table(replicate(200, dice() + dice()))/200
prob["7"]

   7
0.22
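For comparison, this particular probability can also be found the math way. The sketch below lists all 36 pairs of faces with `outer()`, assuming two fair dice so that every pair is equally likely. The names `sums` and `exact` are chosen for this example.

```r
# All 36 equally likely pairs of faces, assuming two fair dice
sums <- outer(1:6, 1:6, "+")

# Proportion of pairs that give each sum
exact <- table(sums) / 36
exact["7"]   # 6 of the 36 pairs add up to 7, so this is 1/6
```

The exact value is 1/6, about 0.167. An estimate from 200 replicates can be noticeably off, because 200 is still a small sample.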

`prob["7"]` is not `prob[7]`

Remember that we can use text and numbers as indices.

Here `prob` is

prob

    2     3     4     5     6     7     8     9    10    11    12
0.020 0.055 0.080 0.095 0.105 0.220 0.180 0.095 0.060 0.070 0.020

What is `prob[12]`? Be careful
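The difference between indexing by name and by position can be tried on a fixed copy of `prob`. In this sketch the values shown above are typed by hand into a plain named vector, which stands in for the table.

```r
# The values of `prob` shown above, typed by hand so the example is fixed
prob <- c(0.020, 0.055, 0.080, 0.095, 0.105, 0.220,
          0.180, 0.095, 0.060, 0.070, 0.020)
names(prob) <- 2:12

prob["7"]    # the element named "7"
prob[7]      # the 7th element, which is named "8"
prob["12"]   # the element named "12"
prob[12]     # there are only 11 elements, so this is NA
```

The names start at 2, so position and name never match. Text indices look up the name, and number indices look up the position.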

- Repeating an experiment generates a *population*
- *Probabilities* describe our **knowledge** about the *population*
- Experiments produce Samples
- Samples give us *Empirical Frequencies*
- *Empirical Frequencies* are close to *Probabilities*
- We can use the computer to get *Empirical Frequencies*