April 25, 2018

We can **simulate** what a course like ours can look like

epilepsy <- function(size) {
    return(sample(c("Yes", "No"), size = size, replace = TRUE, prob = c(0.01, 0.99)))
}

Here we fix the **real** probability to 1%

Here `courses` is a **matrix**, not a data frame

courses <- replicate(1000, epilepsy(97))
courses[,723] # show replica 723

 [1] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No"
[13] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No"
[25] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No"
[37] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No"
[49] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No"
[61] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "Yes"
[73] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No"
[85] "No" "No" "No" "Yes" "No" "No" "No" "No" "No" "No" "No" "No"
[97] "No"

There are several ways to do this.

For example, we can use `table()`

table(courses[,723])["Yes"]

What happens if there isn’t any `"Yes"`? We get `NA`, and we have to handle it

cases[723] <- table(courses[,723])["Yes"]
if (is.na(cases[723])) {
    cases[723] <- 0
}

`logical`

The best option is to use `logical` vectors instead of `character`

courses[,723]=="Yes"

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[97] FALSE

`logical` can become `numeric`

As we know, R has four **basic data types**: `numeric`, `factor`, `logical`, and `character`. If we mix them, they can change. For example

c(TRUE, FALSE)

[1] TRUE FALSE

c(TRUE, FALSE, 2)

[1] 1 0 2

Mixing `logical` and `numeric` gives numbers

What happens when you mix `factor` and `numeric`?

And mixing `factor` and `character`?

Explain why.

Since `TRUE` becomes 1 and `FALSE` becomes 0, it is easy to count how many times the result was `TRUE`

sum(courses[,723]=="Yes")

[1] 2

Now we just have to do this for each column of `courses` (that is, for each simulation of a course)

cases <- rep(NA, ncol(courses))
for (i in 1:ncol(courses)) {
    cases[i] <- sum(courses[,i] == "Yes")
}
cases

   [1] 1 2 0 1 0 0 1 1 1 1 0 3 0 1 1 3 3 0 1 0 0 1 3 4 2 1 0 0 0 1 1 3 0 1 2 0 3
  [38] 1 0 1 2 0 1 1 1 0 0 1 1 0 1 0 1 2 0 0 0 0 1 1 1 1 0 1 0 3 2 1 1 2 0 1 4 1
  [75] 3 0 1 0 4 0 1 0 0 0 1 0 1 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 3 2 0 1 0 0 1 1
 [112] 2 0 1 0 2 1 2 0 0 4 0 2 1 1 1 0 0 0 1 0 0 0 1 2 1 1 1 1 0 0 1 1 1 1 0 0 1
 [149] 1 0 0 0 3 0 1 3 2 2 1 0 0 1 2 0 1 0 0 1 2 1 3 1 3 1 2 2 3 1 0 1 0 0 2 0 1
 [186] 0 0 0 1 1 1 0 1 1 2 2 0 0 0 1 0 2 0 2 2 0 2 2 1 0 3 0 0 3 0 0 0 1 2 1 0 1
 [223] 0 2 0 0 0 1 2 1 1 2 0 2 3 1 1 1 1 3 0 0 1 2 2 3 0 2 1 1 4 0 2 1 2 0 0 2 0
 [260] 1 1 1 1 2 2 0 2 1 1 1 2 0 0 2 0 2 0 1 0 1 2 1 1 2 1 1 3 1 1 0 0 2 0 1 1 1
 [297] 1 0 0 2 2 1 2 0 0 1 2 3 2 4 1 1 0 0 1 0 2 1 1 2 1 0 1 2 1 2 1 2 1 0 1 1 0
 [334] 2 2 3 1 2 1 1 1 0 2 0 1 0 2 2 1 0 4 1 0 0 0 1 1 2 2 0 1 2 1 2 0 2 1 3 0 0
 [371] 0 1 0 1 3 0 3 1 1 3 4 0 2 0 2 0 1 2 0 1 1 2 1 1 2 2 2 1 1 0 1 1 0 0 1 3 1
 [408] 0 1 3 1 0 3 2 2 0 0 2 1 1 1 0 2 0 0 0 0 0 1 0 0 2 1 2 0 3 1 1 3 0 2 2 1 1
 [445] 1 0 2 1 0 3 1 0 1 1 0 2 2 1 0 1 0 1 0 0 1 1 0 0 2 4 0 0 1 0 0 0 0 1 2 0 0
 [482] 3 1 0 1 1 1 1 0 3 1 0 1 0 1 2 1 2 0 0 2 0 1 0 1 2 0 0 0 0 0 1 0 1 1 0 1 1
 [519] 0 2 1 1 0 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 2 1 2 0 1 2 1 2 3 1 2 1 1 0 0 1 1
 [556] 1 2 2 1 0 2 1 2 0 1 1 0 0 0 3 2 0 3 1 0 2 0 1 0 1 0 0 1 1 3 0 3 1 1 1 2 2
 [593] 0 2 3 0 2 0 0 1 1 0 0 1 2 0 0 0 2 2 1 1 0 2 1 1 0 0 0 2 2 0 2 0 1 0 2 1 4
 [630] 0 0 1 1 1 0 0 2 1 0 0 0 0 3 0 1 1 2 0 0 1 2 2 1 0 2 2 0 1 0 3 1 0 1 1 0 2
 [667] 5 2 2 1 0 2 2 2 0 1 0 1 1 0 1 1 1 3 0 0 1 1 0 0 2 2 0 1 2 1 0 1 2 3 1 0 0
 [704] 2 0 0 1 4 1 2 1 0 1 0 0 2 4 1 1 0 0 0 2 0 2 0 0 2 0 0 0 0 1 3 2 1 1 1 1 1
 [741] 0 0 1 0 1 0 2 2 1 2 2 0 1 2 1 2 0 2 2 2 1 1 2 0 1 1 0 0 2 2 1 0 1 0 2 2 1
 [778] 0 1 0 1 0 0 0 0 1 0 1 2 1 2 1 0 2 1 3 0 3 1 0 1 0 1 1 4 1 0 0 2 0 4 1 0 0
 [815] 1 0 0 0 1 1 0 1 1 1 2 5 3 0 0 1 1 2 2 1 2 0 1 0 0 1 1 2 2 2 0 1 0 0 0 0 2
 [852] 2 0 2 1 1 1 0 2 1 1 1 0 0 3 1 1 1 0 0 0 1 1 0 0 0 0 2 1 0 2 2 3 0 1 0 2 0
 [889] 0 0 0 0 2 1 2 0 1 2 0 0 0 1 0 3 1 1 2 3 1 0 1 0 0 1 1 0 1 0 0 2 1 1 0 0 2
 [926] 1 1 1 0 0 3 1 2 1 1 0 1 2 2 1 0 1 0 0 3 0 1 2 4 1 0 0 1 0 1 3 0 0 1 3 1 1
 [963] 1 0 1 1 0 1 1 1 2 0 1 0 4 0 1 0 0 1 0 0 1 0 0 0 1 0 0 2 1 0 1 1 2 0 1 2 2
[1000] 1
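The explicit `for` loop above works, but the same counts can be computed in one line: comparing the whole matrix with `"Yes"` gives a logical matrix, and `colSums()` adds each column. A self-contained sketch (it re-simulates the data, so the values differ from the run above):

```r
# Re-simulate the courses matrix: 1000 courses of 97 students, 1% probability
set.seed(1)
courses <- replicate(1000, sample(c("Yes", "No"), size = 97,
                                  replace = TRUE, prob = c(0.01, 0.99)))

# One count per column (one per simulated course), with no explicit loop
cases <- colSums(courses == "Yes")
length(cases)   # 1000: one count per course
```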

It is hard to see the **pattern** in the *raw data*

table(cases)

cases
  0   1   2   3   4   5 
380 358 188  56  16   2 

barplot(table(cases))

**What are your results?**

We see that nearly 40% of courses (380 out of 1000) have nobody with epilepsy

Only about a third of the time do we see exactly one person in 97 with epilepsy

How can this be consistent with 1% of the world population? Did we do something wrong?

mean(courses=="Yes")

[1] 0.01006186

No problem here. The overall frequency is ~1%, as expected

We saw the practical consequences of theoretical probabilities

The simulation allows us to see the *meaning* of a population property

These two problems are related:

- **From reality to theory:** we do experiments to learn about the population, in particular about the *probabilities* of each outcome, that is, the *probability distribution*
- **From theory to reality:** assuming some probabilities, we can use the computer to explore what the consequences can be; if the consequences do not match reality, we **have to** change our hypothesis

Many important or interesting questions can be answered *YES* or *NO*

For example:

- Is this medicine improving health?
- Does this sweetener help us to lose weight?
- Is this DNA sequence related to that organism?

cases[723]>0

or directly

sum(courses[,723]=="Yes")>0

Even easier and clearer

any(courses[,723]=="Yes")

We store the official result in the vector `official`

We store our bet into `our_bet`

**If both `official` and `our_bet` are sorted** we do

all(official==our_bet)
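If the two vectors are not guaranteed to be in the same order, a sketch (with made-up numbers) is to sort both before comparing, or to use `setequal()`, which ignores order:

```r
official <- c(5, 12, 23, 31, 40, 47)   # hypothetical winning numbers
our_bet  <- c(47, 5, 31, 12, 40, 23)   # same numbers, different order

all(official == our_bet)               # FALSE: position-by-position comparison
all(sort(official) == sort(our_bet))   # TRUE: compare after sorting both
setequal(official, our_bet)            # TRUE: order does not matter
```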

s <- sum_two_dice()
s == 2 | s == 4 | s == 6 | s == 8 | s == 10 | s == 12

It is better if we use the *remainder*

s %% 2 == 0
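`sum_two_dice()` is used above but not defined in this section; a minimal version could look like this (the definition is our sketch, not from the original):

```r
# Sketch: sum of two fair six-sided dice
sum_two_dice <- function() {
    sum(sample(1:6, size = 2, replace = TRUE))
}

s <- sum_two_dice()
s %% 2 == 0   # TRUE exactly when the sum is even
```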

We normally represent events with a function

event <- function(outcome) {...}

Then we can calculate the empirical frequency of an event

result <- replicate(N, event(simulation(...)))
freq <- sum(result)/N

We have to choose `N` carefully
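Putting the pattern together for the even-sum event of two dice (the helper names `sum_two_dice` and `is_even` are illustrative, not from the original):

```r
# Simulation: sum of two fair dice
sum_two_dice <- function() sum(sample(1:6, size = 2, replace = TRUE))

# Event: the outcome is even
is_even <- function(outcome) outcome %% 2 == 0

set.seed(123)
N <- 10000
result <- replicate(N, is_even(sum_two_dice()))
freq <- sum(result) / N
freq   # close to the real probability, which is 0.5
```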

We do experiments to estimate the value of probabilities

We use the empirical frequencies

There is a lot of variability

How good is our experiment?

We will test different probabilities. Let’s start with 0.5.

p <- 0.5

We can simulate `n` experiments with success probability `p`

empirical_freq <- function(n, p) {
    experiments <- sample(c(TRUE, FALSE), prob = c(p, 1-p), size = n, replace = TRUE)
    return(sum(experiments)/n)
}

N <- c(10, 100, 1000, 10000)
plot(x = range(N), y = c(0, 1), log = "x", type = "n")
for (n in N) {
    y <- replicate(200, empirical_freq(n, p))
    points(x = rep(n, 200), y = y, pch = 19)
}
abline(h = p)

N <- 10^seq(from = 1, to = 4, by = 0.1)
plot(x = range(N), y = c(0, 1), log = "x", type = "n")
for (n in N) {
    y <- replicate(200, empirical_freq(n, p))
    points(x = rep(n, 200), y = y, pch = 19, cex = 0.5)
}
abline(h = p)

- For each `N` we can find intervals containing **most** of the experiments.
- The size of the interval depends on how sure you want to be. It is called a *confidence interval*.

With few data we can be far away from the real probability

If we have more data, then we are probably closer

We can even know how far away.

The *Central Limit Theorem* tells us how big that distance can be

- If the real probability of an event is \(p\)
- and the sample size is \(n\)
- we calculate \(\sigma=\sqrt{p(1-p)/n}\)
- then the **empirical frequency** \(m/n\) is in the range
  - \(p\pm 2\sigma\) with probability 95%
  - \(p\pm 3\sigma\) with probability 99%
  - \(p\pm 5\sigma\) with probability 99.999%
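We can check the 95% claim by simulation (a sketch, not part of the original text): generate many empirical frequencies and count how often they fall inside \(p\pm 2\sigma\).

```r
set.seed(42)
p <- 0.5
n <- 1000
sigma <- sqrt(p * (1 - p) / n)

# 1000 empirical frequencies m/n, each from n experiments
freqs <- replicate(1000, mean(sample(c(TRUE, FALSE), size = n,
                                     replace = TRUE, prob = c(p, 1 - p))))

# Fraction falling inside p +/- 2*sigma: should be close to 95%
mean(abs(freqs - p) <= 2 * sigma)
```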

If we have enough data, the empirical frequency \(m/n\) cannot be far away from the real probability \(p\).

The interval depends on \(p\), which we usually do not know.

If we only know \(m/n\), **how far away can the real probability be**?

We cannot use the same formula. It fails when \(m/n\) is 0.

The result depends on a value \(z\). Let \[\begin{aligned}\tilde{n} &= n + z^2\\
{\tilde {p}}&=\frac{m+z^2/2}{\tilde{n}}=\frac{m+z^2/2}{n+z^2}\end{aligned}\] Then \({\tilde {\sigma}(z)}={\sqrt{\tilde {p}\left(1-{\tilde {p}}\right)/\tilde {n}}}\), and the **real probability** \(p\) is in the range

- \(\tilde {p}\pm 2\cdot\tilde {\sigma}(2)\) with probability 95%
- \(\tilde {p}\pm 3\cdot\tilde {\sigma}(3)\) with probability 99%
- \(\tilde {p}\pm 5\cdot\tilde {\sigma}(5)\) with probability 99.999%

Let \(z=2\). Then \(\tilde{n} = n + 4\) and \({\tilde {p}}=(m+2)/(n + 4).\)

The confidence interval for \(p\) is given by \[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]

Real probability is in this range with 95% confidence.

**This is the formula we have to remember**
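As a sketch, the formula can be wrapped in a small R function (the name `confint95` is ours, not a standard one):

```r
# 95% confidence interval for a probability,
# from m successes in n trials (the formula above, with z = 2)
confint95 <- function(m, n) {
    p_tilde <- (m + 2) / (n + 4)
    s_tilde <- sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    c(lower = p_tilde - 2 * s_tilde, upper = p_tilde + 2 * s_tilde)
}

confint95(2, 97)   # 2 cases among 97 students: the interval contains 0.01
```

Note that it also works when \(m=0\), unlike the plain \(p\pm 2\sigma\) formula.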

- Points are your experimental results (event frequencies)
- Vertical lines represent the calculated 95% intervals
- The horizontal line represents the real probability

- Event: a function from *outcome* to *logical*
- You can use `any(...)`
- You can use `all(...)`
- Event count: `sum(replicate(N, event(...)))`
- Event frequency: `sum(replicate(N, event(...)))/N`

- Real probability is somewhere in \(\tilde {p}\pm 2\cdot\tilde {\sigma}(2)\) with probability 95%

Agresti, Alan; Coull, Brent A. (1998). *“Approximate is better than ‘exact’ for interval estimation of binomial proportions”*. The American Statistician. 52: 119–126. doi:10.2307/2685469. JSTOR 2685469. MR 1628435.