April 25, 2018

We can **simulate** what a course like ours can look like

epilepsy <- function(size) {
    return(sample(c("Yes", "No"), size = size, replace = TRUE, prob = c(0.01, 0.99)))
}

Here we fix the **real** probability to 1%

Here `courses` is a **matrix**, not a data frame

courses <- replicate(1000, epilepsy(97))
courses[,723] # show replica 723

 [1] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No"
[13] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No"
[25] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No"
[37] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No"
[49] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No"
[61] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "Yes"
[73] "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No" "No"
[85] "No" "No" "No" "Yes" "No" "No" "No" "No" "No" "No" "No" "No"
[97] "No"

There are several ways to do this.

For example, we can use `table()`

table(courses[,723])["Yes"]

What happens if there isn’t any `"Yes"`? We get `NA`, and we have to handle it

cases[723] <- table(courses[,723])["Yes"]
if (is.na(cases[723])) {
    cases[723] <- 0
}

`logical`

The best option is to use `logical` vectors instead of `character`

courses[,723]=="Yes"

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[97] FALSE

`logical` can become `numeric`

As we know, R has four **basic data types**: `numeric`, `factor`, `logical`, and `character`. If we mix them, they can change. For example

c(TRUE, FALSE)

[1] TRUE FALSE

c(TRUE, FALSE, 2)

[1] 1 0 2

Mixing `logical` and `numeric` gives numbers

What happens when you mix `factor` and `numeric`?

And mixing `factor` and `character`?

Explain why.

Since `TRUE` becomes 1 and `FALSE` becomes 0, it is easy to count how many times the result was `TRUE`

sum(courses[,723]=="Yes")

[1] 2

Now we just have to do this for each column of `courses` (that is, for each simulation of a course)

cases <- rep(NA, ncol(courses))
for (i in 1:ncol(courses)) {
    cases[i] <- sum(courses[,i] == "Yes")
}
cases

   [1] 1 2 0 1 0 0 1 1 1 1 0 3 0 1 1 3 3 0 1 0 0 1 3 4 2 1 0 0 0 1 1 3 0 1 2 0 3
  [38] 1 0 1 2 0 1 1 1 0 0 1 1 0 1 0 1 2 0 0 0 0 1 1 1 1 0 1 0 3 2 1 1 2 0 1 4 1
  [75] 3 0 1 0 4 0 1 0 0 0 1 0 1 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 3 2 0 1 0 0 1 1
 [112] 2 0 1 0 2 1 2 0 0 4 0 2 1 1 1 0 0 0 1 0 0 0 1 2 1 1 1 1 0 0 1 1 1 1 0 0 1
 [149] 1 0 0 0 3 0 1 3 2 2 1 0 0 1 2 0 1 0 0 1 2 1 3 1 3 1 2 2 3 1 0 1 0 0 2 0 1
 [186] 0 0 0 1 1 1 0 1 1 2 2 0 0 0 1 0 2 0 2 2 0 2 2 1 0 3 0 0 3 0 0 0 1 2 1 0 1
 [223] 0 2 0 0 0 1 2 1 1 2 0 2 3 1 1 1 1 3 0 0 1 2 2 3 0 2 1 1 4 0 2 1 2 0 0 2 0
 [260] 1 1 1 1 2 2 0 2 1 1 1 2 0 0 2 0 2 0 1 0 1 2 1 1 2 1 1 3 1 1 0 0 2 0 1 1 1
 [297] 1 0 0 2 2 1 2 0 0 1 2 3 2 4 1 1 0 0 1 0 2 1 1 2 1 0 1 2 1 2 1 2 1 0 1 1 0
 [334] 2 2 3 1 2 1 1 1 0 2 0 1 0 2 2 1 0 4 1 0 0 0 1 1 2 2 0 1 2 1 2 0 2 1 3 0 0
 [371] 0 1 0 1 3 0 3 1 1 3 4 0 2 0 2 0 1 2 0 1 1 2 1 1 2 2 2 1 1 0 1 1 0 0 1 3 1
 [408] 0 1 3 1 0 3 2 2 0 0 2 1 1 1 0 2 0 0 0 0 0 1 0 0 2 1 2 0 3 1 1 3 0 2 2 1 1
 [445] 1 0 2 1 0 3 1 0 1 1 0 2 2 1 0 1 0 1 0 0 1 1 0 0 2 4 0 0 1 0 0 0 0 1 2 0 0
 [482] 3 1 0 1 1 1 1 0 3 1 0 1 0 1 2 1 2 0 0 2 0 1 0 1 2 0 0 0 0 0 1 0 1 1 0 1 1
 [519] 0 2 1 1 0 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 2 1 2 0 1 2 1 2 3 1 2 1 1 0 0 1 1
 [556] 1 2 2 1 0 2 1 2 0 1 1 0 0 0 3 2 0 3 1 0 2 0 1 0 1 0 0 1 1 3 0 3 1 1 1 2 2
 [593] 0 2 3 0 2 0 0 1 1 0 0 1 2 0 0 0 2 2 1 1 0 2 1 1 0 0 0 2 2 0 2 0 1 0 2 1 4
 [630] 0 0 1 1 1 0 0 2 1 0 0 0 0 3 0 1 1 2 0 0 1 2 2 1 0 2 2 0 1 0 3 1 0 1 1 0 2
 [667] 5 2 2 1 0 2 2 2 0 1 0 1 1 0 1 1 1 3 0 0 1 1 0 0 2 2 0 1 2 1 0 1 2 3 1 0 0
 [704] 2 0 0 1 4 1 2 1 0 1 0 0 2 4 1 1 0 0 0 2 0 2 0 0 2 0 0 0 0 1 3 2 1 1 1 1 1
 [741] 0 0 1 0 1 0 2 2 1 2 2 0 1 2 1 2 0 2 2 2 1 1 2 0 1 1 0 0 2 2 1 0 1 0 2 2 1
 [778] 0 1 0 1 0 0 0 0 1 0 1 2 1 2 1 0 2 1 3 0 3 1 0 1 0 1 1 4 1 0 0 2 0 4 1 0 0
 [815] 1 0 0 0 1 1 0 1 1 1 2 5 3 0 0 1 1 2 2 1 2 0 1 0 0 1 1 2 2 2 0 1 0 0 0 0 2
 [852] 2 0 2 1 1 1 0 2 1 1 1 0 0 3 1 1 1 0 0 0 1 1 0 0 0 0 2 1 0 2 2 3 0 1 0 2 0
 [889] 0 0 0 0 2 1 2 0 1 2 0 0 0 1 0 3 1 1 2 3 1 0 1 0 0 1 1 0 1 0 0 2 1 1 0 0 2
 [926] 1 1 1 0 0 3 1 2 1 1 0 1 2 2 1 0 1 0 0 3 0 1 2 4 1 0 0 1 0 1 3 0 0 1 3 1 1
 [963] 1 0 1 1 0 1 1 1 2 0 1 0 4 0 1 0 0 1 0 0 1 0 0 0 1 0 0 2 1 0 1 1 2 0 1 2 2
[1000] 1
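The explicit `for` loop above works, but the same counts can be computed in one line: comparing the whole matrix with `"Yes"` gives a logical matrix, and `colSums()` adds each column. A self-contained sketch (it re-simulates the data, so the values differ from the run above):

```r
# Re-simulate the courses matrix: 1000 courses of 97 students, 1% probability
set.seed(1)
courses <- replicate(1000, sample(c("Yes", "No"), size = 97,
                                  replace = TRUE, prob = c(0.01, 0.99)))

# One count per column (one per simulated course), with no explicit loop
cases <- colSums(courses == "Yes")
length(cases)   # 1000: one count per course
```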

It is hard to see the **pattern** in the *raw data*

table(cases)

cases
  0   1   2   3   4   5 
380 358 188  56  16   2 

barplot(table(cases))

**What are your results?**

We see that nearly 40% of courses (380 out of 1000) have nobody with epilepsy

Only about a third of the time do we see exactly one person in 97 with epilepsy

How can this be consistent with 1% of the world population? Did we do something wrong?

mean(courses=="Yes")

[1] 0.01006186

No problem here. The overall frequency is ~1%, as expected

We saw the practical consequences of theoretical probabilities

The simulation allows us to see the *meaning* of a population property

These two problems are related:

- **From reality to theory:** we do experiments to learn about the population, in particular about the *probabilities* of each outcome, that is, the *probability distribution*
- **From theory to reality:** assuming some probabilities, we can use the computer to explore what the consequences can be; if the consequences do not match reality, we **have to** change our hypothesis

Many important or interesting questions can be answered *YES* or *NO*

For example:

- Is this medicine improving health?
- Does this sweetener help us to lose weight?
- Is this DNA sequence related to that organism?

cases[723]>0

or directly

sum(courses[,723]=="Yes")>0

Even easier and clearer

any(courses[,723]=="Yes")

We store the official result in the vector `official`

We store our bet into `our_bet`

**If both `official` and `our_bet` are sorted** we do

all(official==our_bet)
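If the two vectors are not guaranteed to be in the same order, a sketch (with made-up numbers) is to sort both before comparing, or to use `setequal()`, which ignores order:

```r
official <- c(5, 12, 23, 31, 40, 47)   # hypothetical winning numbers
our_bet  <- c(47, 5, 31, 12, 40, 23)   # same numbers, different order

all(official == our_bet)               # FALSE: position-by-position comparison
all(sort(official) == sort(our_bet))   # TRUE: compare after sorting both
setequal(official, our_bet)            # TRUE: order does not matter
```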

s <- sum_two_dice()
s == 2 | s == 4 | s == 6 | s == 8 | s == 10 | s == 12

It is better if we use the *remainder*

s %% 2 == 0
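`sum_two_dice()` is used above but not defined in this section; a minimal version could look like this (the definition is our sketch, not from the original):

```r
# Sketch: sum of two fair six-sided dice
sum_two_dice <- function() {
    sum(sample(1:6, size = 2, replace = TRUE))
}

s <- sum_two_dice()
s %% 2 == 0   # TRUE exactly when the sum is even
```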

We normally represent events with a function

event <- function(outcome) {...}

Then we can calculate the empirical frequency of an event

result <- replicate(N, event(simulation(...)))
freq <- sum(result)/N

We have to choose `N` carefully
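Putting the pattern together for the even-sum event of two dice (the helper names `sum_two_dice` and `is_even` are illustrative, not from the original):

```r
# Simulation: sum of two fair dice
sum_two_dice <- function() sum(sample(1:6, size = 2, replace = TRUE))

# Event: the outcome is even
is_even <- function(outcome) outcome %% 2 == 0

set.seed(123)
N <- 10000
result <- replicate(N, is_even(sum_two_dice()))
freq <- sum(result) / N
freq   # close to the real probability, which is 0.5
```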

We do experiments to estimate the value of probabilities

We use the empirical frequencies

There is a lot of variability

How good is our experiment?

We will test different probabilities. Let’s start with 0.5.

p <- 0.5

We can simulate `n` experiments with success probability `p`

empirical_freq <- function(n, p) {
    experiments <- sample(c(TRUE, FALSE), prob = c(p, 1-p), size = n, replace = TRUE)
    return(sum(experiments)/n)
}

N <- c(10, 100, 1000, 10000)
plot(x = range(N), y = c(0, 1), log = "x", type = "n")
for (n in N) {
    y <- replicate(200, empirical_freq(n, p))
    points(x = rep(n, 200), y = y, pch = 19)
}
abline(h = p)

N <- 10^seq(from = 1, to = 4, by = 0.1)
plot(x = range(N), y = c(0, 1), log = "x", type = "n")
for (n in N) {
    y <- replicate(200, empirical_freq(n, p))
    points(x = rep(n, 200), y = y, pch = 19, cex = 0.5)
}
abline(h = p)

- For each `N` we can find intervals containing **most** of the experiments.
- The size of the interval depends on how sure you want to be. It is called a *confidence interval*.

With few data we can be far away from the real probability

If we have more data, then we are probably closer

We can even know how far away.

The *Central Limit Theorem* tells us how big that distance can be

- If the real probability of an event is \(p\)
- and the sample size is \(n\)
- we calculate \(\sigma=\sqrt{p(1-p)/n}\)
- then the **empirical frequency** \(m/n\) is in the range
  - \(p\pm 2\sigma\) with probability 95%
  - \(p\pm 3\sigma\) with probability 99%
  - \(p\pm 5\sigma\) with probability 99.999%
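We can check the 95% claim by simulation (a sketch, not part of the original text): generate many empirical frequencies and count how often they fall inside \(p\pm 2\sigma\).

```r
set.seed(42)
p <- 0.5
n <- 1000
sigma <- sqrt(p * (1 - p) / n)

# 1000 empirical frequencies m/n, each from n experiments
freqs <- replicate(1000, mean(sample(c(TRUE, FALSE), size = n,
                                     replace = TRUE, prob = c(p, 1 - p))))

# Fraction falling inside p +/- 2*sigma: should be close to 95%
mean(abs(freqs - p) <= 2 * sigma)
```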

If we have enough data, the empirical frequency \(m/n\) cannot be far away from the real probability \(p\).

The interval depends on \(p\), which we usually do not know.

If we only know \(m/n\), **how far away can the real probability be**?

We cannot use the same formula. It fails when \(m/n\) is 0.

The result depends on a value \(z\). Let \[\begin{aligned}\tilde{n} &= n + z^2\\
{\tilde {p}}&=\frac{m+z^2/2}{\tilde{n}}=\frac{m+z^2/2}{n+z^2}\end{aligned}\] Then \({\tilde {\sigma}(z)}={\sqrt{\tilde {p}\left(1-{\tilde {p}}\right)/\tilde {n}}}\), and the **real probability** \(p\) is in the range

- \(\tilde {p}\pm 2\cdot\tilde {\sigma}(2)\) with probability 95%
- \(\tilde {p}\pm 3\cdot\tilde {\sigma}(3)\) with probability 99%
- \(\tilde {p}\pm 5\cdot\tilde {\sigma}(5)\) with probability 99.999%

Let \(z=2\). Then \(\tilde{n} = n + 4\) and \({\tilde {p}}=(m+2)/(n + 4).\)

The confidence interval for \(p\) is given by \[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]

Real probability is in this range with 95% confidence.

**This is the formula we have to remember**
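As a sketch, the formula can be wrapped in a small R function (the name `confint95` is ours, not a standard one):

```r
# 95% confidence interval for a probability,
# from m successes in n trials (the formula above, with z = 2)
confint95 <- function(m, n) {
    p_tilde <- (m + 2) / (n + 4)
    s_tilde <- sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    c(lower = p_tilde - 2 * s_tilde, upper = p_tilde + 2 * s_tilde)
}

confint95(2, 97)   # 2 cases among 97 students: the interval contains 0.01
```

Note that it also works when \(m=0\), unlike the plain \(p\pm 2\sigma\) formula.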

- Points are your experimental results (event frequencies)
- Vertical lines represent the calculated 95% intervals
- The horizontal line represents the real probability

- Event: a function from *outcome* to *logical*
- You can use `any(...)`
- You can use `all(...)`
- Event count: `sum(replicate(N, event(...)))`
- Event frequency: `sum(replicate(N, event(...)))/N`

- Real probability is somewhere in \(\tilde {p}\pm 2\cdot\tilde {\sigma}(2)\) with probability 95%

Agresti, Alan; Coull, Brent A. (1998). *“Approximate is better than ‘exact’ for interval estimation of binomial proportions”*. The American Statistician. 52: 119–126. doi:10.2307/2685469. JSTOR 2685469. MR 1628435.