Class 22: Confidence Intervals

Computing for Molecular Biology 2

Andrés Aravena, PhD

28 May 2021

Summary of last part

There are populations and samples

Do not confuse them

Identify them

Populations

Big. Sometimes imaginary

They have mean, variance and standard deviation

mean(population)
var(population)
sd(population)

Variance and Standard Deviation

Variance is the square of standard deviation

var(population) == sd(population)^2

Standard deviation is the square root of variance

sd(population) == sqrt(var(population))
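The two identities can be checked numerically. This is a minimal sketch, assuming a made-up `population` vector of simulated values (the numbers are for illustration only). Because of floating-point rounding, `all.equal()` is a safer comparison than `==`. Note also that R's `var()` divides by n-1, which makes little difference for a large vector.

```r
# Hypothetical population: 10,000 simulated values (an assumption for illustration)
set.seed(1)
population <- rnorm(10000, mean = 50, sd = 8)

pop_var <- var(population)   # variance
pop_sd  <- sd(population)    # standard deviation

# Variance is the square of the standard deviation
same_var <- isTRUE(all.equal(pop_var, pop_sd^2))

# Standard deviation is the square root of the variance
same_sd <- isTRUE(all.equal(pop_sd, sqrt(pop_var)))

print(c(same_var, same_sd))
```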

What variance tells us about outcomes

An outcome is a random element of the population

An outcome will usually be close to the population mean

How similar? It depends on the population variance

Outcome versus population mean

How far an outcome can be from the population mean depends on the confidence level we choose

outcome will probably be between mean(pop)±k*sd(pop)

\[\bar{X} - k\,𝕊𝔻(X)≤\text{outcome}≤\bar{X} + k\,𝕊𝔻(X)\]

The key idea is: different k have different probability

If the population has Normal distribution

(This is not always true)

When the population is Normal, then

k	Probability
2	≈ 95%
3	≈ 99.7%
`qnorm(1-alpha/2)`	`1-alpha`

Choose your own alpha
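The values in the table can be reproduced with `pnorm()` and `qnorm()`. The following is a small sketch: the probability of falling within k standard deviations of the mean is `pnorm(k) - pnorm(-k)`, and going the other way, `qnorm(1 - alpha/2)` gives the k for a chosen `alpha`. The value `alpha <- 0.05` is only an example.

```r
# Probability that a Normal outcome is within k standard deviations of the mean
k <- c(1, 2, 3)
prob <- pnorm(k) - pnorm(-k)
print(round(prob, 4))        # 0.6827 0.9545 0.9973

# Inverse direction: choose alpha, get k
alpha <- 0.05                # example value; choose your own
k_alpha <- qnorm(1 - alpha / 2)
print(round(k_alpha, 2))     # 1.96
```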

If the population has another distribution

(Chebyshev's inequality always works, as long as the variance is finite)

k	Probability
2	≥ 1-1/2² = 75%
3	≥ 1-1/3² = 88.9%
10	≥ 1-1/10² = 99%
k	≥ 1-1/k²

If you do not have a bell-shaped curve, this is your safety net
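To see the safety net at work, we can compare the Chebyshev bound with the observed fraction in a population that is clearly not Normal. This sketch assumes a simulated exponential `population` (an arbitrary choice made here for illustration). The observed fraction should never fall below the bound, and is usually much larger.

```r
# Hypothetical skewed population (assumption: exponential, chosen for illustration)
set.seed(2)
population <- rexp(100000, rate = 1)

m <- mean(population)
s <- sd(population)
k <- c(2, 3, 10)

# Fraction of the population inside mean ± k*sd
observed <- sapply(k, function(kk) mean(abs(population - m) <= kk * s))

# Chebyshev's lower bound for the same fraction
bound <- 1 - 1 / k^2

print(data.frame(k, bound, observed))
```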

Sample

A sample is a group of outcomes

It has size, mean, variance, and standard deviation

length(sample)
mean(sample)
var(sample)
sd(sample)

Sample Mean

Each sample is different (random)

Each sample mean is random

If the sample size is large, then the sample mean has approximately a Normal distribution

Sample Mean Normal parameters

If the sample mean has a Normal distribution, we need to know its parameters

The average of the sample mean is the population mean

The standard deviation of the sample mean is the population standard deviation divided by the square root of the sample size

The variance of the sample mean is the population variance divided by the sample size
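These parameters can be checked by simulation. This is a sketch under stated assumptions: a simulated exponential `population`, a sample size of `n <- 50`, and 5000 repeated samples, all chosen here only for illustration. The mean and standard deviation of the simulated sample means should be close to `mean(population)` and `sd(population)/sqrt(n)`.

```r
# Hypothetical population and sample size (assumptions for illustration)
set.seed(3)
population <- rexp(100000, rate = 1)
n <- 50

# Take many samples and keep the mean of each one
sample_means <- replicate(5000, mean(sample(population, n)))

# Average of the sample mean versus the population mean
print(c(mean(sample_means), mean(population)))

# Standard deviation of the sample mean versus sd(population)/sqrt(n)
print(c(sd(sample_means), sd(population) / sqrt(n)))
```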

Predicting Sample Mean

sample mean will probably be between mean(pop)±k*sd(pop)/sqrt(n)

\[\bar{X} - k\frac{𝕊𝔻(X)}{\sqrt{\text{n}}}≤\text{sample mean}≤\bar{X} + k\frac{𝕊𝔻(X)}{\sqrt{\text{n}}}\]

The key idea is: different k have different probability

Sample mean has Normal distribution

k	Probability
2	≈ 95%
3	≈ 99.7%
`qnorm(1-alpha/2)`	`1-alpha`

Choose your own alpha
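The prediction for the sample mean can be tested in the same way. In this sketch the `population`, the sample size `n`, and `alpha` are all made-up values. We build the interval mean(pop)±k*sd(pop)/sqrt(n) and count how often the mean of a random sample falls inside. The observed fraction should be close to `1 - alpha`.

```r
# Hypothetical Normal population (assumption for illustration)
set.seed(4)
population <- rnorm(100000, mean = 100, sd = 15)
n <- 25
alpha <- 0.05

# Interval where the sample mean will probably be
k <- qnorm(1 - alpha / 2)
half_width <- k * sd(population) / sqrt(n)
lower <- mean(population) - half_width
upper <- mean(population) + half_width

# Fraction of sample means that fall inside the interval
sample_means <- replicate(2000, mean(sample(population, n)))
coverage <- mean(sample_means >= lower & sample_means <= upper)
print(coverage)
```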

Inverse problem: from sample to population

In real life we do not know the population mean, and we want to estimate it.

We only know the sample mean and the sample variance

We can approximate population variance by sample variance

But we have to pay a cost

Predicting Population Mean

population mean will probably be between mean(sample)±k*sd(pop)/sqrt(n), with sd(pop) approximated by sd(sample)

\[\text{sample mean} - k\frac{𝕊𝔻(X)}{\sqrt{\text{n}}}≤ \bar{X} ≤\text{sample mean} + k\frac{𝕊𝔻(X)}{\sqrt{\text{n}}}\]

The key idea is: different k have different probability

The cost of not knowing the population variance

Now k comes from Student’s t distribution instead of the Normal distribution

This depends on the degrees of freedom (sample size minus 1)

k	Probability
`qt(1-alpha/2, df)`	`1-alpha`

Choose your own alpha
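Putting the pieces together, a confidence interval for the population mean can be computed from a single sample. This is a minimal sketch, assuming a small simulated vector called `sample_values` (a hypothetical name and made-up data). The result is compared with the interval reported by R's built-in `t.test()`, which uses the same formula.

```r
# Hypothetical sample of 12 measurements (assumption for illustration)
set.seed(5)
sample_values <- rnorm(12, mean = 10, sd = 2)

n <- length(sample_values)
alpha <- 0.05

# k from Student's t with n-1 degrees of freedom
k <- qt(1 - alpha / 2, df = n - 1)

# Interval: mean(sample) ± k * sd(sample) / sqrt(n)
half_width <- k * sd(sample_values) / sqrt(n)
ci <- c(mean(sample_values) - half_width, mean(sample_values) + half_width)
print(ci)

# The same interval, as computed by t.test()
ci_builtin <- as.numeric(t.test(sample_values, conf.level = 1 - alpha)$conf.int)
print(ci_builtin)
```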

Shape of Student’s t distribution

Student’s t has fat tails

The price to pay for not knowing the population variance is to use Student’s t instead of Normal distribution.

Intervals using Student’s t are wider (and less useful)

To avoid this problem and get useful results, we need large enough samples
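The effect of sample size on k can be seen by comparing `qt()` with `qnorm()`. In this sketch the sample sizes in `n` and the value of `alpha` are arbitrary examples. For small samples the Student's t value is much larger than the Normal one, and it gets closer to the Normal value as the sample grows.

```r
alpha <- 0.05                       # example value; choose your own
n <- c(3, 5, 10, 30, 100, 1000)     # example sample sizes

# k from Student's t (depends on n) and from the Normal (does not)
k_t <- qt(1 - alpha / 2, df = n - 1)
k_normal <- qnorm(1 - alpha / 2)

print(data.frame(n, k_t = round(k_t, 3), k_normal = round(k_normal, 3)))
```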