Class 30: Law of Large Numbers

Methodology of Scientific Research

Andrés Aravena, PhD

May 17, 2023

What we know so far

  • Experiments are small samples of a large population

  • There is variability in the population

  • There is noise in every measurement

  • We want to understand the population, but we only have a sample

  • We want to separate signal and noise

Backwards reasoning

  • First, we will assume that we know the population

  • We will predict what can happen in any random sample

  • We will compare the predicted sample with the experimental one

  • Then we will analyze what this teaches us about the population

About the population

We will do an experiment that we call \(X\).
Let’s assume that we know

  • The set \(Ω\) of all possible outcomes
  • The probability \(ℙ(X=x)\) of each outcome \(x∈Ω\)

Then we can calculate

  • the expected value \(𝔼X\) a.k.a. population mean
  • the population variance \(𝕍X\)

Example

  • The experiment is to ask the age of a random person

  • Population is “the age of every person living in Turkey”

  • \(Ω\) is the natural numbers ≤200

  • \(ℙ(X=x)\) is the proportion of people with age \(x\)

  • \(𝔼\,X\) is the average age of people in Turkey

  • \(𝕍\,X\) is the variance of age of people in Turkey
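A minimal sketch of how \(𝔼X\) and \(𝕍X\) follow from the probabilities \(ℙ(X=x)\). The table `pmf` below is a made-up toy distribution with three ages, not real census data, and the function names are only illustrative.

```python
# Hypothetical toy distribution: age -> proportion of people with that age.
# These numbers are invented for illustration; they are NOT real data.
pmf = {10: 0.2, 30: 0.5, 60: 0.3}

def expected_value(pmf):
    """Population mean: sum of x * P(X = x) over all outcomes."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Population variance: sum of (x - EX)^2 * P(X = x)."""
    mean = expected_value(pmf)
    return sum((x - mean) ** 2 * p for x, p in pmf.items())

mean_age = expected_value(pmf)
var_age = variance(pmf)
print(mean_age, var_age)  # about 35 and 325 for this toy table
```

With a real age table, only the dictionary changes; the two sums stay the same.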

Let’s calculate

Let’s use the data from the age distribution of Türkiye

  • What is the expected value of age?

  • What is the variance of age?

  • What does that mean?

What to expect as outcome

The expected value does not tell us exactly what to expect

But it tells us approximately

Outcomes are probably near the expected value

Probably near the expected value

We have the following result \[ℙ(𝔼X-c\sqrt{𝕍X} ≤ X ≤ 𝔼X+c\sqrt{𝕍X})≥ 1-1/c^2\] That is, outcomes are probably close to the expected value

\(c\) is a constant that says how many standard deviations we allow around the expected value. A larger \(c\) gives a wider interval and a higher probability that the outcome falls inside it

This is Chebyshev’s inequality

Proved by Pafnuty Lvovich Chebyshev (Пафну́тий Льво́вич Чебышёв) in 1867

It is always valid, for any probability distribution with a finite variance

Later we will see better rules valid only for specific distributions

Alternative formula

Chebyshev’s inequality can also be written as \[ℙ(|X-𝔼X|≤ c⋅\sqrt{𝕍X})≥ 1-1/c^2\]

The probability that

“an outcome \(X\) is within \(c⋅\sqrt{𝕍X}\) of \(𝔼X\)”

is at least \(1-1/c^2\)

Another alternative formula

Chebyshev’s inequality can also be written as \[ℙ(|X-𝔼X| > c⋅\sqrt{𝕍X})≤ 1/c^2\]

The probability that

“the distance between \(𝔼X\) and an outcome \(X\) is more than \(c⋅\sqrt{𝕍X}\)”

is at most \(1/c^2\)

Some examples of Chebyshev’s inequality

\[ℙ(𝔼X -c⋅\sqrt{𝕍X}≤ X ≤ 𝔼X +c⋅\sqrt{𝕍X})≥ 1-1/c^2\]

Replacing \(c\) for some specific values, we get

\[\begin{aligned} ℙ(|X-𝔼X| ≤ 1⋅\sqrt{𝕍X})&≥ 1-1/1^2=0\\ ℙ(|X-𝔼X| ≤ 2⋅\sqrt{𝕍X})&≥ 1-1/2^2=0.75\\ ℙ(|X-𝔼X| ≤ 3⋅\sqrt{𝕍X})&≥ 1-1/3^2=0.889 \end{aligned}\]

For any numerical population

  • at least 3/4 of the population lie within two standard deviations of the mean, that is, in the interval with endpoints \(𝔼X±2⋅\sqrt{𝕍X}\)
  • at least 8/9 of the population lie within three standard deviations of the mean, that is, in the interval with endpoints \(𝔼X±3⋅\sqrt{𝕍X}\)
  • at least \(1-1/c^2\) of the population lie within \(c\) standard deviations of the mean, that is, in the interval with endpoints \(𝔼X±c⋅\sqrt{𝕍X},\) where \(c\) is any positive number greater than 1
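The bound can be checked on a distribution that is far from symmetric. This is only a sketch: the table `pmf` is an invented, strongly skewed toy distribution, and `prob_within` is a hypothetical helper name.

```python
import math

# Hypothetical skewed distribution, invented for illustration
pmf = {0: 0.7, 1: 0.2, 10: 0.1}

mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean) ** 2 * p for x, p in pmf.items())
sd = math.sqrt(var)

def prob_within(c):
    """Exact P(|X - EX| <= c * sd) for the toy distribution."""
    return sum(p for x, p in pmf.items() if abs(x - mean) <= c * sd)

for c in (1.5, 2, 3):
    bound = 1 - 1 / c**2
    # The exact probability is never below Chebyshev's lower bound
    print(c, prob_within(c), bound)
```

The exact probabilities are well above the bound here. Chebyshev’s inequality is a guarantee for the worst case, so for most distributions it is pessimistic.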

Exercise

What are the age intervals that contain

  • at least 75% of the Turkish population

  • at least 8/9 of the Turkish population

  • at least 99% of the Turkish population

Proof of Chebyshev’s Inequality

(read this if you want to know the truth)

A tool that we need

If \(Q\) is a yes-no question, we will use the notation \(〚Q〛\) to represent this:

\[〚Q〛=\begin{cases} 1\quad\text{if }Q\text{ is true}\\ 0\quad\text{if }Q\text{ is false} \end{cases}\]

It is a nice way to write the limits of sums

Instead of cramming symbols over and under ∑ \[\sum_{x=1}^{10} f(x)\] we can write the limits at normal size \[\sum_x f(x) 〚1≤x≤10〛\]

It is a nice way to decompose events

If we want to calculate the probability of the event \(Q\), instead of writing \[ℙ(Q)=\sum_{x\text{ makes }Q\text{ true}}ℙ(X=x)\] we can write \[ℙ(Q)=\sum_{x}ℙ(X=x) 〚Q(x)〛\]
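The bracket notation translates directly into code. In this sketch, `iverson`, `f`, and the fair die are assumptions chosen for illustration; they are not part of the lecture.

```python
from fractions import Fraction

def iverson(condition):
    """Return 1 if the yes-no question is true, 0 if it is false."""
    return 1 if condition else 0

def f(x):
    return x * x  # hypothetical function to be summed

# Limits written in the range of the sum
limits_sum = sum(f(x) for x in range(1, 11))
# Same sum over a much wider range, with the limits inside the bracket
bracket_sum = sum(f(x) * iverson(1 <= x <= 10) for x in range(-100, 101))

# Probability of an event: a fair die, and the question "is x even?"
die = {x: Fraction(1, 6) for x in range(1, 7)}
prob_even = sum(p * iverson(x % 2 == 0) for x, p in die.items())

print(limits_sum, bracket_sum, prob_even)
```

Both sums give the same value, because the bracket turns every term outside the limits into zero.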

Proof of Chebyshev’s inequality

By the definition of variance, we have \[𝕍(X)=𝔼(X-𝔼X)^2=\sum_{x∈Ω} (x-𝔼X)^2ℙ(X=x)\] Every term of this sum is non-negative. If we multiply each term by a number that is sometimes 0 and sometimes 1, the sum cannot get larger, so for any \(α>0\) \[𝕍(X)≥\sum_{x∈Ω} (x-𝔼X)^2ℙ(X=x)〚(x-𝔼X)^2≥α〛\]

We want to make it even smaller

Proof of Chebyshev’s inequality (cont)

Since we are only taking the cases where \((X-𝔼X)^2≥α\), replacing \((X-𝔼X)^2\) by \(α\) will make the right side even smaller

\[\begin{aligned} 𝕍(X)& ≥α\sum_{x∈Ω} ℙ(X=x)〚(x-𝔼X)^2≥α〛\\ & =αℙ\left((X-𝔼X)^2≥α\right) \end{aligned}\] Then we can divide by \(α\) and we get Chebyshev’s result \[ℙ\left((X-𝔼X)^2≥α\right)≤𝕍(X)/α\]
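The two steps of the proof can be followed with exact fractions. The distribution `pmf` and the threshold `alpha` below are arbitrary choices made for this sketch.

```python
from fractions import Fraction

# Hypothetical toy distribution, invented for illustration
pmf = {0: Fraction(1, 2), 2: Fraction(1, 4), 6: Fraction(1, 4)}
alpha = 5  # any positive threshold for the squared deviation

mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Step 1: keep only the terms where the squared deviation is at least alpha
truncated = sum((x - mean) ** 2 * p for x, p in pmf.items()
                if (x - mean) ** 2 >= alpha)

# Step 2: replace each squared deviation that was kept by alpha
tail_prob = sum(p for x, p in pmf.items() if (x - mean) ** 2 >= alpha)
lower = alpha * tail_prob

print(var, truncated, lower)  # each value is at most the previous one
print(tail_prob, var / alpha)  # the tail probability is at most var / alpha
```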

Using the inequality

Chebyshev’s result is \(ℙ\left((X-𝔼X)^2≥α\right)≤𝕍(X)/α.\)

If we choose \(α=c^2⋅𝕍X\) then we have \[ℙ\left((X-𝔼X)^2 ≥ c^2⋅𝕍X \right)≤ 1/c^2\] If we get rid of the squares, we get \[ℙ(|X-𝔼X| ≥ c\sqrt{𝕍X})≤ 1/c^2\] This is the probability that the outcome is far away from the expected value

Probability of being near 𝔼X

Now we can look at the opposite event \[ℙ(|X-𝔼X| ≤ c\sqrt{𝕍X})≥ 1-1/c^2\] The event inside \(ℙ()\) can be rewritten as \[-c\sqrt{𝕍X} ≤ X-𝔼X ≤ c\sqrt{𝕍X}\] which means that the outcome is near the expected value \[𝔼X-c\sqrt{𝕍X} ≤ X ≤ 𝔼X+c\sqrt{𝕍X}\]

Another point of view

The event inside \(ℙ()\) is \(|X-𝔼X| ≤ c\sqrt{𝕍X}\)

As we said, it can be rewritten as \[-c\sqrt{𝕍X} ≤ X-𝔼X ≤ c\sqrt{𝕍X}\] which also means that the expected value is near the outcome \[X-c\sqrt{𝕍X} ≤ 𝔼X ≤ X+c\sqrt{𝕍X}\]

This is a confidence interval

Application

Previous class

We have a small sample \(𝐗=(X_1,…,X_n)\)
All random variables \(X_i\) are i.i.d.
The average \(\bar{𝐗}\) is also a random variable

\[𝔼\,\bar{𝐗}=𝔼\,\text{mean}(𝐗)=𝔼\,X\] \[𝔼\,\text{var}(𝐗) = \frac{n-1}{n}𝕍\,X\] What about \(𝕍\,\bar{𝐗}\)?

Variance of the sample mean

For independent \(X\) and \(Y\) we have \(𝕍(α X+βY)=α^2𝕍(X)+β^2𝕍(Y).\) Since the \(X_i\) are independent, \[𝕍(\bar{𝐗})=𝕍\left(\frac{1}{n}\sum_i X_i\right)=\frac{1}{n^2}𝕍\sum_i X_i=\frac{1}{n^2}\sum_i 𝕍 X_i\] and since all \(X_i\) come from the same population \[𝕍(\bar{𝐗})=\frac{1}{n^2}\sum_i 𝕍 X=\frac{n}{n^2}𝕍 X=\frac{1}{n}𝕍 X\]
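A small simulation can make this result visible. The sketch below assumes the population is a fair six-sided die, whose variance is \(35/12\). The sample size, the number of repetitions, and the seed are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

n = 10            # sample size
repeats = 20_000  # number of simulated samples

def sample_mean(n):
    """Average of n independent rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

means = [sample_mean(n) for _ in range(repeats)]

population_var = 35 / 12          # variance of one fair die roll
predicted = population_var / n    # the formula derived above
observed = statistics.pvariance(means)
print(predicted, observed)
```

The observed variance of the simulated averages should be close to the predicted one, and it gets closer as `repeats` grows.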

Standard error

Averages of bigger samples have smaller variance \[𝕍(\bar{𝐗})=\frac{1}{n}𝕍 X\] Its square root is the standard deviation of the sample average \[\sqrt{𝕍(\bar{𝐗})}=\sqrt{\frac{1}{n}𝕍 X}=\frac{\text{stdev}(X)}{\sqrt{n}}\]

This is important. It has its own name: Standard Error

To remember

Standard error is the standard deviation of the sample average

It is calculated as the standard deviation of the population divided by the square root of \(n\)

Combining with Chebyshev’s inequality

For any random variable \(X,\) we have \[ℙ(|X-𝔼X| ≤ c\sqrt{𝕍X})≥ 1-1/c^2\] in the case of \(\bar{𝐗}\) we have \[ℙ\left(|\bar{𝐗}-𝔼\bar{𝐗}| ≤ c\sqrt{𝕍\bar{𝐗}}\right)≥ 1-1/c^2\] that is \[ℙ\left(|\bar{𝐗}-𝔼X| ≤ c\sqrt{𝕍(X)/n}\right)≥ 1-1/c^2\]

Written as an interval

We have \[ℙ\left(-c\sqrt{𝕍(X)/n}≤ 𝔼X-\bar{𝐗} ≤ c\sqrt{𝕍(X)/n}\right)≥ 1-1/c^2\] This can also be written as \[ℙ\left(\bar{𝐗}-c\sqrt{𝕍(X)/n}≤𝔼X ≤ \bar{𝐗}+c\sqrt{𝕍(X)/n}\right)≥ 1-1/c^2\]

Thus, we have an interval that probably contains the population mean

A confidence interval

We want to know the population mean \(𝔼\,X\)

We take a random sample (of \(n\) people)

The population average is in the interval \[\left[\bar{𝐗}-c\sqrt{𝕍(X)/n}, \bar{𝐗}+c\sqrt{𝕍(X)/n}\right]\] with probability at least \(1-1/c^2\)

This is called a confidence interval
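As a sketch, the interval can be computed like this. The function name `chebyshev_interval`, the sample of ages, and the population variance of 400 are all invented for illustration. The code assumes that the population variance is known.

```python
import math

def chebyshev_interval(sample, population_var, confidence):
    """Interval that contains the population mean with probability
    at least `confidence`, assuming the population variance is known."""
    n = len(sample)
    mean = sum(sample) / n
    # Solve 1 - 1/c^2 = confidence for c
    c = 1 / math.sqrt(1 - confidence)
    margin = c * math.sqrt(population_var / n)
    return mean - margin, mean + margin

# Hypothetical sample of ages and hypothetical known population variance
ages = [23, 35, 41, 18, 52, 67, 29, 33, 48, 54]
low, high = chebyshev_interval(ages, population_var=400, confidence=0.75)
print(low, high)
```

Here the confidence level 0.75 corresponds to \(c=2\), so the interval is the sample average plus or minus two standard errors.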

Law of Large Numbers

This is an important result. It says that

  • The sample average is close to the population average

  • When the sample size is large, the interval is narrower

  • The margin of error depends on

    • the confidence level we chose
    • the standard deviation of the population \(\sqrt{𝕍(X)}\)
    • the square root of the sample size \(\sqrt{n}\)

Frequentist v/s Bayesian philosophies

As you know, there are two schools of probability

  • The Bayesian school sees probabilities as degrees of belief

  • The frequentist school sees probabilities as averages of many experiments

The Law of Large Numbers shows that, if samples are large, both points of view give the same result

The bad news

The margin of error depends on the square root of the sample size \(\sqrt{n}\)

Thus, to double the precision (half the margin of error), we need 4 times more data

To get one more decimal place (10 times more precision) we need 100 times more data
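Setting the margin of error \(m=c\sqrt{𝕍(X)/n}\) and solving for \(n\) gives \(n=(c\sqrt{𝕍X}/m)^2\). The sketch below uses that formula with a hypothetical population standard deviation of 20 units. The function name and all the numbers are assumptions made for illustration.

```python
import math

def required_sample_size(population_sd, margin, confidence):
    """Smallest n such that c * sd / sqrt(n) <= margin,
    where c solves 1 - 1/c^2 = confidence (Chebyshev's bound)."""
    c = 1 / math.sqrt(1 - confidence)
    return math.ceil((c * population_sd / margin) ** 2)

# Hypothetical population standard deviation, in arbitrary units
sd = 20
n_wide = required_sample_size(sd, margin=10, confidence=0.75)
n_half = required_sample_size(sd, margin=5, confidence=0.75)
n_tenth = required_sample_size(sd, margin=1, confidence=0.75)
print(n_wide, n_half, n_tenth)  # 16, 64, 1600
```

Halving the margin multiplies the sample size by 4, and dividing the margin by 10 multiplies it by 100.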

The other bad news

The margin of error depends on the standard deviation of the population

That is, the square root of the population variance \(𝕍(X)\)

But we do not know the population variance

What can we do?

Exercises

How many people do you need to interview to estimate the average age of the Turkish population with a margin of error

… of 5 years?

… of 1 year?

… of 1 month?