Class 7: Central Limit Theorem

Methodology of Scientific Research

Andrés Aravena, PhD

5 May 2021

Homework Answers

How many people to interview?

If you do not have access to the age distribution and know only the standard deviation:

  • How many people do you need to interview to estimate the average age of the Turkish population with a margin of error of 5 years?
  • … of 1 year?
  • … of 1 month?
  • What is the probability that the real value is inside the intervals you have found?

Sample mean vs. population mean

We are looking for the population mean \(𝔼X\)

We know the population variance \(𝕍X\)

We interview \(n\) people and calculate \(\bar{𝐗}\)

A confidence interval

The population average is probably in the interval \[\left[\bar{𝐗}-c\sqrt{𝕍(X)/n}, \bar{𝐗}+c\sqrt{𝕍(X)/n}\right]\]

Using Chebyshev’s inequality, we know that the probability is at least \(1-1/c^2\)

Interval width

We want the interval width to be at most \(y\) years, with \(y\) equal to 5, 1, or 1/12

Since the width is \(2c\sqrt{𝕍(X)/n}\), we need \[2c\sqrt{𝕍(X)/n}≤y\] therefore \[n≥4c^2𝕍(X)/y^2\]

Evaluating with known values

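Variance \(𝕍(X)\) is 473.23 yr²

Minimum sample size \(n\) from \(n≥4c^2𝕍(X)/y^2\), for each interval width \(y\):

c    Prob (at least)   y = 5 yr    y = 1 yr    y = 1 month
2    75%                    303       7,572      1,090,322
3    89%                    682      17,037      2,453,225
5    96%                  1,893      47,323      6,814,512
10   99%                  7,572     189,292     27,258,048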
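
The bound \(n≥4c^2𝕍(X)/y^2\) can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of the original slides.

```python
import math

def chebyshev_sample_size(variance, width, c):
    """Smallest n such that 2*c*sqrt(variance/n) <= width."""
    # Round first so floating-point noise does not push an exact value up by one.
    return math.ceil(round(4 * c**2 * variance / width**2, 6))

def chebyshev_confidence(c):
    """Lower bound on the coverage probability given by Chebyshev's inequality."""
    return 1 - 1 / c**2

variance = 473.23  # yr^2, the value quoted in the text
for c in (2, 3, 5, 10):
    # Interval widths in years: 5 years, 1 year, 1 month
    sizes = [chebyshev_sample_size(variance, w, c) for w in (5, 1, 1 / 12)]
    print(c, f"{chebyshev_confidence(c):.0%}", sizes)
```

Because the width enters squared, asking for an interval 5 times narrower requires 25 times as many interviews.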

Insurance

Homework Question 1

Statement

A company wants to offer insurance to protect against the economic damage of COVID-19.

  • If a person takes the insurance, they pay 𝑥.
  • If they get COVID in the next year, then they are paid a fixed amount 𝑦
    • this happens with probability 𝑝
  • After one year, the company will have a net result 𝑅 corresponding to income minus expenses.
  • Since expenses depend on how many people get sick, 𝑅 is a random variable.

Questions

  • What is the expected value of the net result 𝑅?
  • What are the variance and standard deviation of the net result 𝑅?
  • What is the interval that contains the real net result 𝑅 with 99% probability?

Calculating net result 𝑅

We have \(n\) people paying, and \(𝑆\) people getting sick. The result is \[R=nx-𝑆y\] For our analysis, \(n, x\) and \(y\) are fixed, but \(𝑆\) is a random variable. Thus \(R\) is a random variable.

We want to know \(𝔼(R)\)

How can we calculate it?

Expected value of 𝑅

Using the linearity of the expected value, we have \[𝔼(R)=𝔼(nx-𝑆y)=nx-𝔼(𝑆)y\] So we need to calculate \(𝔼(𝑆)\)

What do we know about \(𝑆\)?

𝑆 is the number of sick people

There are \(n\) people, each one can get sick with probability \(p\)

Each person is a “coin” with probability \(p\)

Thus \(𝑆\) is a sum of coins

𝑆 follows a Binomial distribution

Assuming that each person gets sick independently, then \[𝑆 \sim Binom(n,p)\] Therefore, we immediately know that \[𝔼(𝑆)=np\qquad 𝕍(𝑆)=np(1-p)\]

Answers

  • What is the expected value of the net result 𝑅? \[𝔼(R)=nx-𝔼(𝑆)y=nx-npy\]
  • What are the variance and standard deviation of the net result 𝑅? \[𝕍(R)=𝕍(nx)+𝕍(-𝑆y)=0+ 𝕍(𝑆)y^2 =np(1-p)y^2\] so the standard deviation is \[\sqrt{𝕍(R)}=y\sqrt{np(1-p)}\]
  • What is the interval that contains the real net result 𝑅 with 99% probability?
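
These formulas can be checked against a simulation. The sketch below assumes hypothetical values for \(n, x, y\) and \(p\), chosen only for illustration, and lets each customer get sick independently with probability \(p\).

```python
import math
import random

def net_result_stats(n, x, y, p):
    """Return (expected value, variance, standard deviation) of R = n*x - S*y."""
    mean = n * x - n * p * y
    variance = n * p * (1 - p) * y**2
    return mean, variance, math.sqrt(variance)

def simulate_net_result(n, x, y, p, rng):
    """Draw one value of R: each of the n customers gets sick with probability p."""
    sick = sum(1 for _ in range(n) if rng.random() < p)
    return n * x - sick * y

# Hypothetical numbers, not taken from the original text.
n, x, y, p = 1000, 15.0, 100.0, 0.1
mean, variance, sd = net_result_stats(n, x, y, p)

rng = random.Random(1)  # fixed seed so the run is repeatable
draws = [simulate_net_result(n, x, y, p, rng) for _ in range(2000)]
simulated_mean = sum(draws) / len(draws)
print(mean, sd, simulated_mean)
```

The simulated average should land close to \(nx-npy\), with a spread between runs that shrinks as the number of draws grows.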

Confidence interval

After one year, the result \(R\) will probably be somewhere in the interval \[\left[𝔼(R)-c\sqrt{𝕍(R)}, 𝔼(R)+c\sqrt{𝕍(R)}\right]\] That is \[\left[nx-npy-cy\sqrt{np(1-p)}, nx-npy+cy\sqrt{np(1-p)}\right]\]

How do we choose \(x\) and \(y\)?

How not to go broke

We want \(R≥0,\) so the lower limit of the interval must not be negative \[nx-npy-cy\sqrt{np(1-p)}≥0\] thus \[\frac{x}{y}≥p+c\sqrt{\frac{p(1-p)}{n}}\]

With numbers

Assuming \(p=0.1,\) then \(x/y\) must be at least

c    Prob (at least)   n = 10   n = 100   n = 1000   n = 10000   n = 100000
2    75%                 0.29      0.16       0.12        0.11         0.10
3    89%                 0.38      0.19       0.13        0.11         0.10
5    96%                 0.57      0.25       0.15        0.12         0.10
10   99%                 1.05      0.40       0.19        0.13         0.11
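
The entries of this table follow from the inequality \(x/y≥p+c\sqrt{p(1-p)/n}\). A minimal sketch that recomputes them (the function name is illustrative):

```python
import math

def min_premium_ratio(p, n, c):
    """Smallest x/y for which the lower end of the interval for R is not negative."""
    return p + c * math.sqrt(p * (1 - p) / n)

p = 0.1
for c in (2, 3, 5, 10):
    row = [round(min_premium_ratio(p, n, c), 2) for n in (10, 100, 1000, 10000, 100000)]
    print(c, row)
```

As \(n\) grows the ratio approaches \(p\): with many customers the premium only needs to cover slightly more than the expected payout.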

Can we do better?

Chebyshev is pessimistic

We used Chebyshev’s inequality, which needs no hypothesis about the distribution beyond a finite variance

But we have more information. We know that \(𝑆\) is a Binomial random variable

Therefore we can make better confidence intervals

Calculating ℙ(𝑆=k)

We know that \[ℙ(𝑆=k|n\text{ in total})=\binom{n}{k} p^k(1-p)^{n-k}\] We can calculate \(\binom{n}{k}\) using Pascal’s triangle, even in Excel
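
The same construction works outside a spreadsheet. The sketch below builds a row of Pascal's triangle by repeated addition and uses it in the formula above; the function names are illustrative, and `math.comb` from the Python standard library is used only as a cross-check.

```python
import math

def pascal_row(n):
    """Row n of Pascal's triangle: the coefficients C(n, 0), ..., C(n, n)."""
    row = [1]
    for _ in range(n):
        # Each inner entry is the sum of the two entries above it.
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row

def binomial_pmf(k, n, p):
    """P(S = k) when S ~ Binom(n, p)."""
    return pascal_row(n)[k] * p**k * (1 - p) ** (n - k)

print(pascal_row(4))
print(pascal_row(10) == [math.comb(10, k) for k in range(11)])
print(binomial_pmf(2, 10, 0.1))
```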

Binomial coefficient in Excel

Pascal’s Triangle

Binomial distribution in Excel

Binomial distribution visually

Making the interval for 𝑆

\[ℙ(𝑆≤k)=\sum_{j=0}^k ℙ(𝑆=j)\]
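
With the cumulative sum we can look for an interval that holds the requested probability. The sketch below is one possible way to do it, cutting at most half of the leftover probability from each tail; the function names and the numbers \(n=1000, p=0.1\) are assumptions for illustration. The probabilities are computed through logarithms (`math.lgamma`) so that large \(n\) does not overflow.

```python
import math

def binomial_pmf(k, n, p):
    """P(S = k) for S ~ Binom(n, p) with 0 < p < 1, computed through logarithms."""
    log_coef = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coef + k * math.log(p) + (n - k) * math.log(1 - p))

def binomial_cdf(k, n, p):
    """P(S <= k), the sum of P(S = j) for j = 0..k."""
    return sum(binomial_pmf(j, n, p) for j in range(k + 1))

def central_interval(n, p, coverage):
    """Integers (lo, hi) with P(lo <= S <= hi) >= coverage,
    leaving at most (1 - coverage) / 2 in each tail."""
    tail = (1 - coverage) / 2
    lo, hi = None, n
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += binomial_pmf(k, n, p)
        if lo is None and cumulative >= tail:
            lo = k
        if cumulative >= 1 - tail:
            hi = k
            break
    return lo, hi

n, p = 1000, 0.1  # hypothetical values
lo, hi = central_interval(n, p, 0.99)
chebyshev_half_width = 10 * math.sqrt(n * p * (1 - p))  # c = 10 gives at least 99%
print(lo, hi, hi - lo, 2 * chebyshev_half_width)
```

The interval obtained from the Binomial distribution is much narrower than the Chebyshev interval with the same guaranteed probability.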

Use better tools

Good tools include functions to calculate the usual distributions

In Excel we have BINOM.DIST(k, n, p, cumulative)

In R we have pbinom() and dbinom()

What happens when 𝑛 is big?

A simple model

Now we have a coin 𝑋 with two possible outcomes: +1 and -1

To make life easy, we assume 𝑝=0.5

What are the expected value and variance of 𝑋?

Throw the coin 𝑛 times

We throw the coin 𝑛 times, and we calculate 𝑌, the sum of all 𝑋 \[Y=\sum_{i=1}^𝑛 X_i\]

What are the expected value and variance of 𝑌?

It is easy to see that

  • 𝑌 is basically a Binomial random variable
  • 𝔼𝑌 = 0, because 𝔼𝑋 = 0
  • 𝕍𝑌 = 𝑛, because 𝕍𝑋 = 1 and the throws are independent
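
Written out step by step, for a single coin with outcomes +1 and -1 and \(p=0.5\): \[𝔼X=(+1)\cdot\tfrac{1}{2}+(-1)\cdot\tfrac{1}{2}=0\qquad 𝕍X=𝔼(X^2)-(𝔼X)^2=1-0=1\] For the sum of \(n\) independent coins, expected values always add, and variances add because the throws are independent: \[𝔼Y=\sum_{i=1}^n 𝔼X_i=0\qquad 𝕍Y=\sum_{i=1}^n 𝕍X_i=n\] The link with the Binomial distribution is that, if \(B\) counts how many throws gave +1, then \(B\sim Binom(n, 1/2)\) and \(Y=2B-n\).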

Fix the variance to 1

Now consider \(Z_n=Y/\sqrt{𝑛}\)

It is easy to see that \(𝔼Z_n = 0\) and \(𝕍Z_n = 1\) independent of 𝑛

The possible values of \(Z_n\) are generally not integers, and for most 𝑛 not even rational numbers

What happens with \(Z_n\) when 𝑛 is really big?

Central limit theorem

When \(n→∞,\) the distribution of \(Z_n=\sum_i X_i/\sqrt{𝑛}\) converges to a Normal distribution \[\lim_{n→∞} Z_n ∼ Normal(0,1)\]
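
For the coin model this convergence can be observed without simulation, because the distribution of \(Z_n\) is known exactly. The sketch below compares the exact probability \(ℙ(Z_n≤z)\) with the standard Normal value; it uses `statistics.NormalDist` from the Python standard library, and the function name is illustrative.

```python
import math
from statistics import NormalDist

def cdf_of_z(z, n):
    """Exact P(Z_n <= z) for Z_n = Y / sqrt(n), with Y a sum of n fair +1/-1 coins."""
    # Y = 2*B - n where B counts the +1 outcomes,
    # so Z_n <= z is the same event as B <= (z*sqrt(n) + n) / 2.
    k = math.floor((z * math.sqrt(n) + n) / 2)
    if k < 0:
        return 0.0
    k = min(k, n)
    return sum(math.comb(n, j) for j in range(k + 1)) / 2**n

standard_normal = NormalDist(0, 1)
for n in (10, 100, 1000):
    print(n, cdf_of_z(1.0, n), standard_normal.cdf(1.0))
```

The exact values move in small jumps, since \(Z_n\) only takes \(n+1\) different values, but they get closer to the Normal curve as \(n\) grows.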

More in general

If \(X_i\) is a set of independent, identically distributed random variables, with expected value \[𝔼X_i=μ\quad\text{for all }i\] and variance \[𝕍X_i=σ^2\quad\text{for all }i\] then, when \(n\) is large \[\lim_{n→∞} \frac{\sum_{i=1}^n (X_i-μ)}{σ\sqrt{𝑛}} ∼ Normal(0,1)\]

In other words

If \(X_i\) is a set of independent, identically distributed random variables, with expected value \[𝔼X_i=μ\quad\text{for all }i\] and variance \[𝕍X_i=σ^2\quad\text{for all }i\] then, when \(n\) is large \[\lim_{n→∞} \frac{\sum_{i=1}^n (X_i-μ)}{\sqrt{𝑛}} ∼ Normal(0, σ^2)\]
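
The theorem does not require coins. As an illustration, the sketch below adds Uniform(0, 1) variables, which have \(μ=1/2\) and \(σ^2=1/12\), subtracts \(μ\) from each, and divides by \(σ\sqrt{n}\). The choice of the uniform distribution, the sample sizes and the seed are assumptions made only for this example.

```python
import math
import random
from statistics import NormalDist, fmean, pvariance

def standardized_sum(n, rng):
    """(sum of n Uniform(0,1) draws - n*mu) / (sigma*sqrt(n))."""
    mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and standard deviation of Uniform(0, 1)
    total = sum(rng.random() for _ in range(n))
    return (total - n * mu) / (sigma * math.sqrt(n))

rng = random.Random(7)  # fixed seed so the run is repeatable
values = [standardized_sum(50, rng) for _ in range(5000)]

# Fraction of standardized sums inside [-1.96, 1.96], and the Normal(0,1) value
inside = sum(1 for v in values if abs(v) <= 1.96) / len(values)
expected_inside = NormalDist().cdf(1.96) - NormalDist().cdf(-1.96)
print(fmean(values), pvariance(values), inside, expected_inside)
```

The mean should come out near 0, the variance near 1, and the fraction inside \([-1.96, 1.96]\) near 95%, as expected for a standard Normal variable.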

This is why Normal distributions are important

Noise is usually Normal

  • Thermal noise is the sum of many small vibrations in all directions
    • they sum and usually cancel each other
  • Phenotype depends on several genetic conditions
    • Height, weight and similar attributes depend on the combination of several attributes

It does not always work

  • Not all combined effects are sums
    • some effects are multiplicative
  • Some effects may not have finite variance
    • sometimes variance is infinite
  • Not all effects are independent
    • this is the most critical issue