29 November 2017

Exercise: Conditional probabilities

In the wardrobe drawer there are 24 socks. Half of them are black and the other half are white.

If you take two socks at random (let’s say, closing your eyes), what is the probability that you get a matching pair?
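One way to sanity-check an answer to this exercise is a quick simulation (a sketch: the drawer is encoded as 12 black and 12 white socks, and the two draws are made without replacement):

```python
import random

random.seed(42)

def estimate_matching_pair(trials=100_000):
    """Estimate the probability that two socks drawn at random match."""
    drawer = ["black"] * 12 + ["white"] * 12   # 24 socks, half of each colour
    hits = 0
    for _ in range(trials):
        a, b = random.sample(drawer, 2)        # two socks, without replacement
        hits += (a == b)
    return hits / trials

print(estimate_matching_pair())
```

The estimate should stabilize near the exact conditional-probability answer as the number of trials grows.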

Law of Large Numbers

If we have \(n\) random variables \(X_i\), all independent and all with the same distribution, then their average \[A_n=\frac{1}{n}\sum_{i=1}^n X_i\] converges to the expected value \(\mathbb E X\). The typical deviation shrinks like \(1/\sqrt{n}\): by Chebyshev's inequality, \[\Pr\left(|A_n-\mathbb EX|\geq c\sqrt{\frac{\mathbb VX}{n}}\right)\leq \frac{1}{c^2}\]
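A small simulation illustrates the \(1/\sqrt{n}\) rate (a sketch using a fair die, for which \(\mathbb EX=3.5\) and \(\mathbb VX=35/12\); the sample sizes are arbitrary):

```python
import random
import statistics

random.seed(0)

def average_of_rolls(n):
    """The average A_n of n independent fair-die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# The spread of A_n around E X = 3.5 shrinks roughly like sqrt(V X / n):
for n in (100, 10_000):
    spread = statistics.stdev(average_of_rolls(n) for _ in range(200))
    print(n, round(spread, 3))
```

Multiplying \(n\) by 100 should divide the observed spread by about 10.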

Application: estimate distributions

Using the notation \([Q]=1\) if the question \(Q\) is true and \(0\) if it is false, we have \[\mathbb E[Q]=\Pr(Q)\] Therefore if we do \(n\) experiments and \(N(Q)\) of them are positive for the question \(Q,\) then \[\frac{N(Q)}{n}\underset{n\rightarrow\infty}\longrightarrow \Pr(Q)\]
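This is the basis of Monte Carlo estimation of probabilities. A minimal sketch (the question \(Q\), "a fair die shows at least 5", is an arbitrary example with \(\Pr(Q)=1/3\)):

```python
import random

random.seed(0)

def estimate_probability(question, experiment, n=100_000):
    """Estimate Pr(Q) by the fraction N(Q)/n of positive experiments."""
    return sum(question(experiment()) for _ in range(n)) / n

p = estimate_probability(lambda x: x >= 5, lambda: random.randint(1, 6))
print(p)   # approaches Pr(Q) = 1/3 as n grows
```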

Some discrete distributions

So far we have worked with random systems where the outcomes are discrete:

  • Only two outcomes, as in a coin flip
  • A finite number of outcomes, such as the four DNA bases
  • A natural number, such as the number of successful experiments

Bernoulli distribution

  • A single “coin”, or any experiment with only two possible outcomes
    • \(\Pr(\text{success}) = p\)
    • \(\Pr(\text{failure}) = q = 1-p\)
  • We can encode it as 0 or 1 using \([\text{success}]\)
  • The expected value of \([\text{success}]\) is \[\mathbb E[\text{success}]=p\]
  • The variance of \([\text{success}]\) is \[\mathbb V[\text{success}]=pq\]
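These two formulas are easy to verify numerically (a sketch; the value \(p=0.3\), so \(pq=0.21\), is chosen arbitrarily):

```python
import random
import statistics

random.seed(0)
p = 0.3
# Encode each experiment as [success], i.e. 1 with probability p, else 0
samples = [1 if random.random() < p else 0 for _ in range(100_000)]
print(statistics.mean(samples))       # close to E[success] = p = 0.3
print(statistics.pvariance(samples))  # close to V[success] = p*q = 0.21
```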

Binomial distribution

We throw \(n\) “coins”, all independent, all with the same probability of success \(p\)

The number \(B_n\) of successful “coins” is a random variable. Its distribution is \[\Pr(B_n=k)=\Pr(k\text{ successes in }n\text{ trials})=\binom{n}{k}p^k q^{n-k}\] It is easy to see that \[\begin{align} \mathbb E(B_n)&=np\\ \mathbb V(B_n)&=npq \end{align}\]
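The pmf and these two moments can be checked exactly with a short computation (a sketch using the standard-library `math.comb`; the values \(n=20\), \(p=0.3\) are arbitrary):

```python
from math import comb

def binom_pmf(n, k, p):
    """Pr(B_n = k) = C(n, k) p^k q^(n-k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 20, 0.3
pmf = [binom_pmf(n, k, p) for k in range(n + 1)]
mean = sum(k * pmf[k] for k in range(n + 1))
var = sum((k - mean) ** 2 * pmf[k] for k in range(n + 1))
print(mean, var)   # matches np = 6 and npq = 4.2
```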



The probability that \(B_n=k\) for each single \(k\) can be very small, but the probability that \(B_n\) falls in a range of values is usually larger

What is the probability that \(B_n\geq a\) for a given \(a\)? \[\begin{align}\Pr(B_n\geq a)&=\sum_{k=a}^n\Pr(B_n=k)\\ &=\sum_{k=a}^n\binom{n}{k}p^k q^{n-k}\end{align}\]
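This sum translates directly into code (a sketch; the example asks for at least 6 successes in 10 fair trials):

```python
from math import comb

def binom_tail(n, a, p):
    """Pr(B_n >= a): sum of the binomial pmf from k = a up to n."""
    q = 1 - p
    return sum(comb(n, k) * p ** k * q ** (n - k) for k in range(a, n + 1))

print(binom_tail(10, 6, 0.5))   # 386/1024 ≈ 0.377
```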

General case

What is the probability that \(B_n\) is in the range \([a,b]\)? \[\Pr(a\leq B_n\leq b)=\sum_{k=a}^b\Pr(B_n=k)\] Or, if \(a\) and \(b\) are not integers \[\Pr(a\leq B_n\leq b)=\sum_{a\leq k\leq b}\Pr(B_n=k)\]

Change of variable

We want to see what happens when \(n\) is big. Let \[S_n = \frac{B_n-np}{\sqrt{npq}}\] It is easy to see that \[\begin{align} \mathbb E(S_n)&=0\\ \mathbb V(S_n)&=1 \end{align}\] for all values of \(n\)


To evaluate \(\Pr(a\leq S_n\leq b)\) we can do \[x_k=\frac{k-np}{\sqrt{npq}}\] so \[\Pr(a\leq S_n\leq b)=\sum_{a\leq x_k\leq b}\binom{n}{k}p^k q^{n-k}\] where \(k=np+x_k\sqrt{npq}\) and \((n-k)=nq-x_k\sqrt{npq}\)

When \(n\) is big

Remember that \[\binom{n}{k}=\frac{n!}{k!(n-k)!}\] When \(n\) is big we can approximate the factorial \[n!\approx Cn^{n+1/2}e^{-n}\] where \(C\) is a constant that we will find later

This is called Stirling’s approximation and we will explain it later
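A quick numeric sketch shows the ratio \(n!/(n^{n+1/2}e^{-n})\) settling to a constant as \(n\) grows (the sample values of \(n\) are arbitrary):

```python
from math import factorial, exp

# The ratio n! / (n^(n+1/2) e^(-n)) approaches the constant C
for n in (5, 20, 100):
    ratio = factorial(n) / (n ** (n + 0.5) * exp(-n))
    print(n, ratio)
```

The printed ratios decrease towards a value near 2.5066, the constant identified at the end of these notes.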


Now the binomial coefficient can be written as \[\begin{align} \frac{n!}{k!(n-k)!}&=\frac{Cn^{n+1/2}e^{-n}}{C^2k^{k+1/2}e^{-k} (n-k)^{n-k+1/2}e^{-(n-k)}}\\ &=\frac{n^{n+1/2}}{C k^{k+1/2} (n-k)^{n-k+1/2}}\\ &=\frac{1}{C}\left(\frac{n}{k(n-k)}\right)^{1/2}\left(\frac{n}{k}\right)^k\left(\frac{n}{n-k}\right)^{n-k} \end{align}\]

Binomial formula

\[\binom{n}{k}p^k q^{n-k} =\frac{1}{C}\left(\frac{n}{k(n-k)}\right)^{1/2}\left(\frac{np}{k}\right)^k\left(\frac{nq}{n-k}\right)^{n-k}\] Now \[\frac{n}{k(n-k)}= \frac{n}{(np+x_k\sqrt{npq})(nq-x_k\sqrt{npq})}\approx \frac{1}{npq}\] therefore \[\binom{n}{k}p^k q^{n-k}\approx \frac{1}{C\sqrt{npq}}\left(\frac{np}{k}\right)^k\left(\frac{nq}{n-k}\right)^{n-k}\]


We have \[\ln\left(\left(\frac{np}{k}\right)^k\right) =-k\ln\left(\frac{k}{np}\right)\\ =-(np+x_k\sqrt{npq})\ln\left(\frac{np+x_k\sqrt{npq}}{np}\right)\\ =-(np+x_k\sqrt{npq})\ln\left(1+x_k\sqrt{\frac{q}{np}}\right)\] with the same procedure we get \[\ln\left(\left(\frac{nq}{n-k}\right)^{n-k}\right) =-(nq-x_k\sqrt{npq})\ln\left(1-x_k\sqrt{\frac{p}{nq}}\right)\]


Using now the approximation \(\ln(1+x)\approx x-x^2/2\) we can write \[\ln\left(\left(\frac{np}{k}\right)^k\right) \approx-(np+x_k\sqrt{npq})\left(x_k\sqrt{\frac{q}{np}}-x_k^2\frac{q}{2np}\right)\]

\[\ln\left(\left(\frac{nq}{n-k}\right)^{n-k}\right) \approx-(nq-x_k\sqrt{npq})\left(-x_k\sqrt{\frac{p}{nq}}-x_k^2{\frac{p}{2nq}}\right)\]

\[\ln\left(\left(\frac{np}{k}\right)^k\left(\frac{nq}{n-k}\right)^{n-k}\right) \approx-(x_k\sqrt{npq}-\frac{x_k^2q}{2}+x_k^2q)+(x_k\sqrt{npq}+\frac{x_k^2p}{2}-x_k^2 p)\\ =-(p+q)\frac{x_k^2}{2}= -\frac{x_k^2}{2}\]

In summary

\[\begin{align} \ln\left(\left(\frac{np}{k}\right)^k\left(\frac{nq}{n-k}\right)^{n-k}\right) &\approx-\frac{x_k^2}{2}\\ \left(\frac{np}{k}\right)^k\left(\frac{nq}{n-k}\right)^{n-k} &\approx\exp({-x_k^2/2}) \end{align}\] and therefore \[\binom{n}{k}p^k q^{n-k}\approx\frac{1}{C\sqrt{npq}}\exp({-x_k^2/2})\]


Recalling that \[\Pr(a\leq S_n\leq b)=\sum_{a\leq x_k\leq b}\binom{n}{k}p^k q^{n-k}\] then \[\Pr(a\leq S_n\leq b)\approx\sum_{a\leq x_k\leq b} \frac{1}{C\sqrt{npq}}\exp({-x_k^2/2})\]

If we call \(h=1/\sqrt{npq}\), then consecutive points satisfy \(x_{k+1}-x_k=h\) and \(h\to 0\) when \(n\to\infty\), so the sum is a Riemann sum that converges to an integral \[\Pr(a\leq S_n\leq b)\underset{n\rightarrow\infty}\longrightarrow\int_{a}^{b}\frac{1}{C}e^{-x^2/2}dx\]

The \(C\) constant

To finish the formula we have to find \(C\). Using the first rule of probabilities (the total probability must be 1) we have \[\Pr(-\infty\leq S_n\leq \infty)=\int_{-\infty}^{\infty}\frac{1}{C}e^{-x^2/2}dx=1\] therefore \[C=\int_{-\infty}^{\infty}e^{-x^2/2}dx=\sqrt{2\pi}\]
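The value of this integral can be checked with a simple Riemann sum (a sketch; integrating over \([-10,10]\) suffices because the tails beyond that are negligible):

```python
from math import exp, sqrt, pi

h = 0.001
# Riemann sum of exp(-x^2/2) over [-10, 10] with step h
total = h * sum(exp(-(k * h) ** 2 / 2) for k in range(-10_000, 10_001))
print(total, sqrt(2 * pi))   # both ≈ 2.5066
```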

The Normal distribution

The random variable \(S_n\) was chosen to have \(\mathbb E S_n=0\) and \(\mathbb V S_n=1\)

We have shown that, when \(n\) is big, \(S_n\) is a random variable with values in \(\mathbb R\) that follows a Normal distribution with mean 0 and variance 1.

In general, if \(X\) is a normal random variable with mean \(\mu\) and variance \(\sigma^2\) then \[\Pr(X\leq b)=\int_{-\infty}^{b}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/(2\sigma^2)}dx\]
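This integral has no closed form and is evaluated numerically in practice; in Python it can be written with the standard error function \(\operatorname{erf}\) (a sketch using the identity \(\Pr(X\leq b)=\tfrac12\left(1+\operatorname{erf}\bigl(\tfrac{b-\mu}{\sigma\sqrt 2}\bigr)\right)\)):

```python
from math import erf, sqrt

def normal_cdf(b, mu=0.0, sigma=1.0):
    """Pr(X <= b) for X ~ Normal(mu, sigma^2), via the error function."""
    return 0.5 * (1 + erf((b - mu) / (sigma * sqrt(2))))

print(normal_cdf(0))      # 0.5 by symmetry
print(normal_cdf(1.96))   # ≈ 0.975
```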

Central Limit Theorem

We have shown that the Binomial distribution, once centered and scaled so that its mean and variance stay fixed, converges to a Normal distribution when \(n\) grows.

Since the Binomial distribution is a sum of “coins” \(X_i\), we have shown that if we center and scale a sum of “coins”, all independent, all with the same distribution, then \[\frac{\sum_{i=1}^n X_i-np}{\sqrt{npq}}\] will converge to a Normal distribution.
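A simulation makes the convergence visible (a sketch; the parameters \(n=500\), \(p=0.3\) and the threshold 1 are arbitrary, and the empirical fraction of standardized sums below 1 should approach the standard Normal cdf at 1, about 0.841):

```python
import random
from math import erf, sqrt

random.seed(0)
n, p = 500, 0.3
q = 1 - p

def standardized_sum():
    b = sum(random.random() < p for _ in range(n))   # B_n: number of successes
    return (b - n * p) / sqrt(n * p * q)             # centered and scaled

samples = [standardized_sum() for _ in range(4000)]
frac = sum(s <= 1 for s in samples) / len(samples)
phi = 0.5 * (1 + erf(1 / sqrt(2)))                   # Normal(0, 1) cdf at 1
print(frac, phi)
```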

General case of CLT