6 December 2017

Different questions

If \(X\) is a random variable that follows some distribution, we can ask several questions

  • What is the expected value of \(X\)?
  • What is the variance of \(X\)?

These two questions can only be answered if we know the full population

In most cases we do not know the real population, but we can assume we know it

Exercise 1

Let’s say that we know that the expected value of \(X\) is \(\mu\) and the variance of \(X\) is \(\sigma^2\): \[\mathbb EX=\mu\qquad\mathbb VX=\sigma^2\]

Then we can ask: What is the probability that \(X\) is greater than a given value \(a\)? \[\Pr(X > a \vert\mu, \sigma^2)?\]

Exercise: Answer that for Binomial \((\mu=50,\sigma^2=25)\) and Normal \((\mu=0, \sigma=1)\) distributions, for different values of \(a.\)
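One possible way to approach this exercise in R, as a sketch. It assumes the Binomial has size \(n=100\) and success probability \(p=0.5,\) which are the values that give \(np=50\) and \(np(1-p)=25.\) The thresholds in `a_binom` and `a_norm` are arbitrary illustrative choices, not part of the exercise.

```r
# Assumption: mean 50 and variance 25 correspond to size = 100, prob = 0.5
size <- 100
prob <- 0.5

# Illustrative thresholds (hypothetical, chosen only for this example)
a_binom <- c(40, 50, 55, 60)
a_norm  <- c(-1, 0, 1, 1.96)

# Pr(X > a) is an upper-tail probability, so we ask for lower.tail = FALSE
p_binom <- pbinom(a_binom, size = size, prob = prob, lower.tail = FALSE)
p_norm  <- pnorm(a_norm, mean = 0, sd = 1, lower.tail = FALSE)

print(data.frame(a = a_binom, p_greater = p_binom))
print(data.frame(a = a_norm, p_greater = p_norm))
```

Note that for the Binomial, \(\Pr(X>a)\) and \(\Pr(X\geq a)\) differ when \(a\) is an integer, while for the Normal they are the same.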

Exercise 2

What is the probability that \(X\) is inside the interval \([a_1,a_2]\)? \[\Pr(a_1\leq X\leq a_2 \vert\mu, \sigma^2)?\]

Exercise: Answer that for Binomial and Normal distributions

Can we replace the Binomial \((\mu=50,\sigma^2=25)\) by a Normal \((\mu=50,\sigma^2=25)\)?
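A rough numerical check of this question in R, as a sketch. It assumes the Binomial has size 100 and probability 0.5, and the interval limits `a1` and `a2` are illustrative. The continuity correction in the last step is an extra idea not covered in these slides.

```r
# Assumption: Binomial with mean 50 and variance 25, i.e. size = 100, prob = 0.5
size  <- 100
prob  <- 0.5
mu    <- size * prob
sigma <- sqrt(size * prob * (1 - prob))

# Illustrative interval limits (hypothetical)
a1 <- 45
a2 <- 55

# Binomial: X is an integer, so Pr(a1 <= X <= a2) = Pr(X <= a2) - Pr(X <= a1 - 1)
p_binom <- pbinom(a2, size, prob) - pbinom(a1 - 1, size, prob)

# Normal with the same mean and variance
p_norm <- pnorm(a2, mu, sigma) - pnorm(a1, mu, sigma)

# Same Normal, widening the interval by 0.5 on each side (continuity correction)
p_norm_cc <- pnorm(a2 + 0.5, mu, sigma) - pnorm(a1 - 0.5, mu, sigma)

print(c(binomial = p_binom, normal = p_norm, normal_corrected = p_norm_cc))
```

Comparing the three numbers for several choices of `a1` and `a2` gives a feeling for when the replacement is acceptable.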

Inverse question

Since in reality we don’t know \(\mu,\) we would like to ask about it:

What is the probability that \(\mu\) is in the range \([b_1,b_2],\) given that in our experiment \(X\) had a value \(a\)?

If \(b_1\) and \(b_2\) are fixed, that question is useless. The answer is either 1 or 0, since \(\mu\) is not random. Instead we want to find two functions \(b_1(X)\) and \(b_2(X)\) depending on the experiment result \(X\) such that

\[\Pr(b_1(X)<\mu<b_2(X))=1-\alpha\]

where \(\alpha\) is a small number, typically 0.05 or 0.01

Confidence interval

Exercise: Find 90% confidence intervals

  • For a Binomial distribution
  • For a Normal distribution

Formula

If \(X\) follows a Normal\((\mu,\sigma^2)\), then the value \[Z=\frac{X-\mu}{\sigma}\] is also random and follows a Normal(0,1). In particular the average \(\bar x\) of a sample (i.i.d.) is Normal\((\mu,\sigma^2/n),\) so \[Z=\frac{\bar x-\mu}{\sigma/\sqrt{n}}\] is also Normal(0,1)

Formula

Therefore we can calculate \(\Pr(c_1<Z<c_2)\) for any \(c_1,c_2.\)

Since the Normal distribution is symmetrical around 0 we can choose \(c_1=-c_2.\) That will give us the narrowest interval

So, given a confidence level \(1-\alpha,\) we look for \(c\) such that \[\Pr(-c<Z<c)=1-\alpha\]

Finding \(c\)

Again, since the normal distribution is symmetric, \(c\) will be such that \[\Pr(Z< -c)=\Pr(Z>c)=\alpha/2\] This is the value we have to find in a table, or using R

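alpha <- 0.05                        # example: 95% confidence level
qnorm(1 - alpha/2)                   # value with probability 1 - alpha/2 below it
qnorm(alpha/2, lower.tail = FALSE)   # same value, asking for the upper tail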

What about the interval?

Once we have found \(c\) we can build our interval. If \(-c<Z<c\) then \[-c<\frac{\bar x-\mu}{\sigma/\sqrt{n}}<c\] so \[-c\sigma/\sqrt{n}<\bar x-\mu<c\sigma/\sqrt{n}\] Subtracting \(\bar x\) and multiplying by \(-1\) reverses the inequalities, so \[\bar x-c\sigma/\sqrt{n}<\mu<\bar x+c\sigma/\sqrt{n}\]

In summary

If the average follows a Normal distribution and we know the population variance, then a confidence interval for \(\mu\) is \[\begin{aligned} b_1&=\bar x-c(\alpha)\sigma/\sqrt{n}\\ b_2&=\bar x+c(\alpha)\sigma/\sqrt{n} \end{aligned}\] where \(c(\alpha)\) is taken from the Normal(0,1) table so that \(\Pr(-c(\alpha)<Z<c(\alpha))=1-\alpha\)
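The summary above can be sketched in R as follows. The function name `z_interval` and the values of `mu`, `sigma`, `n` and `alpha` are hypothetical choices for illustration. The simulation at the end is a rough check that about \(1-\alpha\) of the intervals contain \(\mu\) when the data really are Normal with known \(\sigma.\)

```r
# Hypothetical helper: confidence interval for mu when sigma is known
z_interval <- function(x, sigma, alpha = 0.05) {
  n <- length(x)
  c_alpha <- qnorm(1 - alpha / 2)          # c(alpha) from the Normal(0,1)
  half_width <- c_alpha * sigma / sqrt(n)
  c(b1 = mean(x) - half_width, b2 = mean(x) + half_width)
}

# Illustrative population values (assumed known here, only to check coverage)
set.seed(1)
mu    <- 10
sigma <- 2
n     <- 25
alpha <- 0.10

# Repeat the experiment many times and record whether mu falls in the interval
covered <- replicate(10000, {
  ci <- z_interval(rnorm(n, mean = mu, sd = sigma), sigma, alpha)
  ci[["b1"]] < mu && mu < ci[["b2"]]
})

# Fraction of intervals containing mu; expected to be near 1 - alpha
print(mean(covered))
```

In each single experiment the interval either contains \(\mu\) or it does not. The level \(1-\alpha\) describes how often the procedure succeeds over many repetitions.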

But we don’t know the population variance

Can we use the sample variance?

  • Not the version that divides by \(n\), because it is biased
  • But we can use the unbiased variance estimator \[S_n^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar x)^2\] whose square root \(S_n\) estimates \(\sigma\)
  • and we have to pay a price
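A small illustration in R of the two ways of computing the variance from a sample. The data vector `x` is made up for this example. R's built-in `var()` divides by \(n-1\), so it matches the unbiased estimator.

```r
# Made-up sample, only for illustration
x <- c(4.1, 5.3, 6.0, 4.8, 5.6)
n <- length(x)

# Dividing by n gives the biased version
biased <- sum((x - mean(x))^2) / n

# Dividing by n - 1 gives the unbiased estimator
unbiased <- sum((x - mean(x))^2) / (n - 1)

# R's var() uses n - 1, so it agrees with the unbiased estimator
print(all.equal(unbiased, var(x)))

# The two versions differ by the factor (n - 1) / n
print(c(biased = biased, unbiased = unbiased, ratio = biased / unbiased))
```

The biased version is always the smaller of the two, and the difference shrinks as \(n\) grows.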

Price of ignorance

Since we do not know \(\sigma^2\) we have to estimate it using the data.

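The estimate \(S_n\) is computed from the same random sample as \(\bar x,\) so it also changes from sample to sample

Thus, replacing \(\sigma\) by \(S_n\) adds extra variability, and the resulting interval has to be wider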

T instead of Z

Now instead of \[Z=\frac{\bar x-\mu}{\sigma/\sqrt{n}}\] we define \[T=\frac{\bar x-\mu}{S_n/\sqrt{n}}\] which does not follow a Normal, but a Student’s t-distribution

Student’s t-distribution

The “frequency distribution of standard deviations of samples drawn from a normal population”

This is a family of distributions, depending on a parameter called “degrees of freedom”

Since the sample has size \(n\) we initially have \(n\) degrees of freedom. But the average is fixed, so we lose one degree of freedom, and \(T\) follows a t-distribution with \(n-1\) degrees of freedom
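The statistic \(T\) and the \(n-1\) degrees of freedom can be combined into an interval for \(\mu\) when \(\sigma\) is unknown. The following is a sketch in R. The function name `t_interval` and the data in `x` are hypothetical. The result is compared with the interval reported by R's own `t.test()`.

```r
# Hypothetical helper: confidence interval for mu when sigma is unknown
t_interval <- function(x, alpha = 0.05) {
  n <- length(x)
  s <- sd(x)                                # square root of the n - 1 variance
  c_alpha <- qt(1 - alpha / 2, df = n - 1)  # limit from Student's t, n - 1 DF
  half_width <- c_alpha * s / sqrt(n)
  c(b1 = mean(x) - half_width, b2 = mean(x) + half_width)
}

# Made-up sample, only for illustration
x <- c(4.1, 5.3, 6.0, 4.8, 5.6)

# 90% confidence interval
ci <- t_interval(x, alpha = 0.10)
print(ci)

# The built-in t.test() reports an interval that can be used as a cross-check
ref <- t.test(x, conf.level = 0.90)$conf.int
print(ref)
```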

Student’s t-distribution

Some typical interval limits

DF        90%     95%     99%
1        6.31   12.71   63.66
2        2.92    4.30    9.92
3        2.35    3.18    5.84
4        2.13    2.78    4.60
5        2.02    2.57    4.03
30       1.70    2.04    2.75
Normal   1.64    1.96    2.58
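The limits in this table can be recomputed in R, as a sketch. It assumes the limits are for symmetric two-sided intervals, so each one is the value with probability \((1+\text{level})/2\) below it. The variable names are illustrative.

```r
# Degrees of freedom and confidence levels shown in the table
df     <- c(1, 2, 3, 4, 5, 30)
levels <- c(0.90, 0.95, 0.99)

# Two-sided limit: the value with probability (1 + level) / 2 below it
t_limits <- sapply(levels, function(level) qt((1 + level) / 2, df = df))
z_limits <- qnorm((1 + levels) / 2)

# One row per DF, plus a last row for the Normal(0,1)
limits <- rbind(t_limits, z_limits)
dimnames(limits) <- list(c(df, "Normal"), c("90%", "95%", "99%"))

print(round(limits, 2))
```

The limits of the t-distribution are always larger than those of the Normal, and they approach the Normal values as the degrees of freedom grow.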