Class 6.2: Estimating the population variance

Methodology of Scientific Research

Andrés Aravena, PhD

28 April 2021

Sample variance vs. population variance

The variance \(\mathbb{V}X\) and mean \(\mathbb{E}X\) of the population are often unknown

Usually we only have a small sample \(\mathbf{X} = (X_1,\ldots,X_n)\)

Assuming that all \(X_i\) are taken from the same population and are mutually independent, what can we say about the sample mean and variance?

Sample variance

The variance of a set of numbers is easy to calculate

\[\begin{aligned}\text{var}(X_1,\ldots,X_n)& =\frac{1}{n}\sum_i (X_i-\bar{\mathbf{X}})^2\\ &=\frac{1}{n}\sum_i X_i^2-\bar{\mathbf{X}}^2\end{aligned}\]

(the average of the squares minus the square of the average)
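As a quick check that the two forms agree, here is a minimal Python sketch. The function names and the data values are made up for illustration; they are not part of the lecture.

```python
def sample_variance(xs):
    """Variance of a list of numbers, dividing by n (not n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def sample_variance_shortcut(xs):
    """Same quantity: average of the squares minus square of the average."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(x * x for x in xs) / n - mean ** 2


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
print(sample_variance(data))
print(sample_variance_shortcut(data))
```

Both calls should print the same value, up to floating-point rounding. With real data the first form is usually the numerically safer one, because the shortcut subtracts two large numbers.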

Since the sample is random, this is a random variable. What is its expected value?

Expected value of sample variance

Since \(\mathbb{E}(\alpha X+\beta Y)=\alpha\mathbb{E}(X)+\beta\mathbb{E}(Y),\) we have

\[\begin{aligned} \mathbb{E}\,\text{var}(X_1,\ldots,X_n)&=\mathbb{E}\left(\frac{1}{n}\sum_i X_i^2-\left(\frac{1}{n}\sum_i X_i\right)^2\right)\\ &=\frac{1}{n}\sum_i \mathbb{E}\left(X_i^2\right)-\frac{1}{n^2}\mathbb{E} \left(\left(\sum_i X_i\right)^2\right)\end{aligned}\]

First part

Now, since the sample is i.i.d., we have \(\mathbb{E} \left(X_i^2\right)=\mathbb{E} \left(X^2\right)\) and

\[\sum_i\mathbb{E} \left(X_i^2\right)=n\mathbb{E} \left(X^2\right)\]

therefore

\[\mathbb{E}\,\text{var}(X_1,\ldots,X_n)=\frac{1}{n}n \mathbb{E}\left(X^2\right)-\frac{1}{n^2}\mathbb{E} \left(\left(\sum_i X_i\right)^2\right)\]

Second part

We can expand the second part as

\[\left(\sum_i X_i\right)^2=\left(\sum_i X_i\right)\left(\sum_j X_j\right)=\sum_i \sum_j X_i X_j\]

therefore

\[\mathbb{E} \left(\left(\sum_i X_i\right)^2\right)=\sum_i \sum_j \mathbb{E} (X_i X_j)\]

Here we have two cases.

Two cases

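If \(i=j,\) we have

\[\mathbb{E} (X_i X_j) = \mathbb{E} (X_i^2)=\mathbb{E} (X^2)\]

If \(i\neq j,\) and since all outcomes are independent, we have

\[\mathbb{E} (X_i X_j) = \mathbb{E}(X_i)\mathbb{E}(X_j)=(\mathbb{E}X)^2\]

The double sum has \(n\) terms with \(i=j\) and \(n(n-1)\) terms with \(i\neq j,\) therefore

\[\mathbb{E}\left(\left(\sum_i X_i\right)^2\right)= n \mathbb{E} (X^2) + n(n-1)(\mathbb{E} X)^2\]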
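The identity for the expected square of the sum can be checked exactly on a small example. The sketch below assumes the population is a single fair die and the sample size is 3; both choices are illustrative only. It enumerates every equally likely sample and uses exact fractions, so no simulation is involved.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]   # assumed population: one fair die
n = 3                        # assumed sample size

# Population moments E(X) and E(X^2), computed exactly
EX = Fraction(sum(faces), len(faces))
EX2 = Fraction(sum(f * f for f in faces), len(faces))

# Left side: average of (X_1 + ... + X_n)^2 over all equally likely samples
samples = list(product(faces, repeat=n))
lhs = Fraction(sum(sum(s) ** 2 for s in samples), len(samples))

# Right side: n terms with i = j plus n(n-1) terms with i != j
rhs = n * EX2 + n * (n - 1) * EX ** 2

print(lhs, rhs)
```

The two printed values should be identical, since both sides are computed without rounding.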

Putting it all together

\[\begin{aligned} \mathbb{E}\,\text{var}(X_1,\ldots,X_n)&=\frac{1}{n}n\mathbb{E} (X^2)-\frac{1}{n^2}\left(n \mathbb{E} (X^2) + n(n-1)(\mathbb{E} X)^2\right)\\ & =\frac{1}{n}\left((n-1)\mathbb{E} (X^2)-(n-1)(\mathbb{E} X)^2\right)\\ & =\frac{n-1}{n}\left(\mathbb{E} (X^2)-(\mathbb{E} X)^2\right)\\ & =\frac{n-1}{n}\mathbb{V}X \end{aligned}\]

In summary

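we have found that

\[\mathbb{E}\,\text{var}(X_1,\ldots,X_n)= \frac{n-1}{n}\mathbb{V}X\]

which is not exactly what we are looking for: on average, the sample variance underestimates the population variance by a factor of \(\frac{n-1}{n}\)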
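This result can be verified exactly for a small case. The following sketch assumes, purely for illustration, that the population is a fair die; it averages the sample variance over every possible sample of size 2, 3 and 4 and compares the result with \(\frac{n-1}{n}\) times the population variance. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]   # assumed population: one fair die


def var_n(xs):
    """Variance dividing by n, computed with exact fractions."""
    n = len(xs)
    mean = Fraction(sum(xs), n)
    return sum((x - mean) ** 2 for x in xs) / n


population_variance = var_n(faces)


def expected_sample_variance(n):
    """Average of var_n over all equally likely samples of size n."""
    samples = list(product(faces, repeat=n))
    return sum(var_n(s) for s in samples) / len(samples)


for n in (2, 3, 4):
    print(n, expected_sample_variance(n),
          Fraction(n - 1, n) * population_variance)
```

For each sample size, the last two columns should match exactly.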

Sample variance is biased

If we want to estimate the mean \(\mathbb{E}X\) of a population we can use the sample mean \(\bar{\mathbf{X}}\)

But if we want to estimate the variance \(\mathbb{V}X\) of a population we cannot use the sample variance \(\text{var}(X_1,\ldots,X_n)\) directly, because its expected value is smaller than \(\mathbb{V}X\)

Instead we have to use a different formula

\[\hat{\mathbb{V}}(X) = \frac{1}{n-1}\sum_i(X_i-\bar{\mathbf{X}})^2\]
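A short check of why dividing by \(n-1\) removes the bias, using only the expected value of the sample variance found earlier and the linearity of expectation. The corrected formula is the sample variance multiplied by \(\frac{n}{n-1},\) so

\[\begin{aligned} \mathbb{E}\,\hat{\mathbb{V}}(X) &= \mathbb{E}\left(\frac{n}{n-1}\,\text{var}(X_1,\ldots,X_n)\right)\\ &= \frac{n}{n-1}\,\mathbb{E}\,\text{var}(X_1,\ldots,X_n)\\ &= \frac{n}{n-1}\cdot\frac{n-1}{n}\,\mathbb{V}X\\ &= \mathbb{V}X \end{aligned}\]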

Two formulas for variance

People use two formulas for a sample \(\mathbf{x}\), depending on the case

  • If you only care about the sample, its variance is \[\text{var}(\mathbf{x}) =\frac{1}{n}\sum_i (x_i-\bar{\mathbf{x}})^2=\frac{1}{n}\sum_i x_i^2-\bar{\mathbf{x}}^2\]

  • If you care about the population, but only have a sample, the estimate is \[\hat{\mathbb{V}}(X) = \frac{1}{n-1}\sum_i(x_i-\bar{\mathbf{x}})^2 = \frac{n}{n-1}\text{var}(\mathbf{x})\]
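In Python's standard library the two formulas correspond to two different functions, which makes the distinction easy to see. The sketch below uses made-up data; `statistics.pvariance` divides by \(n\) and `statistics.variance` divides by \(n-1\).

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
n = len(x)

var_sample = statistics.pvariance(x)    # divides by n: variance of these numbers
var_estimate = statistics.variance(x)   # divides by n - 1: population estimate

print(var_sample)
print(var_estimate)
print(n / (n - 1) * var_sample)         # should match var_estimate
```

Other tools choose different defaults, so it is worth checking which divisor a given function uses before reporting a variance.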

In summary

  • When experiments produce numbers we can calculate average and variance

  • The population has a fixed mean and variance, even if we do not know their values

  • If we have an i.i.d. sample we can estimate the population mean with the sample mean

  • If the sample is not i.i.d., its mean may not correspond to the population mean

If the sample is i.i.d., then

  • The sample mean is probably close to the population mean, regardless of the probability distribution

  • If the sample is 4 times bigger, the sample mean is typically 2 times closer to the population mean

  • The sample variance is a biased estimate of the population variance.

    • To estimate the population variance we divide by \(n-1\) instead of \(n\).
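The claim about sample size can be explored with a small simulation. This is only a sketch under stated assumptions: the population is taken to be uniform on the interval from 0 to 1, the sample sizes are 25 and 100, and the seed and number of repetitions are arbitrary. It measures how much the sample mean varies from sample to sample for each size.

```python
import random
import statistics

rng = random.Random(2021)  # fixed seed so the run is repeatable


def spread_of_sample_mean(n, repetitions=2000):
    """Standard deviation of the sample mean over many simulated samples."""
    means = [
        statistics.fmean([rng.uniform(0, 1) for _ in range(n)])
        for _ in range(repetitions)
    ]
    return statistics.pstdev(means)


small = spread_of_sample_mean(25)    # sample size n
large = spread_of_sample_mean(100)   # sample size 4n
print(small, large, small / large)
```

The last printed number is the ratio of the two spreads and should be near 2, although it will not be exactly 2 because the simulation itself is random.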