Methodology of Scientific Research

Sample variance vs. population variance

The variance $$\mathbb{V}X$$ and the mean $$\mathbb{E}X$$ of the population are often unknown

Usually we only have a small sample $$\mathbf{X} = (X_1,\ldots,X_n)$$

Assuming that all $$X_i$$ are taken from the same population and are mutually independent, what can we say about the sample mean and variance?

Sample variance

The variance of a set of numbers is easy to calculate

\begin{aligned}\text{var}(X_1,\ldots,X_n)& =\frac{1}{n}\sum_i (X_i-\bar{\mathbf{X}})^2\\ &=\frac{1}{n}\sum_i X_i^2-\bar{\mathbf{X}}^2\end{aligned}

(the average of the squares minus the square of the average, where $$\bar{\mathbf{X}}=\frac{1}{n}\sum_i X_i$$ is the sample mean)
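As a quick illustration, the two forms of the formula can be compared numerically. This is only a sketch: the function names `var_definition` and `var_shortcut` and the numbers in `sample` are made up for the example.

```python
def var_definition(xs):
    """Mean of the squared deviations from the sample mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def var_shortcut(xs):
    """Average of the squares minus the square of the average."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(x * x for x in xs) / n - mean ** 2


# Made-up numbers, chosen so that the mean is exactly 5.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(var_definition(sample))  # 4.0
print(var_shortcut(sample))    # 4.0
```

With floating-point data the shortcut form can lose precision when the mean is large compared to the spread, so the first form is usually the safer one to compute.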

Since the sample is random, this is a random variable. What is its expected value?

Expected value of sample variance

Since the expected value is linear, $$\mathbb{E}(\alpha X+\beta Y)=\alpha\mathbb{E}(X)+\beta\mathbb{E}(Y),$$ we have

\begin{aligned} \mathbb{E}\,\text{var}(X_1,\ldots,X_n)&=\mathbb{E}\left(\frac{1}{n}\sum_i X_i^2-\left(\frac{1}{n}\sum_i X_i\right)^2\right)\\ &=\frac{1}{n}\sum_i \mathbb{E}\left(X_i^2\right)-\frac{1}{n^2}\mathbb{E} \left(\left(\sum_i X_i\right)^2\right)\end{aligned}

First part

Now, since the sample is i.i.d. (in particular, identically distributed), we have $$\mathbb{E} \left(X_i^2\right)=\mathbb{E} \left(X^2\right)$$ and $\sum_i\mathbb{E} \left(X_i^2\right)=n\mathbb{E} \left(X^2\right),$ therefore $\mathbb{E}\,\text{var}(X_1,\ldots,X_n)=\frac{1}{n}n \mathbb{E}\left(X^2\right)-\frac{1}{n^2}\mathbb{E} \left(\left(\sum_i X_i\right)^2\right)$

Second part

We can expand the second part as $\left(\sum_i X_i\right)^2=\left(\sum_i X_i\right)\left(\sum_j X_j\right)=\sum_i \sum_j X_i X_j,$ therefore $\mathbb{E} \left(\left(\sum_i X_i\right)^2\right)=\sum_i \sum_j \mathbb{E} (X_i X_j).$ Here we have two cases.

Two cases

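If $$i=j,$$ we have $\mathbb{E} (X_i X_j) = \mathbb{E} (X_i^2)=\mathbb{E} (X^2)$ If $$i\neq j,$$ then, since the draws are independent, we have $\mathbb{E} (X_i X_j) = \mathbb{E}(X_i)\mathbb{E}(X_j)=(\mathbb{E}X)^2$ The double sum has $$n$$ terms with $$i=j$$ and $$n(n-1)$$ terms with $$i\neq j,$$ therefore $\mathbb{E}\left(\left(\sum_i X_i\right)^2\right)= n \mathbb{E} (X^2) + n(n-1)(\mathbb{E} X)^2$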
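The counting argument above can be checked exactly on a small example. The sketch below assumes a hypothetical population (one fair six-sided die) and a sample size of 3; neither comes from the derivation itself. It enumerates every possible sample instead of simulating, so there is no random error.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # assumed population: a fair six-sided die
n = 3                # assumed sample size

# Population moments E(X) and E(X^2), as exact fractions.
EX = Fraction(sum(faces), 6)
EX2 = Fraction(sum(f * f for f in faces), 6)

# Left side: E((sum_i X_i)^2), averaged over all 6**n equally likely samples.
outcomes = list(product(faces, repeat=n))
lhs = Fraction(sum(sum(s) ** 2 for s in outcomes), len(outcomes))

# Right side: n terms with i = j plus n(n-1) terms with i != j.
rhs = n * EX2 + n * (n - 1) * EX ** 2

print(lhs, rhs)  # both sides should agree
```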

Putting it all together

\begin{aligned} \mathbb{E}\,\text{var}(X_1,\ldots,X_n)&=\frac{1}{n}n\mathbb{E} (X^2)-\frac{1}{n^2}\left(n \mathbb{E} (X^2) + n(n-1)(\mathbb{E} X)^2\right)\\ & =\frac{1}{n}\left((n-1)\mathbb{E} (X^2)-(n-1)(\mathbb{E} X)^2\right)\\ & =\frac{n-1}{n}\left(\mathbb{E} (X^2)-(\mathbb{E} X)^2\right)\\ & =\frac{n-1}{n}\mathbb{V}X \end{aligned}

In summary

We have found that $\mathbb{E}\,\text{var}(X_1,\ldots,X_n)= \frac{n-1}{n}\mathbb{V}X$ so, on average, the sample variance underestimates the population variance by a factor of $$\frac{n-1}{n}$$
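This result can be verified exactly for a small case. The following sketch assumes, purely for illustration, that the population is a fair six-sided die (population variance 35/12); the helper `sample_var` and the dictionary `expected_var` are names invented here. It averages the sample variance over every possible sample of size 2, 3 and 4.

```python
from fractions import Fraction
from itertools import product


def sample_var(xs):
    """Variance of a set of numbers (denominator n), as an exact fraction."""
    n = len(xs)
    mean = Fraction(sum(xs), n)
    return sum((x - mean) ** 2 for x in xs) / n


faces = range(1, 7)  # assumed population: a fair six-sided die
EX = Fraction(sum(faces), 6)
pop_var = Fraction(sum(f * f for f in faces), 6) - EX ** 2  # 35/12

expected_var = {}
for n in (2, 3, 4):
    # All 6**n samples are equally likely, so the expectation is a plain average.
    samples = list(product(faces, repeat=n))
    expected_var[n] = sum(sample_var(s) for s in samples) / len(samples)
    print(n, expected_var[n], Fraction(n - 1, n) * pop_var)
```

For each `n` the two printed fractions should coincide, and the gap to 35/12 shrinks as `n` grows.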

Sample variance is biased

If we want to estimate the mean $$\mathbb{E}X$$ of a population we can use the sample mean $$\bar{\mathbf{X}}$$

But if we want to estimate the variance $$\mathbb{V}X$$ of a population, the sample variance $$\text{var}(X_1,\ldots,X_n)$$ is a biased estimator: on average it is too small

Instead we have to use a different formula $\hat{\mathbb{V}}(X) = \frac{1}{n-1}\sum_i(X_i-\bar{\mathbf{X}})^2$
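A short check, using only the expected value of the sample variance derived earlier, suggests why dividing by $$n-1$$ instead of $$n$$ removes the bias:

\begin{aligned} \mathbb{E}\,\hat{\mathbb{V}}(X) &= \mathbb{E}\left(\frac{1}{n-1}\sum_i (X_i-\bar{\mathbf{X}})^2\right) = \frac{n}{n-1}\,\mathbb{E}\,\text{var}(X_1,\ldots,X_n)\\ &= \frac{n}{n-1}\cdot\frac{n-1}{n}\,\mathbb{V}X = \mathbb{V}X \end{aligned}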

Two formulas for variance

People use two formulas, depending on the case

• If you only care about the sample $$\mathbf{x}$$, its variance is $\text{var}(\mathbf{x}) =\frac{1}{n}\sum_i (x_i-\bar{\mathbf{x}})^2=\frac{1}{n}\sum_i x_i^2-\bar{\mathbf{x}}^2$

• If you care about the population, but only have a sample $\hat{\mathbb{V}}(X) = \frac{1}{n-1}\sum_i(x_i-\bar{\mathbf{x}})^2 = \frac{n}{n-1}\text{var}(\mathbf{x})$
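In Python, the standard `statistics` module offers both formulas: `pvariance` divides by n and `variance` divides by n-1. The sketch below uses made-up numbers and the illustrative names `var_sample` and `var_hat` to show the relation between the two.

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
n = len(x)

var_sample = statistics.pvariance(x)  # denominator n: variance of these numbers
var_hat = statistics.variance(x)      # denominator n - 1: estimate for the population

print(var_sample)                # 4.0
print(var_hat)                   # 32/7, about 4.571
print(n / (n - 1) * var_sample)  # matches var_hat up to rounding
```

Other libraries choose different defaults, so it is worth checking which denominator a given function uses before reporting a variance.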

In summary

• When experiments produce numbers we can calculate their average and variance

• The population has a fixed mean and variance, even if we do not know their values

• If we have an i.i.d. sample we can estimate the population mean with the sample mean

• If the sample is not i.i.d., its mean may not correspond to the population mean

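If the sample is i.i.d., then

• The sample mean is probably close to the population mean, regardless of the probability distribution (as long as the population variance is finite)

• If the sample is 4 times bigger, the typical distance between the sample mean and the population mean is 2 times smaller, because the standard deviation of the sample mean is $$\sqrt{\mathbb{V}X/n}$$

• The sample variance is a biased estimator of the population variance: on average it is too small by a factor of $$\frac{n-1}{n}$$

• To estimate the population variance we use the formula with $$n-1$$ in the denominator instead.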
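The "4 times bigger, 2 times closer" rule can be explored with a small simulation. This is a sketch under assumptions that are not part of the summary above: the two populations (uniform on [0, 1) and exponential with rate 1), the sample sizes 25 and 100, the 2000 repetitions and the helper name `spread_of_sample_mean` are all chosen only for illustration.

```python
import random
import statistics


def spread_of_sample_mean(draw, n, repetitions=2000):
    """Standard deviation of the sample mean over many simulated samples of size n."""
    means = [statistics.fmean([draw() for _ in range(n)])
             for _ in range(repetitions)]
    return statistics.pstdev(means)


rng = random.Random(0)  # fixed seed so that the run is reproducible

populations = {
    "uniform": rng.random,                          # uniform on [0, 1)
    "exponential": lambda: rng.expovariate(1.0),    # exponential with rate 1
}

ratios = {}
for name, draw in populations.items():
    small = spread_of_sample_mean(draw, 25)
    large = spread_of_sample_mean(draw, 100)
    ratios[name] = small / large
    print(name, round(ratios[name], 2))  # expected to be near 2
```

Because this is a simulation, the ratios are only approximately 2; more repetitions bring them closer.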