# Methodology of Scientific Research

## Sample variance v/s population variance

Variance $$𝕍X$$ and mean $$𝔼X$$ of the population are often unknown

Usually we only have a small sample $$𝐗 = (X_1,…,X_n)$$

Assuming that all $$X_i$$ are taken from the same population and are mutually independent, what can we say about the sample mean and variance?

## Sample variance

The variance of a set of numbers is easy to calculate

\begin{aligned}\text{var}(X_1,…,X_n)& =\frac{1}{n}\sum_i (X_i-\bar{𝐗})^2\\ &=\frac{1}{n}\sum_i X_i^2-\bar{𝐗}^2\end{aligned}

(the average of the squares minus the square of the average)

Since the sample is random, this is a random variable. What is its expected value?
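Both forms of the computation can be sketched in Python (the function names are ours, for illustration):

```python
import random

def var_biased(xs):
    """Sample variance: mean of squared deviations from the sample mean (1/n)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

def var_biased_alt(xs):
    """The same quantity via 'average of the squares minus square of the average'."""
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

# The two forms agree on any sample:
xs = [random.gauss(0.0, 1.0) for _ in range(10)]
assert abs(var_biased(xs) - var_biased_alt(xs)) < 1e-9
```

Both compute the same $\frac{1}{n}$ formula; the second form only needs one pass over the data.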

## Expected value of sample variance

Since $$𝔼(α X+βY)=α𝔼(X)+β𝔼(Y),$$ we have

\begin{aligned} 𝔼\text{var}(X_1,…,X_n)&=𝔼\left(\frac{1}{n}\sum_i X_i^2-\left(\frac{1}{n}\sum_i X_i\right)^2\right)\\ &=\frac{1}{n}\sum_i 𝔼\left(X_i^2\right)-\frac{1}{n^2}𝔼 \left(\left(\sum_i X_i\right)^2\right)\end{aligned}

## First part

Now, since the sample is i.i.d., we have $𝔼 \left(X_i^2\right)=𝔼 \left(X^2\right)$ and $\sum_i𝔼 \left(X_i^2\right)=n𝔼 \left(X^2\right)$, therefore $𝔼\text{var}(X_1,…,X_n)=\frac{1}{n}n 𝔼\left(X^2\right)-\frac{1}{n^2}𝔼 \left(\left(\sum_i X_i\right)^2\right)$

## Second part

We can simplify the second part as $\left(\sum_i X_i\right)^2=\left(\sum_i X_i\right)\left(\sum_j X_j\right)=\sum_i \sum_j X_i X_j$, therefore $𝔼 \left(\left(\sum_i X_i\right)^2\right)=\sum_i \sum_j 𝔼 (X_i X_j)$ Here we have two cases.

## Two cases

If $$i=j,$$ we have $𝔼 (X_i X_j) = 𝔼 (X_i^2)=𝔼 (X^2)$ If $$i≠j,$$ then, since all outcomes are independent, we have $𝔼 (X_i X_j) = 𝔼(X_i)𝔼(X_j)=(𝔼X)^2$ There are $n$ terms with $i=j$ and $n(n-1)$ terms with $i≠j$, therefore $𝔼\left(\left(\sum_i X_i\right)^2\right)= n 𝔼 (X^2) + n(n-1)(𝔼 X)^2$
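This identity is easy to check by simulation; here is a Monte Carlo sketch in Python, taking a fair six-sided die as the population (our choice for illustration, so $𝔼X = 3.5$ and $𝔼(X^2) = 91/6$):

```python
import random

random.seed(2)
n, trials = 4, 200_000
EX, EX2 = 3.5, 91 / 6   # mean and second moment of a fair six-sided die

# Empirical average of (sum of n die rolls)^2 over many trials
emp = sum(sum(random.randint(1, 6) for _ in range(n)) ** 2
          for _ in range(trials)) / trials

# Predicted value: n * E(X^2) + n(n-1) * (E X)^2
pred = n * EX2 + n * (n - 1) * EX ** 2
print(emp, pred)   # both close to 207.67
```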

## Putting it all together

\begin{aligned} 𝔼\text{var}(X_1,…,X_n)&=\frac{1}{n}n𝔼 X^2-\frac{1}{n^2}(n 𝔼 (X^2) + n(n-1)(𝔼 X)^2)\\ & =\frac{1}{n}\left((n-1)𝔼 X^2-(n-1)(𝔼 X)^2\right)\\ & =\frac{n-1}{n}(𝔼 X^2-(𝔼 X)^2)\\ & =\frac{n-1}{n}𝕍X \end{aligned}

## In summary

We have found that $𝔼\text{var}(X_1,…,X_n)= \frac{n-1}{n}𝕍X,$ which is not exactly what we are looking for: the sample variance systematically underestimates the population variance $𝕍X$
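The $\frac{n-1}{n}$ factor shows up clearly in simulation; a Python sketch assuming a standard normal population, so $𝕍X = 1$:

```python
import random

random.seed(0)
n, trials = 5, 200_000

def var_biased(xs):
    """Plain 1/n sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Average the sample variance over many independent samples of size n
avg = sum(var_biased([random.gauss(0.0, 1.0) for _ in range(n)])
          for _ in range(trials)) / trials

print(avg)   # close to (n-1)/n * VX = 4/5 * 1 = 0.8
```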

## Sample variance is biased

If we want to estimate the mean $$𝔼X$$ of a population we can use the sample mean $$\bar{X}$$

But if we want to estimate the variance $$𝕍X$$ of a population we cannot use the sample variance $$\text{var}(X_1,…,X_n)$$

Instead we have to use a different formula, known as Bessel's correction: $\hat{𝕍}(X) = \frac{1}{n-1}\sum_i(X_i-\bar{𝐗})^2$

## Two formulas for variance

People use two formulas, depending on the case:

• If you only care about the sample $$𝐱$$, its variance is $\text{var}(𝐱) =\frac{1}{n}\sum_i (x_i-\bar{𝐱})^2=\frac{1}{n}\sum_i x_i^2-\bar{𝐱}^2$

• If you care about the population, but only have a sample $\hat{𝕍}(X) = \frac{1}{n-1}\sum_i(x_i-\bar{𝐱})^2 = \frac{n}{n-1}\text{var}(𝐱)$
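Python's standard library provides both formulas: `statistics.pvariance` divides by $n$ and `statistics.variance` divides by $n-1$. A small sketch (the data values are arbitrary):

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

v_sample = statistics.pvariance(x)  # 1/n formula: var(x)
v_hat = statistics.variance(x)      # 1/(n-1) formula: the unbiased estimator

# The two are related by the factor n/(n-1)
assert abs(v_hat - n / (n - 1) * v_sample) < 1e-12
print(v_sample, v_hat)
```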

## In summary

• When experiments produce numbers we can calculate average and variance

• The population has a fixed mean and variance, even if we do not know their values

• If we have an i.i.d. sample we can estimate the population mean with the sample mean

• If the sample is not i.i.d., its mean may not correspond to the population mean

## If the sample is i.i.d., then

• The sample mean is probably close to the population mean, whatever the underlying probability distribution

• If the sample is 4 times bigger, the sample mean is typically 2 times closer to the population mean (the error scales as $1/\sqrt{n}$)

• The sample variance is not a good estimate of the population variance.

• We use a different formula in that case.
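The $1/\sqrt{n}$ scaling of the sample mean's error can be checked empirically; a Python sketch assuming a standard normal population (the sample sizes 25 and 100 are our choice):

```python
import random
import statistics

random.seed(1)

def mean_error_std(n, trials=20_000):
    """Spread (std. dev.) of the sample mean across many i.i.d. samples of size n."""
    means = [statistics.fmean(random.gauss(0.0, 1.0) for _ in range(n))
             for _ in range(trials)]
    return statistics.pstdev(means)

e_small, e_big = mean_error_std(25), mean_error_std(100)
# A 4x bigger sample gives a roughly 2x smaller error: 1/sqrt(n) scaling
print(e_small / e_big)   # close to 2
```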