# Methodology of Scientific Research

## Average age in Türkiye

Take a look at the age distribution of Türkiye

• What is the average age?
• Are men older or younger than women, on average?

## Averages

In everyday life, if $$𝐱 = \{x_1,…,x_n\}$$ we have $\text{mean}(𝐱)= \frac{1}{n}\sum_i x_i$

## Using proportions

Now, if we count how many times each distinct value appears, $m_j = \text{number of times that }(x_i=j)$ then we can write $\text{mean}(𝐱) =\sum_j j⋅\frac{m_j}{n}$

In other words, to calculate the average we only need to know the proportions $$m_j/n$$

## In detail

\begin{aligned} \text{mean}(𝐱)& = \frac{1}{n}\sum_i x_i\\ & = \frac{1}{n} (\underbrace{1+\cdots+1}_{m_1}+ \underbrace{2+\cdots+2}_{m_2}+\cdots)\\ & = \frac{1}{n} (1⋅m_1 + 2⋅m_2+3⋅m_3+\cdots)\\ & = 1⋅\frac{m_1}{n} + 2⋅\frac{m_2}{n} +\cdots =\sum_j j⋅ \frac{m_j}{n} \end{aligned}
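The identity above can be checked numerically. The following is a minimal sketch in Python; the list `ages` is made-up toy data and the function names are chosen only for this illustration.

```python
from collections import Counter
from fractions import Fraction

def mean_direct(values):
    """Mean as the sum of all values divided by how many there are."""
    return Fraction(sum(values), len(values))

def mean_from_proportions(values):
    """Mean as the sum of each distinct value j times its proportion m_j / n."""
    n = len(values)
    counts = Counter(values)  # counts[j] plays the role of m_j
    return sum(j * Fraction(m_j, n) for j, m_j in counts.items())

# Made-up toy data, only for illustration
ages = [1, 1, 2, 3, 3, 3]
direct = mean_direct(ages)
weighted = mean_from_proportions(ages)
print(direct, weighted)  # both are 13/6
```

Using `Fraction` keeps the arithmetic exact, so the two results can be compared with `==` without rounding issues.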

## Exercise

• Calculate the average age of the population of Türkiye

• Calculate the average age of men and women in Türkiye

• Compare them

# When Outcomes are Numbers

## Probability Distribution

Let’s say that $$Ω=\{a_1, a_2, …,a_n\}$$

The probability distribution is a function $p: Ω → [0,1]$

$p(a_i) = ℙ(X=a_i)= ℙ(\text{outcome is exactly }a_i)$

## Probabilities as proportions

There are $$m_c$$ cards of color $$c$$, for each $$c\in\{\text{“red”},\text{“green”},\text{“blue”},\text{“yellow”}\}$$

There are $$n=\sum_c m_c$$ cards in total

If we do not have any reason to expect any order of cards, then each individual card has the same probability $$1/n$$

The probability of “the first card has color $$c$$” is $ℙ(\text{color is }c)=\frac{m_c}{n}$
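As a small illustration of probabilities as proportions, the sketch below computes $$m_c/n$$ for each color. The counts in `cards_per_color` are hypothetical, invented for this example.

```python
from fractions import Fraction

# Hypothetical deck: number of cards m_c of each color (made-up counts)
cards_per_color = {"red": 5, "green": 3, "blue": 8, "yellow": 4}

n = sum(cards_per_color.values())  # total number of cards
prob_color = {c: Fraction(m_c, n) for c, m_c in cards_per_color.items()}

for color, p in prob_color.items():
    print(color, p)
```

Since every card is counted under exactly one color, the probabilities add up to 1.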

## When outcomes are numbers

The most important applications of probabilities are when the outcomes are numbers

More generally, we care about numbers that depend on the outcome of the experiment

• dice: ⚀ $$↦ 1$$, ⚁ $$↦ 2$$, …, ⚅ $$↦ 6$$
• coins: “Heads” $$↦ 1$$, “Tails” $$↦ 0$$
• temperature
• number of cells
• anything we measure

## We can do arithmetic with numbers

If the outcomes are numbers, we can use them in formulas

For example, if “Heads” $$↦1$$ and “Tails” $$↦0$$, then we can ask

“What is the sum when we throw $$N$$ coins?”

Or if ⚀ $$↦ 1$$, ⚁ $$↦ 2$$, …, ⚅ $$↦ 6$$ we can ask

“What is the average sum of two dice?”
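One way to answer the dice question is to list every possible pair of faces and average the sums. This sketch assumes two fair, independent dice, so that all 36 pairs are equally likely.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# All 36 pairs (first die, second die), assumed equally likely
outcomes = list(product(faces, repeat=2))

average_sum = Fraction(sum(a + b for a, b in outcomes), len(outcomes))
print(average_sum)  # 7
```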

## Random variable

The case where outcomes are numbers is so important that it has a special name

We call them random variables

We represent them with capital letters, like $$X$$

Then we can ask: “$$X>1$$” or “$$X=2$$” or “$$X=x$$”

In this last example $$x$$ is a fixed number, and $$X$$ is random

## Expected value – Mean value

For any random variable $$X$$ we define the expected value (also called mean value) of $$X$$ as its average over the population $𝔼X=\sum_{y∈Ω} y\, ℙ(X=y)$ Notice that $$X$$ is a random variable but $$𝔼X$$ is not.

Sometimes, for a given random variable, we write $$\mu=𝔼X$$

## Generalizing

If $$f:ℝ\to ℝ$$ is a function, like for example $f(x) = x^2\qquad\text{or}\qquad f(x)=\sqrt{x}$ then we can get the expected value of $$f(X)$$ $𝔼\,f(X)=\sum_{y∈Ω} f(y)\, ℙ(X=y)$
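The two sums above translate almost directly into code. Here is a minimal sketch, assuming a fair six-sided die as the example distribution; the helper name `expected_value` is made up for this illustration.

```python
from fractions import Fraction

def expected_value(dist, f=lambda y: y):
    """Sum of f(y) * P(X = y) over all outcomes y of a finite distribution.

    `dist` maps each outcome y to its probability P(X = y).
    """
    return sum(f(y) * p for y, p in dist.items())

# Fair six-sided die (assumed example): each face has probability 1/6
die = {y: Fraction(1, 6) for y in range(1, 7)}

mean_die = expected_value(die)                       # E X
mean_square = expected_value(die, lambda y: y ** 2)  # E f(X) with f(x) = x^2
print(mean_die, mean_square)  # 7/2 91/6
```

Note that in this example $$𝔼(X^2)=91/6$$ is different from $$(𝔼X)^2=49/4$$, so in general $$𝔼\,f(X)$$ is not $$f(𝔼X)$$.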

## Expected value is linear

If $$X$$ and $$Y$$ are random variables, and $$\alpha$$ is any number, then

$𝔼(X + Y)=𝔼X + 𝔼Y$ $𝔼(α X)=α\, 𝔼X$

So, if $$α$$ and $$β$$ are fixed numbers, then

$𝔼(α X +\beta Y)=α\, 𝔼X +β\, 𝔼Y$

Exercise: prove it yourself
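A numerical check is not a proof, but it can help to see the identity at work before proving it. The sketch below assumes two fair dice as $$X$$ and $$Y$$ and picks arbitrary values for $$α$$ and $$β$$.

```python
from fractions import Fraction
from itertools import product

# Joint distribution of two fair dice X and Y (assumed example):
# 36 equally likely pairs
joint = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

def expect(g):
    """Expected value of g(X, Y) under the joint distribution."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

alpha, beta = 3, -2  # arbitrary fixed numbers

lhs = expect(lambda x, y: alpha * x + beta * y)                       # E(aX + bY)
rhs = alpha * expect(lambda x, y: x) + beta * expect(lambda x, y: y)  # a EX + b EY
print(lhs, rhs)  # 7/2 7/2
```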

## Variance of the population

The variance of the population is defined with the same idea as the sample variance $𝕍 X=𝔼(X-𝔼X)^2$ Notice that the variance has squared units

## Standard deviation of the population

In most cases it is more comfortable to work with the standard deviation of the population $\sigma=\sqrt{𝕍X}$

In that case the population variance can be written as $$\sigma^2$$

## Simple formula for population variance

We can rewrite the variance of the population as: $𝕍X=𝔼(X-𝔼X)^2=𝔼(X^2)-(𝔼X)^2$ because \begin{aligned} 𝔼(X-𝔼X)^2&=𝔼(X^2-2X\,𝔼X+(𝔼X)^2)\\ &=𝔼(X^2)-2𝔼(X\,𝔼X)+𝔼(𝔼X)^2 \end{aligned} but $$𝔼X$$ is not random, so $$𝔼(X\,𝔼X)=(𝔼X)^2$$ and $$𝔼(𝔼X)^2=(𝔼X)^2$$, which leaves $𝔼(X^2)-2(𝔼X)^2+(𝔼X)^2=𝔼(X^2)-(𝔼X)^2$
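Both expressions for the variance can be compared on a concrete case. This sketch again assumes a fair six-sided die; the names `var_definition` and `var_shortcut` are chosen only for this illustration.

```python
from fractions import Fraction

# Fair six-sided die (assumed example)
die = {y: Fraction(1, 6) for y in range(1, 7)}

def expect(f):
    """Expected value of f(X) for the die."""
    return sum(f(y) * p for y, p in die.items())

mu = expect(lambda y: y)                           # E X
var_definition = expect(lambda y: (y - mu) ** 2)   # E (X - E X)^2
var_shortcut = expect(lambda y: y ** 2) - mu ** 2  # E(X^2) - (E X)^2
print(var_definition, var_shortcut)  # 35/12 35/12
```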

## Variance is almost linear

If $$X$$ and $$Y$$ are two independent random variables, and $$\alpha$$ is a real number, then

• $$𝕍(X + Y)=𝕍 X + 𝕍 Y$$
• $$𝕍(α X)=α^2 𝕍 X$$

To prove the first equation we use that $$𝔼(XY)=𝔼X\,𝔼Y,$$ which is true when $$X$$ is independent of $$Y$$
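The two rules can be verified on a small case by enumerating all outcomes. The sketch assumes two fair, independent dice and an arbitrary value of $$α$$.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def variance(values):
    """Variance of a list of equally likely values: E(X^2) - (E X)^2."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return Fraction(sum(v * v for v in values), n) - mean ** 2

var_die = variance(list(faces))

# X + Y over the 36 equally likely pairs of two independent dice
var_sum = variance([x + y for x, y in product(faces, repeat=2)])

alpha = 3  # arbitrary fixed number
var_scaled = variance([alpha * x for x in faces])

print(var_die, var_sum, var_scaled)  # 35/12 35/6 105/4
```

If $$Y$$ were not independent of $$X$$, for example $$Y=X$$, the first rule would fail: $$𝕍(X+X)=𝕍(2X)=4\,𝕍X$$, not $$2\,𝕍X$$.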

# A sample is not the population

## Average of a sample

Let’s assume that we have a small sample $$𝐗=(X_1,…,X_n)$$

All $$X_i$$ are random variables taken from the same population.
We take their sample mean: $\text{mean}(𝐗)=\bar{𝐗}=\frac{1}{n}\sum_i X_i$ Since the sample is random, $$\bar{𝐗}$$ is also a random variable

What is the expected value of $$\bar{𝐗}$$?

## Expected value of sample mean

By definition of mean, we have $𝔼\,\text{mean}(𝐗) = 𝔼(\bar{𝐗})=𝔼\left(\frac{1}{n}\sum_i X_i\right)$ and since $$𝔼(α X+βY)=α𝔼(X)+β𝔼(Y),$$ we have $𝔼(\bar{𝐗})=𝔼\left(\frac{1}{n}\sum_i X_i\right)=\frac{1}{n}𝔼\sum_i X_i=\frac{1}{n}\sum_i𝔼 X_i$

## Independent from the same population

All outcomes in the sample are identically distributed because they come from the same population (which does not change with the sample)

Let’s assume that the outcomes are independent of each other

In that case we will have an i.i.d. sample, and

• All $$𝔼 X_i$$ are equal. We call them $$𝔼 X$$
• All $$𝕍 X_i$$ are equal. We call them $$𝕍 X$$
• $$X_i$$ is independent of $$X_j$$ when $$i≠j$$

## All $$X_i$$ are i.i.d.

Since all $$X_i$$ come from the same population, $$𝔼 X_i=𝔼 X$$ and $𝔼(\bar{𝐗})=\frac{1}{n}\sum_i𝔼 X=\frac{n}{n}𝔼 X=𝔼 X$

Good!

The expected value of the sample average is the expected value of the complete population
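For a small population this result can be checked exactly by listing every possible sample. The sketch below assumes i.i.d. samples of size `n = 3` taken from a fair six-sided die, so all $$6^n$$ samples are equally likely.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
n = 3  # sample size, chosen arbitrarily for the example

population_mean = Fraction(sum(faces), len(faces))

# Every possible sample (X_1, ..., X_n); equally likely under the i.i.d. assumption
samples = list(product(faces, repeat=n))

sample_means = [Fraction(sum(s), n) for s in samples]
expected_sample_mean = sum(sample_means) / len(samples)

print(population_mean, expected_sample_mean)  # 7/2 7/2
```

Individual sample means range from 1 to 6, but their average over all samples matches the population mean.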

## Sample variance

The variance of a set of numbers is easy to calculate

\begin{aligned}\text{var}(𝐗)& =\frac{1}{n}\sum_i (X_i-\bar{𝐗})^2\\ &=\frac{1}{n}\sum_i X_i^2-\bar{𝐗}^2\end{aligned}

(Remember: the average of squares minus the square of averages)

Since the sample is random, this is also a random variable.
What is its expected value?

## Expected value of sample variance

Since $$𝔼(α X+βY)=α𝔼(X)+β𝔼(Y),$$ we have

\begin{aligned} 𝔼\text{var}(𝐗)&=𝔼\left(\frac{1}{n}\sum_i X_i^2-\left(\frac{1}{n}\sum_i X_i\right)^2\right)\\ &=\frac{1}{n}\sum_i 𝔼\left(X_i^2\right)-\frac{1}{n^2}𝔼 \left(\left(\sum_i X_i\right)^2\right)\end{aligned}

## First part

Now, since the sample is i.i.d. we have $$𝔼 \left(X_i^2\right)=𝔼 \left(X^2\right)$$ and $\sum_i𝔼 \left(X_i^2\right)=n𝔼 \left(X^2\right)$ therefore $𝔼\text{var}(𝐗)=\frac{1}{n}n 𝔼\left(X^2\right)-\frac{1}{n^2}𝔼 \left(\sum_i X_i\right)^2$

## Second part

We can simplify the second part as $\left(\sum_i X_i\right)^2=\left(\sum_i X_i\right)\left(\sum_j X_j\right)=\sum_i \sum_j X_i X_j$ therefore $𝔼 \left(\sum_i X_i\right)^2=\sum_i \sum_j 𝔼 X_i X_j$

## Here we have two cases

If $$i=j,$$ we have $$𝔼 X_i X_j = 𝔼 (X_i^2)=𝔼 (X^2)$$. If $$i≠j,$$ since all outcomes are independent, we have $𝔼 X_i X_j = 𝔼(X_i)𝔼(X_j)=(𝔼X)^2$ There are $$n$$ pairs with $$i=j$$ and $$n(n-1)$$ pairs with $$i≠j,$$ therefore $𝔼\left(\sum_i X_i\right)^2= n 𝔼 (X^2) + n(n-1)(𝔼 X)^2$

## Putting it all together

\begin{aligned} 𝔼\text{var}(𝐗) &=\frac{1}{n}n𝔼 X^2-\frac{1}{n^2}(n 𝔼 (X^2) + n(n-1)(𝔼 X)^2)\\ & =\frac{1}{n}\left((n-1)𝔼 X^2-(n-1)(𝔼 X)^2\right)\\ & =\frac{n-1}{n}(𝔼 X^2-(𝔼 X)^2)\\ & =\frac{n-1}{n}𝕍X \end{aligned}

## In summary

We have found that $𝔼\text{var}(𝐗) = \frac{n-1}{n}𝕍X$ So the variance of the sample is not the variance of the population

Not even on average
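The factor $$\frac{n-1}{n}$$ can be observed exactly on a small example. The sketch below assumes i.i.d. samples of size `n = 3` from a fair six-sided die and averages the $$\frac{1}{n}$$ variance over all possible samples.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
n = 3  # sample size, chosen arbitrarily for the example

def var(values):
    """Variance with the 1/n formula: average of squares minus square of the average."""
    k = len(values)
    mean = Fraction(sum(values), k)
    return Fraction(sum(v * v for v in values), k) - mean ** 2

population_variance = var(list(faces))

# Every possible sample; equally likely under the i.i.d. assumption
samples = list(product(faces, repeat=n))
expected_sample_variance = sum(var(s) for s in samples) / len(samples)

print(population_variance, expected_sample_variance)  # 35/12 35/18
```

Here $$35/18 = \frac{2}{3}\cdot\frac{35}{12}$$, which agrees with the factor $$\frac{n-1}{n}$$ for $$n=3$$.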

## Sample variance is biased

If we want to estimate the mean $$𝔼X$$ of a population we can use the sample mean $$\bar{𝐗}$$

But if we want to estimate the variance $$𝕍X$$ of a population we cannot use the sample variance $$\text{var}(𝐗)$$

Instead we have to use a different formula $\hat{𝕍}(𝐗) = \frac{1}{n-1}\sum_i(X_i-\bar{𝐗})^2$

## Two formulas for variance

People use two formulas, depending on the case

• If you only care about the sample, its variance is $\text{var}(𝐗) =\frac{1}{n}\sum_i (X_i-\bar{𝐗})^2=\frac{1}{n}\sum_i X_i^2-\bar{𝐗}^2$

• If you care about the population, but only have a sample $\hat{𝕍}(𝐗) = \frac{1}{n-1}\sum_i(X_i-\bar{𝐗})^2 = \frac{n}{n-1}\text{var}(𝐗)$
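Python's standard `statistics` module provides both formulas: `pvariance` divides by $$n$$ and `variance` divides by $$n-1$$. The sketch below compares them on made-up numbers.

```python
import statistics

sample = [2, 4, 4, 5, 7]  # made-up numbers, only for illustration
n = len(sample)

var_n = statistics.pvariance(sample)         # divides by n: variance of these numbers
var_n_minus_1 = statistics.variance(sample)  # divides by n - 1: estimate for the population

print(var_n, var_n_minus_1)
print(n / (n - 1) * var_n)  # should agree with var_n_minus_1 up to rounding
```

Other libraries make different default choices, so it is worth checking which divisor a function uses before interpreting its result.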

## Summary

• When experiments produce numbers we can calculate average and variance

• The population has a fixed mean and variance, but most of the time we do not know their values

• If we have an i.i.d. sample we can estimate the population mean with the sample mean

• If the sample is not i.i.d., its mean may not correspond to the population mean

## If the sample is i.i.d., then

• The sample mean is probably close to the population mean, regardless of the probability distribution
• The sample variance is a biased estimate of the population variance: on average it is too small by a factor $$\frac{n-1}{n}$$
• To estimate the population variance we divide by $$n-1$$ instead of $$n$$