# Methodology of Scientific Research

## How many people to interview?

If you do not have access to the age distribution and know only the standard deviation:

• How many people do you need to interview to estimate the average age of the Turkish population with a margin of error of 5 years?
• … of 1 year?
• … of 1 month?
• What is the probability that the real value is inside the intervals you have found?

## Sample mean vs. population mean

We are looking for the population mean $$\mathbb{E}X$$

We know the population variance $$\mathbb{V}X$$

We interview $$n$$ people and calculate the sample mean $$\bar{X}$$

## A confidence interval

The population average is probably in the interval $\left[\bar{X}-c\sqrt{\mathbb{V}(X)/n}, \bar{X}+c\sqrt{\mathbb{V}(X)/n}\right]$

Using Chebyshev's inequality, we know that the probability is at least $$1-1/c^2$$
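As a quick check of the bound $$1-1/c^2$$, the following minimal sketch evaluates it for a few values of $$c$$. The function name `chebyshev_coverage` is only an illustrative choice.

```python
def chebyshev_coverage(c: float) -> float:
    """Lower bound on P(|X - E X| < c * sd) given by Chebyshev's inequality."""
    if c <= 0:
        raise ValueError("c must be positive")
    # The bound 1 - 1/c^2 is only informative for c > 1; clip at 0 otherwise.
    return max(0.0, 1.0 - 1.0 / c**2)


for c in (2, 3, 5, 10):
    print(f"c = {c:>2}: probability at least {chebyshev_coverage(c):.0%}")
```

A larger $$c$$ buys more certainty, but the interval grows in proportion to $$c$$.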

## Interval width

We want the interval width to be less than 5 (or 1, or 1/12) years

Let's call the target width $$y$$. We need $2c\sqrt{\mathbb{V}(X)/n}\le y$ therefore $n\ge 4c^2\mathbb{V}(X)/y^2$

## Evaluating with known values

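Variance $$\mathbb{V}(X)$$ is 473.23 yr²

Number of people to interview for each interval width, from $n\ge 4c^2\mathbb{V}(X)/y^2$ rounded up:

| c | Probability (at least) | 5 yr | 1 yr | 1 month |
|---|---|---|---|---|
| 2 | 75% | 303 | 7,572 | 1,090,322 |
| 3 | 89% | 682 | 17,037 | 2,453,225 |
| 5 | 96% | 1,893 | 47,323 | 6,814,512 |
| 10 | 99% | 7,572 | 189,292 | 27,258,048 |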
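The sample sizes can be computed directly from $n\ge 4c^2\mathbb{V}(X)/y^2$. The sketch below does this with exact fractions, to avoid floating-point rounding just before taking the ceiling. The names `sample_size`, `VARIANCE` and `WIDTHS` are illustrative, and one month is taken as 1/12 of a year.

```python
from fractions import Fraction
from math import ceil


def sample_size(variance, c, width):
    """Smallest integer n such that 2 * c * sqrt(variance / n) <= width."""
    bound = 4 * Fraction(c) ** 2 * Fraction(variance) / Fraction(width) ** 2
    return ceil(bound)


VARIANCE = Fraction("473.23")  # yr^2, the value quoted in the text
WIDTHS = {"5 yr": Fraction(5), "1 yr": Fraction(1), "1 month": Fraction(1, 12)}

for c in (2, 3, 5, 10):
    row = ", ".join(
        f"{label}: {sample_size(VARIANCE, c, w):,}" for label, w in WIDTHS.items()
    )
    print(f"c = {c:>2} -> {row}")
```

Because $$n$$ scales with $$1/y^2$$, a 1-month width needs 144 times as many interviews as a 1-year width.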

# Insurance

Homework Question 1

## Statement

A company wants to offer insurance to protect against the economic damage of COVID-19.

• If a person takes the insurance, they pay $$x$$.
• If they get COVID in the next year, they are paid a fixed amount $$y$$.
  • This happens with probability $$p$$.
• After one year, the company will have a net result $$R$$ corresponding to income minus expenses.
• Since expenses depend on how many people get sick, $$R$$ is a random variable.

## Questions

• What is the expected value of the net result $$R$$?
• What are the variance and standard deviation of the net result $$R$$?
• What is the interval that contains the real net result $$R$$ with 99% probability?

## Calculating net result $$R$$

We have $$n$$ people paying, and $$S$$ people getting sick. The result is $R=nx-Sy$ For our analysis, $$n$$, $$x$$ and $$y$$ are fixed, but $$S$$ is a random variable. Thus $$R$$ is a random variable.

We want to know $$\mathbb{E}(R)$$

How can we calculate it?

## Expected value of $$R$$

Using the definition of $$R$$ and the linearity of the expected value, we have $\mathbb{E}(R)=\mathbb{E}(nx-Sy)=nx-\mathbb{E}(S)y$ So we need to calculate $$\mathbb{E}(S)$$

What do we know about $$S$$?

## $$S$$ is the number of sick people

There are $$n$$ people, and each one can get sick with probability $$p$$

Each person is a “coin” that comes up “sick” with probability $$p$$

Thus $$S$$ is the sum of $$n$$ coins

## $$S$$ follows a Binomial distribution

Assuming that each person gets sick independently, then $S \sim \mathrm{Binom}(n,p)$ Therefore, we immediately know that $\mathbb{E}(S)=np\qquad \mathbb{V}(S)=np(1-p)$

• What is the expected value of the net result $$R$$? $\mathbb{E}(R)=nx-\mathbb{E}(S)y=nx-npy$
• What are the variance and standard deviation of the net result $$R$$? $\mathbb{V}(R)=\mathbb{V}(nx)+\mathbb{V}(-Sy)=0+\mathbb{V}(S)y^2=np(1-p)y^2$ so the standard deviation is $\sqrt{\mathbb{V}(R)}=y\sqrt{np(1-p)}$
• What is the interval that contains the real net result $$R$$ with 99% probability?
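The formulas for the expected value and variance of the net result translate into a few lines of code. This is a minimal sketch: the function name `net_result_moments` and the numbers used in the example call are hypothetical, chosen only for illustration.

```python
from math import sqrt


def net_result_moments(n, x, y, p):
    """Return (mean, variance, standard deviation) of R = n*x - S*y, S ~ Binom(n, p)."""
    mean = n * x - n * p * y
    variance = n * p * (1 - p) * y**2
    return mean, variance, sqrt(variance)


# Hypothetical portfolio: 1000 clients, premium 15, payout 100, sickness probability 0.1.
mean, var, sd = net_result_moments(n=1000, x=15.0, y=100.0, p=0.1)
print(f"E(R) = {mean:.2f}, V(R) = {var:.2f}, sd(R) = {sd:.2f}")
```

Note that the premium $$x$$ only shifts the mean. The spread of the result depends on $$n$$, $$p$$ and the payout $$y$$.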

## Confidence interval

After one year, the result $$R$$ will be, with probability at least $$1-1/c^2$$, somewhere in the interval $\left[\mathbb{E}(R)-c\sqrt{\mathbb{V}(R)}, \mathbb{E}(R)+c\sqrt{\mathbb{V}(R)}\right]$ That is $\left[nx-npy-cy\sqrt{np(1-p)}, nx-npy+cy\sqrt{np(1-p)}\right]$
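To see what such an interval looks like in practice, here is a small sketch that evaluates both limits. The function name `net_result_interval` and the example numbers are hypothetical. The value $$c=10$$ corresponds to the 99% Chebyshev guarantee.

```python
from math import sqrt


def net_result_interval(n, x, y, p, c):
    """Chebyshev interval [E(R) - c*sd(R), E(R) + c*sd(R)] for R = n*x - S*y."""
    center = n * x - n * p * y
    half_width = c * y * sqrt(n * p * (1 - p))
    return center - half_width, center + half_width


# Hypothetical portfolio: 1000 clients, premium 15, payout 100, p = 0.1.
low, high = net_result_interval(n=1000, x=15.0, y=100.0, p=0.1, c=10)
print(f"R is in [{low:.0f}, {high:.0f}] with probability at least 99%")
```

With these illustrative numbers the lower limit is negative, so this premium would not rule out a loss at the 99% level.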

How do we choose $$x$$ and $$y$$?

## How not to go broke

We want $$R\ge 0,$$ so the lower limit of the interval must not be negative $nx-npy-cy\sqrt{np(1-p)}\ge 0$ thus $\frac{x}{y}\ge p+c\sqrt{\frac{p(1-p)}{n}}$
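The second inequality follows from the first by moving the payout terms to the right-hand side and dividing by $$ny$$, which is positive:

$nx \ge npy + cy\sqrt{np(1-p)}$

$\frac{x}{y} \ge p + \frac{c\sqrt{np(1-p)}}{n} = p + c\sqrt{\frac{p(1-p)}{n}}$

So the premium-to-payout ratio must exceed the sickness probability $$p$$ by a safety margin that shrinks like $$1/\sqrt{n}$$.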

## With numbers

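Assuming $$p=0.1,$$ then $$x/y$$ must be at least

| c | Probability (at least) | n = 10 | n = 100 | n = 1,000 | n = 10,000 | n = 100,000 |
|---|---|---|---|---|---|---|
| 2 | 75% | 0.29 | 0.16 | 0.12 | 0.11 | 0.10 |
| 3 | 89% | 0.38 | 0.19 | 0.13 | 0.11 | 0.10 |
| 5 | 96% | 0.57 | 0.25 | 0.15 | 0.12 | 0.10 |
| 10 | 99% | 1.05 | 0.40 | 0.19 | 0.13 | 0.11 |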
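The minimum ratio $$x/y$$ can be recomputed for any $$p$$, $$n$$ and $$c$$ with the short sketch below. The function name `min_premium_ratio` is an illustrative choice.

```python
from math import sqrt


def min_premium_ratio(p, n, c):
    """Smallest x/y that keeps the lower Chebyshev limit of R non-negative."""
    return p + c * sqrt(p * (1 - p) / n)


for c in (2, 3, 5, 10):
    cells = "  ".join(
        f"{min_premium_ratio(0.1, n, c):.2f}" for n in (10, 100, 1000, 10_000, 100_000)
    )
    print(f"c = {c:>2}: {cells}")
```

As $$n$$ grows the ratio approaches $$p$$ itself, which is why large insurers can offer premiums close to the expected payout.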

# Can we do better?

## Chebyshev is pessimistic

We used Chebyshev's inequality, which makes no assumption about the shape of the distribution

But we have more information. We know that $$S$$ is a Binomial random variable

Therefore we can make better confidence intervals

## Calculating $$\mathbb{P}(S=k)$$

We know that $\mathbb{P}(S=k\mid n\text{ in total})=\binom{n}{k} p^k(1-p)^{n-k}$ We can calculate $$\binom{n}{k}$$ using Pascal's triangle, even in Excel
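The probability formula can be evaluated directly. The sketch below uses `math.comb` from the Python standard library for the binomial coefficient. The function name `binom_pmf` is illustrative, and this direct form is meant for moderate $$n$$, since very large coefficients no longer fit in a floating-point number.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(S = k) for S ~ Binom(n, p), straight from the formula."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Example: 10 clients, each sick with probability 0.1.
for k in range(4):
    print(f"P(S = {k}) = {binom_pmf(k, 10, 0.1):.4f}")
```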

## Binomial coefficient in Excel

Pascal's Triangle
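The same construction used in a spreadsheet, where each cell is the sum of the two cells above it, can be sketched in code. The function name `pascal_triangle` is illustrative. Row $$n$$, position $$k$$ holds $$\binom{n}{k}$$.

```python
def pascal_triangle(n_max):
    """Rows 0..n_max of Pascal's triangle; rows[n][k] is the binomial coefficient C(n, k)."""
    rows = [[1]]
    for _ in range(n_max):
        prev = rows[-1]
        # Each inner entry is the sum of the two entries directly above it.
        inner = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + inner + [1])
    return rows


for row in pascal_triangle(5):
    print(row)
```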

## Making the interval for $$S$$

$\mathbb{P}(S\le k)=\sum_{j=0}^k \mathbb{P}(S=j)$
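One way to turn the cumulative sum into an interval is to cut off at most half of the excluded probability in each tail. The sketch below does this. The names `binom_cdf` and `central_interval` are illustrative, and the equal-tail rule is an assumption of this example, not something fixed by the text.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(S = k) for S ~ Binom(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(S <= k), as the sum of P(S = j) for j = 0..k."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))


def central_interval(n, p, coverage=0.99):
    """Return (lo, hi) with P(S < lo) <= tail and P(S > hi) <= tail, tail = (1 - coverage) / 2."""
    tail = (1 - coverage) / 2
    cumulative = 0.0
    lo, hi = None, n
    for k in range(n + 1):
        cumulative += binom_pmf(k, n, p)
        if lo is None and cumulative > tail:
            lo = k  # first k whose lower tail exceeds the allowance
        if cumulative >= 1 - tail:
            hi = k  # first k that leaves at most `tail` above it
            break
    return lo, hi


lo, hi = central_interval(n=100, p=0.1)
print(f"With probability at least 99%, S is in [{lo}, {hi}]")
```

For comparison, Chebyshev with $$c=10$$ gives $$np\pm 10\sqrt{np(1-p)}$$, which for $$n=100$$ and $$p=0.1$$ is $$10\pm 30$$.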

## Use better tools

Good tools include functions to calculate the usual distributions

In Excel we have BINOM.DIST(k, n, p, cumulative)

In R we have pbinom() and dbinom()

# What happens when $$n$$ is big?

## A simple model

Now we have a coin $$X$$ with two possible outcomes: +1 and -1

To make life easy, we assume $$p=0.5$$

What are the expected value and variance of $$X$$?

## Throw the coin $$n$$ times

We throw the coin $$n$$ times, and we calculate $$Y$$, the sum of all $$X_i$$ $Y=\sum_{i=1}^n X_i$

What are the expected value and variance of $$Y$$?

## It is easy to see that

• $$Y$$ is basically a Binomial random variable
• $$\mathbb{E}Y = 0$$, because $$\mathbb{E}X = 0$$
• $$\mathbb{V}Y = n$$, because $$\mathbb{V}X = 1$$
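These three facts can be checked by hand, assuming the throws are independent. For a single coin:

$\mathbb{E}X = (+1)\cdot\tfrac{1}{2} + (-1)\cdot\tfrac{1}{2} = 0 \qquad \mathbb{V}X = \mathbb{E}X^2 - (\mathbb{E}X)^2 = 1 - 0 = 1$

For the sum, expected values always add, and variances add because the throws are independent:

$\mathbb{E}Y = \sum_{i=1}^n \mathbb{E}X_i = 0 \qquad \mathbb{V}Y = \sum_{i=1}^n \mathbb{V}X_i = n$

Finally, if $$B$$ counts how many throws gave +1, then $$B\sim \mathrm{Binom}(n, 1/2)$$ and $$Y = 2B - n$$, which is the sense in which $$Y$$ is "basically" Binomial.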

## Fix the variance to 1

Now consider $$Z_n=Y/\sqrt{n}$$

It is easy to see that $$\mathbb{E}Z_n = 0$$ and $$\mathbb{V}Z_n = 1$$ independent of $$n$$

In general, the possible values of $$Z_n$$ are not integers. When $$n$$ is not a perfect square, the nonzero values are not even rational

What happens with $$Z_n$$ when $$n$$ is really big?

## Central limit theorem

When $$n\to\infty,$$ the distribution of $$Z_n=\sum_i X_i/\sqrt{n}$$ will converge to a Normal distribution $\lim_{n\to\infty} Z_n \sim \mathrm{Normal}(0,1)$
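The convergence can be observed without simulation, because the distribution of the coin sum is known exactly. The sketch below compares the exact value of $$\mathbb{P}(Z_n\le 1)$$ with the Normal(0,1) value. It uses `statistics.NormalDist` from the Python standard library. The function name `coin_sum_cdf` is illustrative.

```python
from math import comb, floor, sqrt
from statistics import NormalDist


def coin_sum_cdf(z, n):
    """Exact P(Z_n <= z) for Z_n = Y / sqrt(n), where Y is a sum of n fair +1/-1 coins."""
    # Y = 2*B - n with B ~ Binom(n, 1/2), so Z_n <= z  <=>  B <= (z*sqrt(n) + n) / 2.
    k_max = min(floor((z * sqrt(n) + n) / 2), n)
    if k_max < 0:
        return 0.0
    return sum(comb(n, k) for k in range(k_max + 1)) / 2**n


normal_value = NormalDist().cdf(1.0)
for n in (10, 100, 1000):
    exact = coin_sum_cdf(1.0, n)
    print(f"n = {n:>4}: P(Z_n <= 1) = {exact:.4f}, Normal(0,1) gives {normal_value:.4f}")
```

The remaining gap for finite $$n$$ comes mostly from the fact that $$Z_n$$ is discrete while the Normal distribution is continuous.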

## More generally

If $$X_1,\dots,X_n$$ are independent, identically distributed random variables, with expected value $\mathbb{E}X_i=\mu\quad\text{for all }i$ and finite variance $\mathbb{V}X_i=\sigma^2\quad\text{for all }i$ then, when $$n$$ is large, the standardized sum is approximately Normal $\lim_{n\to\infty} \frac{\sum_{i=1}^n (X_i-\mu)}{\sigma\sqrt{n}} \sim \mathrm{Normal}(0,1)$

## In other words

If $$X_1,\dots,X_n$$ are independent, identically distributed random variables, with expected value $\mathbb{E}X_i=\mu\quad\text{for all }i$ and finite variance $\mathbb{V}X_i=\sigma^2\quad\text{for all }i$ then, when $$n$$ is large, the centered and rescaled sum is approximately Normal $\lim_{n\to\infty} \frac{\sum_{i=1}^n (X_i-\mu)}{\sqrt{n}} \sim \mathrm{Normal}(0, \sigma^2)$
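The general statement can be checked on a variable that is not a coin. The sketch below uses a fair six-sided die, which is an example chosen here and not taken from the text. It builds the exact distribution of the sum of $$n$$ rolls by repeated convolution and compares the standardized sum with Normal(0,1). The names `sum_distribution` and `standardized_cdf` are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def sum_distribution(pmf, n):
    """Exact pmf of the sum of n independent copies of a variable, as {value: probability}."""
    total = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for s, ps in total.items():
            for v, pv in pmf.items():
                nxt[s + v] = nxt.get(s + v, 0.0) + ps * pv
        total = nxt
    return total


def standardized_cdf(pmf, n, z):
    """Exact P((sum of X_i - n*mu) / (sigma*sqrt(n)) <= z)."""
    mu = sum(v * pv for v, pv in pmf.items())
    sigma = sqrt(sum((v - mu) ** 2 * pv for v, pv in pmf.items()))
    threshold = n * mu + z * sigma * sqrt(n)
    return sum(ps for s, ps in sum_distribution(pmf, n).items() if s <= threshold)


die = {face: 1 / 6 for face in range(1, 7)}  # fair die: an illustrative assumption
for n in (2, 10, 30):
    print(f"n = {n:>2}: exact {standardized_cdf(die, n, 1.0):.4f}, "
          f"Normal(0,1) {NormalDist().cdf(1.0):.4f}")
```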

## Noise is usually Normal

• Thermal noise is the sum of many small vibrations in all directions
  • They sum and usually cancel each other
• Phenotype depends on several genetic conditions
  • Height, weight and similar attributes depend on the combination of several attributes

## It does not always work

• Not all combined effects are sums
  • Some effects are multiplicative
• Some effects may not have finite variance
  • Sometimes the variance is infinite
• Not all effects are independent
  • This is the most critical issue