# Methodology of Scientific Research

## Probability Distribution

Last class we showed that the probabilities of all outcomes sum to 1: $ℙ(\{a_1\}) + ℙ(\{a_2\}) + … + ℙ(\{a_n\})=1$ If we know these values, we can calculate everything.

The set of values $p(a_i) = ℙ(\{a_i\})= ℙ(\textrm{outcome is exactly }a_i)$ for all $$i$$ is called the probability distribution.
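As a quick sanity check, we can encode a distribution in Python; the fair six-sided die here is an invented example:

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die
omega = [1, 2, 3, 4, 5, 6]
p = {a: Fraction(1, 6) for a in omega}  # p(a_i) for every outcome a_i

# The probabilities of all outcomes must sum to 1
assert sum(p.values()) == 1
```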

## The definition has two parts

This definition makes sense only if we agree on what all the possible outcomes are.

In other words, we must agree on what $$Ω$$ is

Then the probability distribution is a function $p: Ω → [0,1]$

Notice that there may be more than one way to define $$Ω$$

## Shuffling

The easiest case to study is shuffling a deck of cards

We shuffle the cards several times, until we can no longer know which card will come first

We are interested in the event “the next card will be green”

## Probabilities as proportions

Let’s assume that we know how many cards of each color are in the deck

There are $$n_c$$ cards of color $$c\in\{\text{red}, \text{green}, \text{blue}, \text{yellow}\}$$

There are $$N=∑ n_c$$ cards in total

If we do not have any solid reason to expect any particular order of cards, then each individual card has the same probability $$1/N$$

The probability of “first color $$c$$” is $ℙ(\textrm{color is }c)=\frac{n_c}{N}$
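As a sketch in Python (the deck composition below is an invented example):

```python
from fractions import Fraction

# Hypothetical deck: the counts n_c per color are an assumption
n = {"red": 10, "green": 15, "blue": 20, "yellow": 5}
N = sum(n.values())  # N = sum of n_c = 50 cards in total

# P(color is c) = n_c / N
prob = {c: Fraction(n_c, N) for c, n_c in n.items()}

print(prob["green"])  # 3/10, since 15/50
assert sum(prob.values()) == 1
```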

## Probability of the next card

We will continue drawing cards, so the proportions will change

Let’s say we got color $$c_1$$ in the first draw

Now we have $$N-1$$ cards in total, and there are $$n_{c_1}-1$$ cards of color $$c_1$$

The probability of “second color $$c$$” is

$ℙ(\textrm{second color is }c|\textrm{first color is }c_1)=\begin{cases} \frac{n_c}{N-1} &\textrm{if }c≠c_1\\ \frac{n_c-1}{N-1} &\textrm{if }c=c_1 \end{cases}$

It gets complicated
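The case distinction above can be sketched in Python (the deck composition is again an invented example):

```python
from fractions import Fraction

def second_color_prob(n, N, c, c1):
    """P(second color is c | first color is c1), drawing without replacement."""
    if c != c1:
        return Fraction(n[c], N - 1)
    return Fraction(n[c] - 1, N - 1)

# Hypothetical deck composition
n = {"red": 10, "green": 15, "blue": 20, "yellow": 5}
N = sum(n.values())  # 50 cards

# After drawing a green card, 14 of the remaining 49 cards are green
print(second_color_prob(n, N, "green", "green"))  # 2/7
print(second_color_prob(n, N, "red", "green"))    # 10/49
```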

## Making life easier

The formula applies when our measurement changes the experiment

There are two cases where the proportions do not change (at least approximately)

1. If every $$n_c$$ is large, then $$n_c-1≈ n_c$$ and $$N-1 ≈ N$$
• This is the case when we interview a few people from a large population
2. If we put the card back into the deck and shuffle it again
• This is called sampling with replacement

In practice, we often sample from a very large population and model it as sampling with replacement
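The approximation in case 1 is easy to see numerically; the population size and the 30% proportion of green below are invented values:

```python
from fractions import Fraction

# Assumption: a very large "deck" with a million cards, 30% of them green
N = 1_000_000
n_green = 300_000

with_replacement = Fraction(n_green, N)              # stays 3/10 on every draw
without_replacement = Fraction(n_green - 1, N - 1)   # after one green is drawn

print(float(with_replacement))     # 0.3
print(float(without_replacement))  # ≈ 0.2999993, almost the same
```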

## Independent, Identically Distributed

If we replace the card into the deck after we see it, we will have

$ℙ(\textrm{second color is }c|\textrm{first color is }c_1)=ℙ(\textrm{color is }c)=\frac{n_c}{N}$

Notice that this means that the second result is independent of the first result, and so on

Moreover, the distribution is identical in each case

This is a very important case, and we give it a name

Independent, Identically Distributed (i.i.d.)

## Simple case: a coin

• Has 2 sides: $$\Omega=\{\text{'Head'}, \text{'Tail'}\}$$
• Distribution given by $$ℙ(\text{'Head'})$$ and $$ℙ(\text{'Tail'})$$ such that $ℙ(\text{'Head'}) + ℙ(\text{'Tail'})= 1$

All can be reduced to the value $p=ℙ(\text{'Head'})$

We say that the probability distribution of the coin depends on the parameter $$p$$

(In math this is called a Bernoulli distribution)
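A Bernoulli coin is easy to simulate; the value $$p=0.6$$ here is an arbitrary illustrative choice:

```python
import random

# A coin with parameter p = P('Head'); p = 0.6 is an invented example
p = 0.6

def flip(p):
    """One Bernoulli trial: 'Head' with probability p, 'Tail' otherwise."""
    return 'Head' if random.random() < p else 'Tail'

random.seed(0)  # fixed seed so the run is reproducible
flips = [flip(p) for _ in range(100_000)]
estimate = flips.count('Head') / len(flips)
print(estimate)  # close to 0.6
```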

## Several coins

What is the probability that we get $$k$$ heads if we throw $$N$$ coins?

This happens to be one of the most useful cases for us

Let’s assume that all coins are i.i.d. with $$ℙ(\text{'Head'})=p$$

To simplify, we will call $$ℙ(\text{'Tail'})=q$$ so $$p+q=1$$

To understand this case, we should start with small values of $$N$$

## Two coins

• There is only one way to get 0 heads: TT
• this happens with probability $$q^2$$
• There are two ways of getting 1 head: HT and TH
• this happens with probability $$pq$$
• There is only one way to get 2 heads: HH
• this happens with probability $$p^2$$

we get $1⋅ q^2,\quad 2⋅ p q,\quad 1⋅ p^2$

## Three coins

• There is only 1 way to get 0 heads: TTT
• this happens with probability $$q^3$$
• There are 3 ways of getting 1 head: HTT, THT, and TTH
• this happens with probability $$pq^2$$
• There are 3 ways of getting 2 heads: THH, HTH, and HHT
• this happens with probability $$p^2q$$
• There is only one way to get 3 heads: HHH
• this happens with probability $$p^3$$

we get $1⋅ q^3,\quad 3⋅ pq^2,\quad 3⋅ p^2q,\quad 1⋅ p^3$
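Rather than listing the sequences by hand, we can enumerate them; a small Python sketch:

```python
from itertools import product
from collections import Counter

# Enumerate all 2**3 sequences of three coins and count heads in each
counts = Counter(seq.count('H') for seq in product('HT', repeat=3))

print(sorted(counts.items()))  # [(0, 1), (1, 3), (2, 3), (3, 1)]
# The coefficients 1, 3, 3, 1 match the formula above
```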

## This is like $$(a+b)^2$$

The rules of combination are the same as in the binomial theorem

We get $ℙ(k\textrm{ Heads in }N\textrm{ coins})= \binom{N}{k} p^k q^{(N-k)}$

The numbers $$\binom{N}{k}$$ are found in Pascal’s triangle

## Binomial formula

One way to remember it is to use the formula $(p+q)^N =\sum_{k=0}^N \binom{N}{k} p^k q^{(N-k)}$

This is why we call it Binomial distribution
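A sketch of the binomial distribution in code; $$N=10$$ and $$p=0.5$$ are illustrative choices:

```python
from math import comb

def binom_pmf(k, N, p):
    """P(k Heads in N coins) = C(N, k) p^k q^(N-k)."""
    q = 1 - p
    return comb(N, k) * p**k * q**(N - k)

# The probabilities over all k sum to 1, since (p + q)^N = 1^N = 1
N, p = 10, 0.5
total = sum(binom_pmf(k, N, p) for k in range(N + 1))
print(total)  # 1.0 (up to floating-point rounding)
```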

# When Outcomes are Numbers

## Random variables

The most important applications of probabilities are when the outcomes are numbers

More generally, we care about numbers that depend on the experiment outcome

• dice: $$⚀↦1, ⚁↦2,…,⚅↦6$$
• coins: Heads $$↦1$$, Tails $$↦0$$
• temperature
• number of cells
• anything we measure

## We can do math with numbers

If the outcomes are numbers, we can use them in formulas

For example, if coins are “Heads $$↦1$$ and Tails $$↦0$$”, then $ℙ(k\textrm{ Heads in }N\textrm{ coins})$ is the same as $ℙ\left(\sum_{i=1}^N X_i=k \,\middle|\, X_i \textrm{ are i.i.d. coins}\right)$

## Averages

In everyday life, if $$𝐱 = (x_1,…,x_N)$$ we have $\text{mean}(𝐱)=\bar{\mathbf x} = \frac{1}{N}\sum_i x_i$

## Using proportions

Now, if we count how many times each value appears, $n(x) = \textrm{number of times that }(x_i=x)$ then we can write $\text{mean}(𝐱)=\bar{\mathbf x} =\sum_x x \frac{n(x)}{N}$
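A quick numerical check, using an invented sample:

```python
from collections import Counter

# Hypothetical sample of N = 10 values
x = [2, 3, 3, 5, 2, 3, 5, 5, 5, 2]
N = len(x)

# Plain average
mean_direct = sum(x) / N

# Average via the proportions n(x)/N
counts = Counter(x)  # n(x) for each distinct value x
mean_props = sum(value * count / N for value, count in counts.items())

print(mean_direct, mean_props)  # both give 3.5
```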

In other words, to calculate the average we need to know the proportions

## Expected value - Mean value

For any random variable $$X$$ we define the expected value (also called mean value) of $$X$$ as its average over the population $𝔼X=\sum x\, ℙ(X=x)$ Notice that $$X$$ is a random variable but $$𝔼X$$ is not.

Generalizing, we can get the expected value of any function of $$X$$ $𝔼\,f(X)=\sum f(x)\, ℙ(X=x)$
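For a fair die, both expected values can be computed exactly:

```python
from fractions import Fraction

# A fair die: P(X = x) = 1/6 for x in 1..6
dist = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum of x * P(X = x)
EX = sum(x * p for x, p in dist.items())
print(EX)  # 7/2

# E[f(X)] = sum of f(x) * P(X = x), here with f(x) = x^2
EX2 = sum(x**2 * p for x, p in dist.items())
print(EX2)  # 91/6
```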

## Expected value is linear

If $$X$$ and $$Y$$ are two random variables, and $$\alpha$$ is a real number, then

$𝔼(X + Y)=𝔼X + 𝔼Y$ $𝔼(α X)=α\, 𝔼X$

So, if $$α$$ and $$β$$ are real numbers, then

$𝔼(α X +\beta Y)=α\, 𝔼X +β\, 𝔼Y$

Exercise: prove it yourself
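A numerical check (not a proof) with two independent fair dice; the values $$α=2$$ and $$β=-3$$ are arbitrary choices:

```python
from fractions import Fraction
from itertools import product

# Two independent fair dice X and Y; each pair (x, y) has probability 1/36
alpha, beta = Fraction(2), Fraction(-3)

E_comb = sum(Fraction(1, 36) * (alpha * x + beta * y)
             for x, y in product(range(1, 7), repeat=2))
EX = sum(Fraction(1, 6) * x for x in range(1, 7))
EY = EX  # Y has the same distribution as X

assert E_comb == alpha * EX + beta * EY  # linearity holds
print(E_comb)  # -7/2
```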

## Variance of the population

The variance of the population is defined with the same idea as the sample variance $𝕍 X=𝔼(X-𝔼X)^2$ Notice that the variance has squared units

In most cases it is more comfortable to work with the standard deviation $$\sigma=\sqrt{𝕍X}.$$

In that case the population variance can be written as $$\sigma^2$$

## Simple formula for population variance

We can rewrite the variance of the population with a simpler formula: $𝕍X=𝔼(X-𝔼X)^2=𝔼(X^2)-(𝔼X)^2$ because $𝔼(X-𝔼X)^2=𝔼(X^2-2X𝔼X+(𝔼X)^2)\\=𝔼(X^2)-2𝔼(X𝔼X)+𝔼(𝔼X)^2$ but $$𝔼X$$ is a non-random number, so $$𝔼(X𝔼X)=(𝔼X)^2$$ and $$𝔼(𝔼X)^2=(𝔼X)^2$$
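Checking the shortcut formula against the definition, again for a fair die:

```python
from fractions import Fraction

# Fair die: verify V X = E(X^2) - (E X)^2 against the definition
dist = {x: Fraction(1, 6) for x in range(1, 7)}

EX = sum(x * p for x, p in dist.items())
var_def = sum((x - EX)**2 * p for x, p in dist.items())       # E(X - EX)^2
var_short = sum(x**2 * p for x, p in dist.items()) - EX**2    # E(X^2) - (EX)^2

assert var_def == var_short
print(var_def)  # 35/12
```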

## Variance is almost linear

If $$X$$ and $$Y$$ are two independent random variables, and $$\alpha$$ is a real number, then

• $$𝕍(X + Y)=𝕍 X + 𝕍 Y$$
• $$𝕍(α X)=α^2 𝕍 X$$

To prove the first equation we use that $$𝔼(XY)=𝔼X\,𝔼Y,$$ which is true when $$X$$ is independent of $$Y$$
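A direct check of the first equation with two independent fair dice:

```python
from fractions import Fraction
from itertools import product

die = range(1, 7)
p = Fraction(1, 6)

EX = sum(p * x for x in die)
VX = sum(p * (x - EX)**2 for x in die)  # 35/12 for one fair die

# X + Y for two independent fair dice; each pair has probability 1/36
E_sum = sum(Fraction(1, 36) * (x + y) for x, y in product(die, repeat=2))
V_sum = sum(Fraction(1, 36) * (x + y - E_sum)**2
            for x, y in product(die, repeat=2))

assert V_sum == VX + VX  # V(X + Y) = V X + V Y for independent X, Y
print(V_sum)  # 35/6
```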