Class 6: Sample v/s population

Systems Biology

Andrés Aravena, PhD

October 19, 2021

Models

All models are wrong …

… but some are useful

What do we want to model?

We want to understand (a part of) Nature

That is, a large (maybe infinite) number of cases

We start by describing the population

Populations and Samples

Maybe it looks like this

but we don’t know (yet)

We only have a sample

Sample v/s population

we do not know the whole population

and we want to know it

At least, the population mean \(\mu\) and variance \(\sigma^2\)

(we use Greek letters to represent population values)

Sample mean v/s population mean

The sample mean \(\overline{y}\) is a good estimator of the population mean \(\mu\)

So we use the sample mean to estimate the population mean

The typical size of their difference is measured by the standard error of the mean, \(\sigma/\sqrt{n}\)

Math to the rescue

The Law of Large Numbers guarantees that, when the sample size \(n\) is big, the sample mean \(\overline{y}\) is close to the population mean \(\mu\) \[\frac{1}{n}\sum_{i=1}^{n} y_i \rightarrow \mu\qquad\text{in probability}\] but it does not say how to choose \(n\)
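
A small simulation can make this concrete. The sketch below assumes a hypothetical population that is Normal with \(\mu=170\) and \(\sigma=10\) (values invented for the illustration) and shows how the error of the sample mean tends to shrink as \(n\) grows

```r
# Hypothetical population: Normal with mean 170 and standard deviation 10
# (illustrative values, not taken from real data)
set.seed(1)
mu <- 170
sigma <- 10

sizes <- c(10, 100, 1000, 10000, 100000)

# for each sample size, draw one random sample and keep its mean
sample_means <- sapply(sizes, function(n) mean(rnorm(n, mean = mu, sd = sigma)))

# absolute error of each sample mean; it tends to get smaller as n grows
data.frame(n = sizes, sample_mean = sample_means, error = abs(sample_means - mu))
```

The error does not have to decrease at every step, because each sample is random. It only becomes small "probably"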

Important note!

This result is only valid if each element \(y_i\) of the sample is chosen randomly

In particular, all values must be independent

For example, it would be bad to take a sample only in a Sumo competition, or in a Basketball championship

More math to the rescue

The Central Limit Theorem says that, when the sample size \(n\) is big, the sample mean \(\overline{y}\) approximately follows a Normal distribution \[\frac{1}{n}\sum_{i=1}^{n} y_i \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)\] This is good. If we know \(\sigma\) then we can make a confidence interval that will probably contain \(\mu\)

And if we increase \(n,\) we get better precision
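
We can check this with a simulation. The sketch below assumes a hypothetical population that is clearly not Normal (Exponential with rate 1, so \(\mu=1\) and \(\sigma=1\)) and looks at the means of many samples of size 50

```r
# Hypothetical non-Normal population: Exponential with rate 1,
# so mu = 1 and sigma = 1 (illustrative choice)
set.seed(2)
n <- 50

# take 10000 independent samples of size n and keep each sample mean
many_means <- replicate(10000, mean(rexp(n, rate = 1)))

mean(many_means)   # should be close to mu = 1
sd(many_means)     # should be close to sigma / sqrt(n)
1 / sqrt(n)        # theoretical value, about 0.141
```

Drawing hist(many_means) should show a roughly bell-shaped histogram, even though the population itself is far from Normal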

Confidence interval

We can rewrite the last expression as \[z=\frac{\frac{1}{n}\sum_{i=1}^{n} y_i - \mu}{\sigma/\sqrt{n}}\sim \mathcal{N}(0, 1)\]

And we can look in a table (or ask the computer) for the cutoff values for any significance level

In R

We need to write a formula in terms of \(\alpha\)

Let’s say we are looking for 95% confidence. We write \[1-\alpha=0.95\] Then we look for the value that leaves probability \(\alpha/2\) in the upper tail of a Normal(0,1)

qnorm(0.05/2, lower.tail = FALSE)
[1] 1.959964

The interval \([\overline{y}-1.96\,\sigma/\sqrt{n},\ \overline{y}+1.96\,\sigma/\sqrt{n}]\) contains \(\mu\) with probability 95%
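
As a sketch, here is the whole calculation for a hypothetical sample of 25 values, assuming that \(\sigma=10\) is known (the data is simulated, only to have something to compute with)

```r
# Hypothetical sample: 25 values from a population whose sigma is assumed known
set.seed(3)
sigma <- 10
y <- rnorm(25, mean = 170, sd = sigma)
n <- length(y)

alpha <- 0.05
z <- qnorm(alpha / 2, lower.tail = FALSE)   # 1.959964 for 95% confidence

half_width <- z * sigma / sqrt(n)           # about 3.92 with these values
ci <- c(lower = mean(y) - half_width, upper = mean(y) + half_width)
ci
```

Changing alpha gives intervals for other confidence levels. A smaller alpha produces a wider interval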

Looking for the population variance

Unfortunately, we do not know the population variance \(\sigma^2\)

Can we estimate it? Maybe

\[\frac{1}{n}\sum_{i=1}^{n} (y_i -\mu)^2 \] but we do not know \(\mu\) either

Sample variance v/s population variance

We can only use values we get from the sample. And we already know that \(\overline{y}\) is close to \(\mu\)

So we can calculate the sample variance \[s_{n}^2=\frac{1}{n}\sum_{i=1}^{n} (y_i -\overline{y})^2 \] Does it work?

Sample variance v/s population variance (2)

It almost works. We will find that, on average \[\frac{1}{n}\sum_{i=1}^{n} (y_i -\overline{y})^2 \approx\frac{n-1}{n}\sigma^2\] (this happens because we use the same data to get \(\overline{y}\) and \(s_n^2\))
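
A simulation shows the effect. This sketch assumes a hypothetical Normal population with \(\sigma^2=4\) and very small samples (\(n=5\)), where the bias is easy to see

```r
set.seed(4)
n <- 5
sigma2 <- 4     # hypothetical population variance (standard deviation 2)

# variance with denominator n, computed around the sample mean
var_n <- function(y) mean((y - mean(y))^2)

# sample variance of 50000 independent samples of size n
s2_n <- replicate(50000, var_n(rnorm(n, mean = 0, sd = sqrt(sigma2))))

mean(s2_n)              # average of the sample variances, close to 3.2
(n - 1) / n * sigma2    # 3.2, the value predicted by the formula
```

On average the sample variance is smaller than \(\sigma^2=4\), by the factor \((n-1)/n\)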

Fixing it

To estimate the population variance we use \[s^2_{n-1}=\frac{1}{n-1}\sum_{i=1}^{n} (y_i -\overline{y})^2 \approx\sigma^2\] This is the expression used in most computer programs

It is not the sample variance. It is an estimate of the population variance based on the sample
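
In R, the function var() uses the \(n-1\) version. A quick check with made-up values

```r
y <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values, for illustration only
n <- length(y)

s2_n   <- sum((y - mean(y))^2) / n         # sample variance (divides by n)
s2_nm1 <- sum((y - mean(y))^2) / (n - 1)   # estimate of the population variance

s2_n                        # 4
s2_nm1                      # 32/7, about 4.571
all.equal(var(y), s2_nm1)   # TRUE: var() divides by n - 1
```

The same is true for sd(), which is the square root of var()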

Using it

Now instead of \[z=\frac{\frac{1}{n}\sum_{i=1}^{n} y_i - \mu}{\sigma/\sqrt{n}}\sim \mathcal{N}(0, 1)\] we have \[t=\frac{\frac{1}{n}\sum_{i=1}^{n} y_i - \mu}{s_{n-1}/\sqrt{n}}\sim \text{Student}(n-1)\]

Cheating and paying for it

We cheated by using the sample twice: once for the mean and once to estimate the population variance

The price we pay is that we must use the Student’s \(t\) distribution instead of the Normal
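
How big is the price? The sketch below compares the Normal and Student cutoffs for 95% confidence, and then builds the interval for a hypothetical simulated sample of 10 values. The last line uses t.test() only to confirm that R gives the same interval

```r
alpha <- 0.05
qnorm(alpha / 2, lower.tail = FALSE)    # Normal cutoff: 1.959964

# Student cutoffs are larger, especially for small samples
n <- c(5, 10, 30, 100)
t_cut <- qt(alpha / 2, df = n - 1, lower.tail = FALSE)
data.frame(n = n, t_cutoff = t_cut)     # about 2.78, 2.26, 2.05, 1.98

# confidence interval for the mean when sigma is unknown
set.seed(6)
y <- rnorm(10, mean = 170, sd = 10)     # hypothetical sample
half_width <- qt(alpha / 2, df = length(y) - 1, lower.tail = FALSE) *
  sd(y) / sqrt(length(y))
ci <- c(mean(y) - half_width, mean(y) + half_width)
ci
t.test(y)$conf.int                      # same interval, computed by R
```

The interval is wider than the Normal one, and the difference fades as \(n\) grows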

Application to Linear Models

Population v/s sample

We can fit a linear model to our sample \[y_i = \beta_0 + \sum_j \beta_j x_{ij} + e_i\] On one hand, this describes the “average” relationship between \(x\) and \(y\) in the sample

On the other hand, this can be used to estimate the “average” relationship between \(x\) and \(y\) in the population

Real v/s estimated coefficients

Let’s say that there is a real relationship in the population, and \[y_i = \beta_0 + \sum_j \beta_j x_{ij} + e_i \quad\text{in the population}\] is correct. Since we have a sample, we will get different values \[y_i = \hat{\beta}_0 + \sum_j \hat{\beta}_j x_{ij} + e_i \quad\text{in the sample}\] We say that each \(\hat{\beta}_j\) is an estimator of \(\beta_j\)

These coefficients are random

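Since the samples are random, the \(\hat{\beta}_j\) coefficients will change from sample to sample

But, once standardized, they will follow a Student’s \(t\) distribution with some degrees of freedom \[\frac{\hat{\beta}_j - \beta_j}{\sigma_{\beta_j}}\sim \text{Student}(df)\] where \(\sigma_{\beta_j}\) is the standard error of \(\hat{\beta}_j\), estimated from the sample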

We can answer questions

The basic question we want to answer is

what is the real value of \(\beta_j\)?

Now we can make a confidence interval for \(\beta_j\)

This can tell us if \(\beta_j\neq 0\) or if \(\beta_j>0\)

We just need to frame the question in the right way
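
A minimal sketch in R, assuming simulated data where the hypothetical “real” coefficients are \(\beta_0=2\) and \(\beta_1=0.5\) (values invented for the illustration). The function confint() gives the intervals, and the same numbers can be rebuilt by hand from the Student cutoff

```r
set.seed(7)
n <- 40
x <- runif(n, 0, 10)
# hypothetical "real" relationship in the population: beta_0 = 2, beta_1 = 0.5
y <- 2 + 0.5 * x + rnorm(n, sd = 1)

fit <- lm(y ~ x)
coef(fit)                    # estimates of beta_0 and beta_1
confint(fit, level = 0.95)   # 95% confidence intervals for both coefficients

# the same interval for beta_1, built by hand
est   <- coef(summary(fit))["x", "Estimate"]
se    <- coef(summary(fit))["x", "Std. Error"]
t_cut <- qt(0.025, df = df.residual(fit), lower.tail = FALSE)
c(est - t_cut * se, est + t_cut * se)
```

Here the degrees of freedom are \(n-2\), because two coefficients were estimated from the same sample. If the interval for \(\beta_1\) does not contain 0, the data supports \(\beta_1\neq 0\)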

Example

Let’s say we count the number of decaying cells in brain tissue

We want to know the effect of diet and time on that number

Time can be represented by number of weeks

Diet must be encoded with 1’s and 0’s
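
In R we do not need to write the 1’s and 0’s by hand. A sketch with a made-up experimental layout (the diet labels “control” and “treated” are hypothetical): if diet is a factor, model.matrix() builds the encoding

```r
# Hypothetical layout: 2 diets, each observed at weeks 1, 2 and 3
design <- data.frame(
  diet  = factor(c("control", "control", "control",
                   "treated", "treated", "treated")),
  weeks = c(1, 2, 3, 1, 2, 3)
)

# diet becomes a column of 0's and 1's; weeks stays as a number
X <- model.matrix(~ diet + weeks, data = design)
X
```

The column diettreated is 0 for the first level of the factor (here “control”) and 1 for the other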

How to encode (simple case)

Let me use the simple case of weight as a function of sex, \(weight(sex)\)

We want two values. It can be \[\begin{aligned} \beta_0 &= mean(Female)\\ \beta_1 &= mean(Male) \end{aligned}\] but that does not answer the question “are men heavier than women?”

Comparing Male and Female

Instead we want \[\begin{aligned} \beta_0 &= mean(Female)\\ \beta_1 &= mean(Male)-mean(Female) \end{aligned}\] and now we can ask “is \(\beta_1>0\)?”

(we build a confidence interval for that)
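
This is the encoding that lm() uses by default when sex is a factor. A sketch with made-up weights in kg (six invented individuals, only for illustration)

```r
# Made-up weights in kg, for illustration only
people <- data.frame(
  sex    = factor(c("Female", "Female", "Female", "Male", "Male", "Male")),
  weight = c(58, 62, 66, 72, 80, 76)
)

fit <- lm(weight ~ sex, data = people)
coef(fit)
# (Intercept) = mean(Female)              = 62
# sexMale     = mean(Male) - mean(Female) = 76 - 62 = 14

confint(fit)   # the row "sexMale" is the confidence interval for beta_1
```

If the whole interval for sexMale is above 0, the sample supports \(\beta_1>0\)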

How to encode

The last system of equations can be written as \[\begin{aligned} \beta_0 &= mean(Female)\\ \beta_0 + \beta_1 &= mean(Male) \end{aligned}\] which, in matrix form, will be \[ \begin{pmatrix} 1 & 0\\ 1 & 1 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}= \begin{pmatrix} mean(Female)\\ mean(Male) \end{pmatrix} \]

The design matrix

\[ \begin{pmatrix} 1 & 0\\ 1 & 1 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}= \begin{pmatrix} mean(Female)\\ mean(Male) \end{pmatrix} \] means that every Female individual in the sample contributes a row \((1, 0)\) to the design matrix, so for all Female samples we have \[ \begin{pmatrix} 1 & 0\\ ⋮ & ⋮ \\ 1 & 0 \end{pmatrix} \]

The design matrix (2)

and for Male individuals, we have \[ \begin{pmatrix} 1 & 1\\ ⋮ & ⋮ \\ 1 & 1 \end{pmatrix} \]
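
We can ask R to show the design matrix. A sketch for five hypothetical individuals, followed by the small \(2\times 2\) system solved with invented group means

```r
# five hypothetical individuals
sex <- factor(c("Female", "Female", "Male", "Male", "Female"))

X <- model.matrix(~ sex)
X   # Female rows are (1, 0), Male rows are (1, 1)

# the 2x2 system from the slides, with invented means 62 and 76
A <- matrix(c(1, 1,
              0, 1), nrow = 2)   # filled by column: rows are (1, 0) and (1, 1)
beta <- solve(A, c(62, 76))
beta   # beta_0 = 62, beta_1 = 14
```

The first column is always 1 and corresponds to \(\beta_0\). The second column says when \(\beta_1\) must be added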

Homework

  • Write your data in Excel as we explained in the last class
  • Decide what questions you want to ask the model
  • See if you can encode the questions in a design matrix
  • Refresh your memory about matrices (I will send you a text)