… but some are useful
We want to understand (a part of) Nature
That is, a large (maybe infinite) number of cases
We start by describing the population
but we do not know the whole population (yet)
and we want to know it
At least, the population mean \(\mu\) and variance \(\sigma^2\)
(we use Greek letters to represent population values)
The sample mean \(\overline{y}\) is a good estimator of the population mean \(\mu\)
We use the sample mean to estimate the population mean
Their probable difference is measured by the standard error of the mean
The Law of Large Numbers guarantees that, when the sample size \(n\) is big, the sample mean \(\overline{y}\) is close to the population mean \(\mu\) \[\frac{1}{n}\sum_{i=1}^{n} y_i \rightarrow \mu\qquad\text{in probability}\] but it does not say how to choose \(n\)
This result is only valid if each element \(y_i\) of the sample is chosen randomly
In particular, all values must be independent
For example, to estimate people's weight or height, it would be bad to take a sample only at a sumo competition or at a basketball championship
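The Law of Large Numbers can be seen in a small simulation. This is only a sketch: the population here is an assumed Normal with made-up values `MU = 170` and `SIGMA = 10`, and each draw is independent, as the result requires.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population values (assumptions, not from the slides)
MU, SIGMA = 170.0, 10.0

def sample_mean(n):
    """Mean of n independent random draws from the hypothetical population."""
    return statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))

# The distance between the sample mean and MU tends to shrink as n grows
errors = {n: abs(sample_mean(n) - MU) for n in (10, 1_000, 100_000)}
for n, err in errors.items():
    print(f"n = {n:>6}: |ybar - mu| = {err:.3f}")
```

The error is random, so it does not have to decrease at every step, but for large \(n\) it is small with high probability.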
The Central Limit Theorem says that, when the sample size \(n\) is big, the sample mean \(\overline{y}\) approximately follows a Normal distribution \[\frac{1}{n}\sum_{i=1}^{n} y_i ∼ 𝒩\left(μ, \frac{σ²}{n}\right)\] This is good. If we know \(\sigma\) then we can make a confidence interval that will probably contain \(\mu\)
And if we increase \(n,\) we get better precision
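A rough way to check this is to draw many samples and look at how the sample means spread. The sketch below assumes a Uniform(0, 1) population (deliberately not Normal), whose mean is 0.5 and whose variance is 1/12. The sample sizes and number of repetitions are arbitrary choices.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Assumed population: Uniform(0, 1), with mean 1/2 and variance 1/12
MU = 0.5
SIGMA = math.sqrt(1 / 12)

def simulate(n, repeats=5_000):
    """Draw `repeats` samples of size n and summarize their sample means."""
    means = [statistics.fmean(random.random() for _ in range(n))
             for _ in range(repeats)]
    predicted_sd = SIGMA / math.sqrt(n)   # spread predicted by the theorem
    observed_sd = statistics.stdev(means)  # spread seen in the simulation
    # Fraction of sample means within 1.96 predicted standard deviations of MU
    inside = sum(abs(m - MU) <= 1.96 * predicted_sd for m in means) / repeats
    return observed_sd, predicted_sd, inside

results = {n: simulate(n) for n in (4, 16, 64)}
for n, (observed, predicted, inside) in results.items():
    print(f"n = {n:>2}: observed sd = {observed:.4f}, "
          f"sigma/sqrt(n) = {predicted:.4f}, inside = {inside:.3f}")
```

Multiplying \(n\) by 4 should roughly halve the spread of the sample means, which is what "better precision" means here.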
We can rewrite the last expression as \[z=\frac{\left(\frac{1}{n}\sum_{i=1}^{n} y_i - μ\right)}{σ/\sqrt{n}}∼ 𝒩(0, 1)\]
And we can look in a table (or ask the computer) for the cutoff values for any significance level
We need to write a formula in terms of \(\alpha\)
Let’s say we are looking for 95% confidence. We write \[1-\alpha=0.95\] Then we look for the value that leaves probability \(\alpha/2\) in the upper tail of a Normal(0,1), that is, its \(1-\alpha/2=0.975\) quantile
[1] 1.959964
The interval \([\overline{y}-1.96{σ/\sqrt{n}}, \overline{y}+1.96{σ/\sqrt{n}}]\) contains \(\mu\) with probability 95%
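The cutoff and the interval can be computed directly. The sketch below uses `NormalDist` from Python's standard `statistics` module. The function name `z_interval`, the `heights` values and `sigma=6.0` are made up for illustration, and \(\sigma\) is assumed to be known.

```python
import math
from statistics import NormalDist, fmean

def z_interval(sample, sigma, confidence=0.95):
    """Confidence interval for the population mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff of a Normal(0, 1)
    half_width = z * sigma / math.sqrt(len(sample))
    ybar = fmean(sample)
    return ybar - half_width, ybar + half_width

z_95 = NormalDist().inv_cdf(0.975)
print(round(z_95, 6))  # 1.959964

# Made-up sample and a made-up "known" sigma, only to show the call
heights = [168.0, 172.5, 181.0, 165.5, 177.0, 170.0, 174.5, 169.5]
low, high = z_interval(heights, sigma=6.0)
print(f"95% interval: [{low:.2f}, {high:.2f}]")
```

Changing `confidence` changes \(\alpha\), and with it the cutoff and the width of the interval.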
Unfortunately, we do not know the population variance \(\sigma^2\)
Can we estimate it? Maybe
\[\frac{1}{n}\sum_{i=1}^{n} (y_i -\mu)^2 \] but we do not know \(\mu\) either
We can only use values we get from the sample. And we already know that \(\overline{y}\) is close to \(\mu\)
So we can calculate the sample variance \[s_{n}^2=\frac{1}{n}\sum_{i=1}^{n} (y_i -\overline{y})^2 \] Does it work?
It almost works. We will find that, on average \[\frac{1}{n}\sum_{i=1}^{n} (y_i -\overline{y})^2 \approx\frac{n-1}{n}\sigma^2\] (this happens because we use the same data to get \(\overline{y}\) and \(s_n^2\))
To estimate the population variance we use \[s^2_{n-1}=\frac{1}{n-1}\sum_{i=1}^{n} (y_i -\overline{y})^2 \approx\sigma^2\] This is the expression used in most computer programs
It is not the sample variance. It is an estimate of the population variance based on the sample
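The factor \(\frac{n-1}{n}\) shows up clearly in a simulation with small samples. This sketch assumes a Normal population with a made-up variance `SIGMA2 = 4` and samples of size `N = 5`. In Python's `statistics` module, `pvariance` divides by \(n\) and `variance` divides by \(n-1\).

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

SIGMA2 = 4.0      # hypothetical population variance
N = 5             # small samples make the bias easy to see
REPEATS = 20_000

biased, corrected = [], []
for _ in range(REPEATS):
    sample = [random.gauss(0.0, SIGMA2 ** 0.5) for _ in range(N)]
    biased.append(statistics.pvariance(sample))    # divides by n
    corrected.append(statistics.variance(sample))  # divides by n - 1

avg_biased = statistics.fmean(biased)
avg_corrected = statistics.fmean(corrected)
print(f"average of s_n^2     = {avg_biased:.2f}  (theory: {(N - 1) / N * SIGMA2:.2f})")
print(f"average of s_(n-1)^2 = {avg_corrected:.2f}  (theory: {SIGMA2:.2f})")
```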
Now instead of \[z=\frac{\left(\frac{1}{n}\sum_{i=1}^{n} y_i - μ\right)}{σ/\sqrt{n}}∼ 𝒩(0, 1)\] we have \[t=\frac{\left(\frac{1}{n}\sum_{i=1}^{n} y_i - μ\right)}{s_{n-1}/\sqrt{n}}∼ Student(n-1)\]
We cheated by using the sample twice: once for the mean and once to estimate the population variance
The price we pay is to use Student instead of Normal
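A simulation shows why the price must be paid. In this sketch the population is assumed Normal(0, 1), the samples have size 5, and the Student cutoff `T_CUTOFF = 2.776` (the 97.5% point for 4 degrees of freedom) is copied from a standard \(t\) table, since Python's standard library has no Student distribution.

```python
import math
import random
import statistics

random.seed(4)  # fixed seed so the sketch is reproducible

MU, SIGMA, N = 0.0, 1.0, 5  # hypothetical population and sample size
Z_CUTOFF = statistics.NormalDist().inv_cdf(0.975)
T_CUTOFF = 2.776  # 97.5% point of Student(4), taken from a t table (assumption)

def covers(cutoff, sample):
    """True if the interval ybar +/- cutoff * s/sqrt(n) contains MU."""
    ybar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))  # uses s_{n-1}
    return abs(ybar - MU) <= cutoff * se

REPEATS = 20_000
hits_z = hits_t = 0
for _ in range(REPEATS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    hits_z += covers(Z_CUTOFF, sample)
    hits_t += covers(T_CUTOFF, sample)

coverage_z = hits_z / REPEATS
coverage_t = hits_t / REPEATS
print(f"coverage with Normal cutoff:  {coverage_z:.3f}")
print(f"coverage with Student cutoff: {coverage_t:.3f}")
```

With the Normal cutoff the intervals are too narrow and contain \(\mu\) less often than 95% of the time. The Student cutoff is larger and restores the promised coverage.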
We can fit a linear model to our sample \[y_i = \beta_0 + \sum_j \beta_j x_{ij} + e_i\] On one side, this describes the “average” relationship between \(x\) and \(y\) in the sample
On the other side, this can be used to estimate the “average” relationship between \(x\) and \(y\) in the population
Let’s say that there is a real relationship in the population, and \[y_i = \beta_0 + \sum_j \beta_j x_{ij} + e_i \quad\text{in the population}\] is correct. Since we have a sample, we will get different values \[y_i = \hat{\beta}_0 + \sum_j \hat{\beta}_j x_{ij} + e_i \quad\text{in the sample}\] We say that each \(\hat{\beta}_j\) is an estimator of \(\beta_j\)
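Since the samples are random, the \(\hat{\beta}_j\) coefficients will change from sample to sample
But, once standardized, they will follow a Student’s \(t\) distribution with some degrees of freedom \[\frac{\hat{\beta}_j - \beta_j}{\sigma_{\beta_j}}∼ Student(df)\] where \(\sigma_{\beta_j}\) is the standard error of \(\hat{\beta}_j\), estimated from the sample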
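To see the coefficients change from sample to sample, we can simulate a population where the true relationship is known. Everything in this sketch is an assumption made for illustration: one predictor, true values `BETA0 = 2` and `BETA1 = 0.5`, Normal noise, and samples of size 30. The helper `fit_line` uses the usual least-squares formulas for a single predictor.

```python
import random
import statistics

random.seed(5)  # fixed seed so the sketch is reproducible

# Hypothetical "real" relationship in the population
BETA0, BETA1, NOISE_SD = 2.0, 0.5, 1.0

def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = b0 + b1 * x."""
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def one_sample_slope(n=30):
    """Draw one random sample of size n and return the estimated slope."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [BETA0 + BETA1 * x + random.gauss(0, NOISE_SD) for x in xs]
    return fit_line(xs, ys)[1]

slopes = [one_sample_slope() for _ in range(2_000)]
print(f"true slope:         {BETA1}")
print(f"mean of estimates:  {statistics.fmean(slopes):.3f}")
print(f"sd of estimates:    {statistics.stdev(slopes):.3f}")
```

Each sample gives a different \(\hat{\beta}_1\), but the estimates cluster around the true value. Their spread is what the standard error measures.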
The basic question we want to answer is
what is the real value of \(\beta_j\)?
Now we can make a confidence interval for \(\beta_j\)
This can tell us if \(\beta_j≠0\) or if \(\beta_j>0\)
We just need to frame the question in the right way
Let’s say we count the number of decaying cells in brain tissue
We want to know the effect of diet and time on that number
Time can be represented by number of weeks
Diet must be encoded with 1’s and 0’s
Let us use the simple case of weight as a function of sex, \(weight(sex)\)
We want two values. It can be \[\begin{aligned} \beta_0 &= mean(Female)\\ \beta_1 &= mean(Male) \end{aligned}\] but that does not answer the question “are men heavier than women”
Instead we want \[\begin{aligned} \beta_0 &= mean(Female)\\ \beta_1 &= mean(Male)-mean(Female) \end{aligned}\] and now we can ask “is \(\beta_1>0\)?”
(we build a confidence interval for that)
The last system of equations can be written as \[\begin{aligned} \beta_0 &= mean(Female)\\ \beta_0 + \beta_1 &= mean(Male) \end{aligned}\] which, in matrix form, will be \[ \begin{pmatrix} 1 & 0\\ 1 & 1 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}= \begin{pmatrix} mean(Female)\\ mean(Male) \end{pmatrix} \]
This means that, for all Female individuals, we have the rows \[ \begin{pmatrix} 1 & 0\\ ⋮ & ⋮ \\ 1 & 0 \end{pmatrix} \]
and for Male individuals, we have \[ \begin{pmatrix} 1 & 1\\ ⋮ & ⋮ \\ 1 & 1 \end{pmatrix} \]
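Stacking these rows gives one matrix with a row per individual, and least squares on it recovers exactly the two values we wanted. The sketch below uses a made-up list `people` of (sex, weight) pairs and solves the normal equations \((X^TX)\beta = X^Ty\) by hand, which is easy here because there are only two columns.

```python
import statistics

# Made-up data: (sex, weight in kg)
people = [("F", 58.0), ("F", 63.5), ("F", 61.0),
          ("M", 74.0), ("M", 81.5), ("M", 77.0), ("M", 70.5)]

# One row per individual: [1, 0] for Female, [1, 1] for Male
X = [[1, 1 if sex == "M" else 0] for sex, _ in people]
y = [weight for _, weight in people]

# Entries of the 2x2 matrix X^T X and of the vector X^T y
a = sum(row[0] * row[0] for row in X)
b = sum(row[0] * row[1] for row in X)
d = sum(row[1] * row[1] for row in X)
p = sum(row[0] * yi for row, yi in zip(X, y))
q = sum(row[1] * yi for row, yi in zip(X, y))

# Solve the 2x2 system with the explicit inverse
det = a * d - b * b
beta0 = (d * p - b * q) / det
beta1 = (a * q - b * p) / det

mean_f = statistics.fmean(w for s, w in people if s == "F")
mean_m = statistics.fmean(w for s, w in people if s == "M")
print(f"beta0 = {beta0:.3f}   mean(Female) = {mean_f:.3f}")
print(f"beta1 = {beta1:.3f}   mean(Male) - mean(Female) = {mean_m - mean_f:.3f}")
```

With this encoding, asking whether men are heavier than women becomes asking whether \(\beta_1>0\).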