In the previous post we spoke about the uncertainty on a fixed value, and how taking averages can increase precision. We can even estimate the uncertainty of the average.

The ideal case is when we measure a value that does not change. In
the case of the temperature we had the extra complication that it
changed during the experiment. We had two components of variability: the
noise and the change in temperature. So we had a trade-off^{1}
between increasing the sample size to reduce the noise, and decreasing
the sample size to reduce the change on the real value.

A conclusion of this analysis is that every sampling should include several measurements. In other words, instead of measuring temperature once every time, we can measure it twenty times on every sampling event. If the twenty measurements are taken in a short time —let’s say, less than a second— we get several independent measurements from a value that does not change —we assume— during that short period of time. Then we can average the twenty samples and get a temperature value with smaller uncertainty. We also get an estimation of that uncertainty.

# Linear models

For some events we can use another hypothesis. Sometimes, from
theoretical backgrounds, we can assume that the real value lays on a
straight line. For example, the following graph shows the logarithm of
the number of COVID-19 cases in Chile^{2}
until March 23, 2020.

Theory says that, in this phase and without major health care
interventions, the logarithm of the number of cases should follow a
straight line^{3}. Using this hypothesis, we can
describe the relationship between cases and time with the formula \[\log(\text{cases}_i)=a+b\cdot\text{time}_i +
\varepsilon_i\qquad\text{for each }i\]

The coefficients \(a\) and \(b\) are what we are looking for. The values \(\varepsilon_i\) represent the noise, that is, the variability due to randomness. As in the case of the moving average, we assume that the noise values are random, have mean zero, finite variance, and are independent. In this case we use linear regression —also know as least squares— to find the best values for \(a\) and \(b\).

We can do the regression in Google Sheets or Excel by drawing a plot
of our data, and then asking for the *trend line*. Even better,
we can use the function `=LINEST()`

to calculate the linear estimator^{4}. In the simple case, we
write `=LINEST(known_data_y, known_data_x)`

. By default, this
function will return the coefficients \(b\) and \(a\). We can then use the formula to predict
new values.

In R we perform the regression using the function `lm()`

,
which returns a *linear model*.

```
<-lm(log(cases) ~ day, data=covid)
model model
```

```
Call:
lm(formula = log(cases) ~ day, data = covid)
Coefficients:
(Intercept) day
0.3008 0.3176
```

Then we use the function `coef()`

to find the coefficients
of the model. The coefficient \(a\) is
named `(Intercept)`

, and the coefficient \(b\) is named `log(cases)`

, since
it multiplies the variable with the same name.

`coef(model)`

```
(Intercept) day
0.3007812 0.3176057
```

Finally, we can use the model to predict new values using the
function `predict()`

.

```
plot(log(cases)~day, data=covid, pch=16)
lines(predict(model) ~ day, data=covid)
```

There are two basic questions to be asked about the regression:

- What is the confidence interval for the coefficients \(a\) and \(b\)?
- What is the confidence interval for the prediction?

In the case of COVID-19, this question can be rewritten as

- What is the confidence interval for the growth rate of the number of cases?
- What is the confidence interval for the predicted number of cases?

A correct evaluation of these intervals will allow us to evaluate if the health-care decisions are effective or not. This saves lives.

# Calculating the intervals

It should not come as a surprise that the width of all these intervals depends on the noise in the data. More specifically, it depends on the variance of \(\varepsilon_i.\) As in the previous case, we do not really know this value but we can estimate it from the data. Using this estimated variance, and some formulas that we will discuss somewhere else, we can calculate the standard error of each coefficient.

In Google Sheets we can get the detailed regression using the same
function `=LINEST()`

but this time using the value
`TRUE`

for the option `verbose`

. We write
`=LINEST(known_data_y, known_data_x, FALSE, TRUE)`

^{5}. The output has several components.
For now we care only about the first two rows. They are the coefficients
\(b\) and \(a,\) and their standard errors \(s_b\) and \(s_a\). In the fourth row we also have the
value of *degrees of freedom*. In this case, since we have 21
rows of data and we calculated 2 coefficients, then we have 21-2=19
degrees of freedom.

The intervals are always between the extremes \(mean ± k\cdot stderr\). The value of \(k\) depends on the confidence level. We
calculate the value of \(k\) using
`=T.INV(1-alpha, df)`

, where `alpha`

is
`(1-confidence)/2`

and `df`

is the number of
degrees of freedom.

In R is easier. We just use the following code:

`confint(model, level=0.99)`

```
0.5 % 99.5 %
(Intercept) 0.03487745 0.5666849
day 0.29642922 0.3387823
```

Therefore we have a range that contains the real growth factor with probability 99%, given the data and the assumptions.[^ {-} This will be important in other applications, such as differential gene expression analysis.]

The meaning of these intervals is the following. Any value of the
coefficients in the interval is compatible with the data. The calculated
coefficients are just the center of the interval. Therefore, with 99%
confidence, the number of cases may be equally described by any equation
like this \[
\log(\text{cases}_i) =(a+k_a s_a)+(b+k_b s_b)\cdot\text{time}_i +
\varepsilon_i
\] as long as \(k_a\) and \(k_b\) are between -3 and 3. More precisely,
between `-T.INV(0.995, 19)`

and
`T.INV(0.995, 19)`

[^ {-} You can see an illustration on the
sheet *“pred1”*.].

In R it is even easier. If we use the function `predict()`

with the option `interval="confidence"`

, it will give us a
3–column matrix with the predicted center, lower, and upper value for
each day. The following plot shows the idea

```
<- predict(model, interval = "confidence", level = 0.99)
valid_models plot(log(cases)~day, covid, pch=16)
lines(valid_models[,"fit"])
lines(valid_models[,"lwr"])
lines(valid_models[,"upr"])
```

It may be easier to understand the result if we *“undo the
logarithm”*, that is, use the *exponent* function.^{6}

```
plot(cases ~ day, covid, pch=16, ylim=range(exp(valid_models)))
lines(exp(valid_models[,"fit"]))
lines(exp(valid_models[,"lwr"]))
lines(exp(valid_models[,"upr"]))
```

## Uncertainty in the prediction.

We just have shown that the noise in the data results in a range of possible coefficients, and therefore in a set of linear models that are compatible with the data. Now we want to predict number of cases for new conditions.

In general, linear models are good at *interpolating*, that
is, finding intermediate values *between* the data points. In
this case we are interested in *extrapolating*, that is, finding
the number of cases *outside* the data points, in the future. And
evaluate their uncertainty.

One part of the uncertainty comes form the coefficients’ uncertainty, as described in the previous section. Now we have another component of the uncertainty: the noise.

Each point is random. The linear equation only predicts the center of
the interval where the points can be. Then the actual width depends on
the *standard error of the residuals*^{7}.
Thus, the prediction interval has to combine both uncertainties.

I do not know if this can be done easily in Excel or Google Sheets.
In R it is easy. We need to create a data frame with a column named
`day`

and the values where we want to predict. This will be
the option `newdata`

for the `predict()`

function.
We also use the option `interval = "prediction"`

. As before,
we get a 3–column matrix with center, lower and upper limit of the
prediction confidence interval.

```
<- data.frame(day=0:25)
to_predict <- predict(model, newdata = to_predict,
prediction interval = "prediction", level = 0.99)
plot(log(cases)~day, covid, pch=16,
xlim=range(to_predict), ylim=range(prediction))
lines(prediction[,"fit"], lty=1)
lines(prediction[,"lwr"], lty=2)
lines(prediction[,"upr"], lty=2)
```

Again, it is easier to understand the result if we use the
*exponent* function to calculate the number of cases.^{8}

```
plot(cases~day, covid, pch=16,
xlim=range(to_predict), ylim=range(exp(prediction)))
lines(exp(prediction[,"fit"]), lty=1)
lines(exp(prediction[,"lwr"]), lty=2)
lines(exp(prediction[,"upr"]), lty=2)
```

# Application

We can use a linear model to determine a confidence interval for the
real delay between samples in the *Temperature test
experiment*.

Can you do this analysis? At least, describe how to do it

{-} This tradeoff can be described as the

*uncertainty principle*. Increasing the sample size tells us*what*was the temperature but not*when*it happened. We cannot know the value and time of a temperature with high precision at the same time.↩︎If you know a good source of data for other countries, I will be glad to know.↩︎

Later we will study the models that justify this theory.↩︎

Be careful with this function, since the answer uses several cells, not only the one where we write the formula. This is one of the

*array*functions of Google Sheets, and they do not work as the regular functions.↩︎{-} The meaning of the third parameter of

`LINEST()`

will be discussed later.↩︎{-} We can see that the last value does not fit in the model. Why?↩︎

The residuals are the difference between the fitted model and the real data. Under our hypotheses, these residuals are \(\varepsilon_i\).↩︎

{-} Now the last value does fit in the model. Why?↩︎