Class 36: Statistics of Linear Models

Methodology of Scientific Research

Andrés Aravena, PhD

1 June 2023

Continuation of Class 20

Modeling weight versus height and hand span

Two simple models

We want two models for weight_kg

One based in height_cm \[\text{weight_kg} = β_0 + β_1⋅\text{height_cm}\]
One based in hand_span \[\text{weight_kg} = β_0 + β_1⋅\text{hand_span}\]

Obviously, \(β_0\) and \(β_1\) on each mode l are different.
We can use SLOPE() and INTERCEPT().

Another Spreadsheet function

Besides SLOPE() and INTERCEPT(), there is another Excel function for linear models

We can get both at the same time using LINEST()

This is an advanced function that returns several values

It has some optional parameters that we will study today

Let’s compare with the previous results

Advanced use of `LINEST()`

LINEST(known_data_y, known_data_x, [calculate_b], [verbose])

known_data_y – range containing dependent (y) values

known_data_x – The values of the independent variables

calculate_b – [ TRUE by default ] – Given a linear form of y = m*x+b, calculates the y-intercept (b) if TRUE. Otherwise, forces b to be 0 and only calculates the m values, i.e. forces the curve fit to pass through the origin

verbose – [ FALSE by default ] – A flag specifying whether to return additional regression statistics

`LINEST()` additional statistics

Coefficients of independent variable and the y-intercept
The standard error for each coefficient and the intercept,
The coefficient of determination (\(R^2\))
Standard error for the dependent variable values
The F statistic (for ANOVA)
The degrees of freedom
The regression sum of squares
The residual sum of squares

Combining both models

We have two ways to predict weight_kg

We want to combine them

\[\text{weight_kg} = β_0 + β_1⋅\text{hand_span} + β_2⋅\text{height_cm}\]

Now we have to find three parameters

What are the conditions for these parameters?
How can we find them?

Finding all model’s parameters

The formula for \((β_0, β_1, β_2)\) is complex. We will not teach it

Instead, we will use the LINEST() function

This is the main advantage of LINEST()

LINEST() can model using several independent variables

Coefficients follow a Normal distribution

With more time we could show that the regression coefficients are made by summing and scaling the data

The sum of independent things will eventually follow a Normal distribution

Moreover, we often assume that the noise of each observation is also Normal

Therefore the coefficients follow a Normal distribution

We do not know \(σ\)

As in the previous class, we must estimate \(σ\) with the sample variance

(In this case, LINEST() provides the sample standard error)

Therefore, we can study the coefficients using the Student’s t-distribution

Hypothesis test

The first question that people asks is “which coefficients are non-zeo?”

We answer that question with a hypothesis test

We define what do we want to test, and what is the alternative \[\begin{aligned} H_0:& β_i = 0\\ H_a:& β_i ≠ 0\\ \end{aligned}\]

Coefficient estimators

To evaluate this, we remember that \[T_i = \frac{β_i}{\text{stderr}(β_i)}\] follows a t-distribution with the degrees of freedom given by \[df = N - \text{number of coefficients}\]

We lose one degree of freedom for each coefficient

(remember that the mean is a linear model with one coefficient)

\(p\)-value of each coefficient

We do the regression and get a specific value \(t_i\)

Now we can evaluate \[ℙ(\text{abs}(t_i) ≥ \text{abs}(T_i) |H_0)\] using 2*(1-T.DIST(t))

(Alternative: we can use T.DIST.2T(t))

Confidence intervals

For a confidence level \((1-α)\) we need to find \(k\) such that \[ℙ(\text{abs}(t)≤k⋅s) = (1-α)\] thus \(ℙ(\text{abs}(t)≥k⋅s) = α\) and \[ℙ(t≥k⋅s) = α/2\] We evaluate this using T.INV(α/2, DF)

Exercise

Calculate \(p\)-values and confidence intervals for all coefficients in the example regressions