Modeling weight versus height and hand span

We want two models for **weight_kg**

- One based on **height_cm**: \[\text{weight_kg} = β_0 + β_1⋅\text{height_cm}\]
- One based on **hand_span**: \[\text{weight_kg} = β_0 + β_1⋅\text{hand_span}\]

Obviously, \(β_0\) and \(β_1\) are different in each model.

We can use `SLOPE()` and `INTERCEPT()`.
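These two spreadsheet functions compute the ordinary least-squares line. As a sketch of what they do, here is the same calculation in plain Python on invented data (the numbers are illustrative, not the class dataset):

```python
# What SLOPE() and INTERCEPT() compute: the least-squares line y = b0 + b1*x.
# Illustrative data (not the real class dataset).
height_cm = [150, 160, 170, 180, 190]
weight_kg = [55, 60, 68, 77, 85]

n = len(height_cm)
mean_x = sum(height_cm) / n
mean_y = sum(weight_kg) / n

# SLOPE(): covariance of x and y divided by the variance of x
b1 = sum((x - mean_x) * (y - mean_y)
         for x, y in zip(height_cm, weight_kg)) \
     / sum((x - mean_x) ** 2 for x in height_cm)
# INTERCEPT(): the fitted line passes through the point of means
b0 = mean_y - b1 * mean_x

print(b1, b0)  # b1 ≈ 0.77, b0 ≈ -61.9
```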

Besides `SLOPE()` and `INTERCEPT()`, there is another Excel function for linear models

We can get both at the same time using `LINEST()`

This is an advanced function that returns several values

It has some optional parameters that we will study today

Let’s compare with the previous results

`LINEST()`

`LINEST(known_data_y, known_data_x, [calculate_b], [verbose])`

- `known_data_y` – range containing the dependent (`y`) values
- `known_data_x` – range containing the values of the independent variables
- `calculate_b` – [ TRUE by default ] – Given a linear form of `y = m*x+b`, calculates the y-intercept (`b`) if TRUE. Otherwise, forces `b` to be 0 and only calculates the `m` values, i.e. forces the curve fit to pass through the origin
- `verbose` – [ FALSE by default ] – A flag specifying whether to return additional regression statistics
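To illustrate the `calculate_b` flag: with `calculate_b = FALSE` the model is forced through the origin, and the least-squares slope then has a simpler closed form. A minimal Python sketch with invented data:

```python
# calculate_b = FALSE fits y = m*x (no intercept term): minimizing the
# squared residuals gives m = Σ(x·y) / Σ(x²).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

m_origin = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
print(m_origin)
```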

`LINEST()` additional statistics:

- Coefficients of the independent variables and the y-intercept
- The standard error for each coefficient and the intercept
- The coefficient of determination (\(R^2\))
- The standard error for the dependent variable values
- The F statistic (for ANOVA)
- The degrees of freedom
- The regression sum of squares
- The residual sum of squares
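With `verbose = TRUE`, these statistics all come from the residuals of the fit. The following Python sketch (illustrative data, standard formulas) shows how the main ones relate to each other for a simple one-variable model:

```python
# How the verbose LINEST() statistics fit together, for y = b0 + b1*x.
x = [150, 160, 170, 180, 190]   # illustrative heights
y = [55, 60, 68, 77, 85]        # illustrative weights

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) \
     / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

pred = [b0 + b1 * a for a in x]
ss_resid = sum((b - p) ** 2 for b, p in zip(y, pred))  # residual sum of squares
ss_total = sum((b - my) ** 2 for b in y)
ss_reg = ss_total - ss_resid                # regression sum of squares
r2 = ss_reg / ss_total                      # coefficient of determination R²
df = n - 2                                  # two coefficients were fitted
f_stat = ss_reg / (ss_resid / df)           # F statistic (1 regression df)

print(r2, df)
```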

We have two ways to predict **weight_kg**

We want to combine them

\[\text{weight_kg} = β_0 + β_1⋅\text{hand_span} + β_2⋅\text{height_cm}\]

Now we have to find three parameters

**What are the conditions for these parameters? How can we find them?**

The formula for \((β_0, β_1, β_2)\) is complex, so we will not derive it here. Instead, we will use the `LINEST()` function

This is the main advantage of `LINEST()`: it can model using several independent variables
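As a sketch of what `LINEST()` does with two independent variables, here is the least-squares fit solved through the normal equations in plain Python. The data are invented and generated exactly from \(\text{weight} = -50 + 2⋅\text{hand} + 0.5⋅\text{height}\), so the fit should recover those three coefficients:

```python
# Multiple regression, as LINEST() performs it with several x columns:
# solve the normal equations (XᵀX) β = Xᵀy for β = (β0, β1, β2).
hand_span = [18, 20, 19, 22, 21]            # illustrative data
height_cm = [150, 155, 170, 180, 190]
weight_kg = [61, 67.5, 73, 84, 87]          # exactly -50 + 2*hand + 0.5*height

X = [[1.0, h, H] for h, H in zip(hand_span, height_cm)]
y = weight_kg
k = 3                                       # number of coefficients

# Build A = XᵀX and b = Xᵀy
A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]

# Gaussian elimination with partial pivoting
for col in range(k):
    piv = max(range(col, k), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    b[col], b[piv] = b[piv], b[col]
    for r in range(col + 1, k):
        f = A[r][col] / A[col][col]
        for c in range(col, k):
            A[r][c] -= f * A[col][c]
        b[r] -= f * b[col]

# Back substitution
beta = [0.0] * k
for r in range(k - 1, -1, -1):
    beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]

print(beta)  # ≈ [-50.0, 2.0, 0.5]
```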

With more time we could show that the regression coefficients are obtained by summing and scaling the data

A sum of many independent quantities tends to follow a Normal distribution (the Central Limit Theorem)

Moreover, we often assume that the *noise* of each observation
is also Normal

Therefore the coefficients follow a Normal distribution

As in the previous class, we must estimate \(σ\) with its sample counterpart

(In this case, `LINEST()` provides the sample standard error)
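The standard error that `LINEST()` reports next to each coefficient comes from this estimated noise level. For the one-variable case the formula is short; a Python sketch with invented data:

```python
# Standard error of the slope in y = b0 + b1*x:
#   s² = SS_resid / (n - 2) estimates the noise variance σ²,
#   stderr(b1) = sqrt(s² / Σ(x - x̄)²).
x = [150, 160, 170, 180, 190]   # illustrative data
y = [55, 60, 68, 77, 85]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
b0 = my - b1 * mx

ss_resid = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s2 = ss_resid / (n - 2)          # estimated noise variance
stderr_b1 = (s2 / sxx) ** 0.5

print(stderr_b1)
```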

Therefore, we can study the coefficients using the Student’s t-distribution

The first question that people ask is “which coefficients are non-zero?”

We answer that question with a hypothesis test

We define what we want to test, and what the alternative is \[\begin{aligned} H_0:& β_i = 0\\ H_a:& β_i ≠ 0\\ \end{aligned}\]

To evaluate this, we remember that \[T_i = \frac{β_i}{\text{stderr}(β_i)}\] follows a t-distribution with the degrees of freedom given by \[df = N - \text{number of coefficients}\]

We lose one degree of freedom for each coefficient

(remember that the mean is a linear model with one coefficient)

We do the regression and get a specific value \(t_i\)

Now we can evaluate \[ℙ(\text{abs}(T_i) ≥
\text{abs}(t_i) \mid H_0)\] using `2*(1-T.DIST(ABS(t), df, TRUE))`

(Alternative: we can use `T.DIST.2T(ABS(t), df)`)
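Outside a spreadsheet, the same two-sided \(p\)-value can be computed from the identity \(ℙ(\text{abs}(T)≥t) = I_{ν/(ν+t^2)}(ν/2, 1/2)\), where \(I\) is the regularized incomplete beta function. A self-contained Python sketch (the continued-fraction evaluation follows the classic Numerical Recipes scheme):

```python
import math

def betacf(a, b, x):
    """Continued fraction for the regularized incomplete beta function."""
    MAXIT, EPS, FPMIN = 200, 3e-12, 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    if abs(d) < FPMIN:
        d = FPMIN
    d = 1.0 / d
    h = d
    for m in range(1, MAXIT + 1):
        m2 = 2 * m
        # even step
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < FPMIN: d = FPMIN
        c = 1.0 + aa / c
        if abs(c) < FPMIN: c = FPMIN
        d = 1.0 / d
        h *= d * c
        # odd step
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < FPMIN: d = FPMIN
        c = 1.0 + aa / c
        if abs(c) < FPMIN: c = FPMIN
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < EPS:
            break
    return h

def betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0 or x >= 1.0:
        return max(0.0, min(1.0, x))
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * betacf(a, b, x) / a
    return 1.0 - bt * betacf(b, a, 1.0 - x) / b

def t_pvalue_two_sided(t, df):
    """P(|T| >= |t|) for a Student t variable with df degrees of freedom."""
    return betai(df / 2.0, 0.5, df / (df + t * t))

print(t_pvalue_two_sided(2.228, 10))  # ≈ 0.05
```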

For a confidence level \((1-α)\) we
need to find \(k\) such that \[ℙ(\text{abs}(T)≤k) = (1-α)\] thus \(ℙ(\text{abs}(T)≥k) = α\) and, by symmetry, \[ℙ(T≥k) = α/2\] We evaluate \(k\) using
`T.INV(1-α/2, DF)` (equivalently, `-T.INV(α/2, DF)`). The interval is then \(β_i ± k⋅\text{stderr}(β_i)\)
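Putting it together, a small Python sketch with invented numbers (the critical value 2.228 is the two-tailed 95% point for 10 degrees of freedom, as `T.INV(1-0.05/2, 10)` would return):

```python
# 95% confidence interval for a coefficient: beta ± k * stderr(beta).
beta_hat = 0.77   # illustrative fitted coefficient
stderr = 0.05     # illustrative standard error, as reported by LINEST()
k = 2.228         # t critical value for alpha = 0.05, df = 10

lower = beta_hat - k * stderr
upper = beta_hat + k * stderr
print(lower, upper)  # ≈ 0.6586 and 0.8814
```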

Calculate \(p\)-values and confidence intervals for all coefficients in the example regressions