Modeling weight versus height and hand span
We want two models for weight_kg
Obviously, \(β_0\) and \(β_1\) in each model are different.
We can use SLOPE() and INTERCEPT().
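As a sanity check, the same fit can be reproduced outside Excel. A minimal sketch in Python with numpy; the data values below are made up for illustration:

```python
import numpy as np

# Hypothetical sample data (height in cm, weight in kg)
height_cm = np.array([150, 160, 170, 180, 190], dtype=float)
weight_kg = np.array([50, 58, 66, 74, 82], dtype=float)

# Equivalent of Excel's SLOPE() and INTERCEPT():
# least-squares fit of weight = b1 * height + b0
b1, b0 = np.polyfit(height_cm, weight_kg, deg=1)
print(b1, b0)  # slope, then intercept
```

np.polyfit returns the coefficients from highest degree down, so the slope comes first, matching SLOPE() and INTERCEPT() respectively.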
Besides SLOPE() and INTERCEPT(), there is another Excel function for linear models
We can get both at the same time using LINEST()
This is an advanced function that returns several values
It has some optional parameters that we will study today
Let’s compare with the previous results
LINEST()
LINEST(known_data_y, known_data_x, [calculate_b], [verbose])
known_data_y – range containing the dependent (y) values
known_data_x – range containing the values of the independent variables
calculate_b – [ TRUE by default ] – Given a linear form of y = m*x + b, calculates the y-intercept (b) if TRUE. Otherwise, forces b to be 0 and only calculates the m values, i.e. forces the fitted line to pass through the origin
verbose – [ FALSE by default ] – A flag specifying whether to return additional regression statistics
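The effect of calculate_b can be seen by solving the same least-squares problem directly. A sketch with numpy (made-up data; the intercept is modeled as a column of ones in the design matrix):

```python
import numpy as np

height_cm = np.array([150, 160, 170, 180, 190], dtype=float)
weight_kg = np.array([50, 58, 66, 74, 82], dtype=float)

# calculate_b = TRUE: include a column of ones, so an intercept is fitted
X = np.column_stack([height_cm, np.ones_like(height_cm)])
(m, b), *_ = np.linalg.lstsq(X, weight_kg, rcond=None)

# calculate_b = FALSE: no column of ones, so the line is forced
# through the origin and only the slope is estimated
X0 = height_cm.reshape(-1, 1)
(m0,), *_ = np.linalg.lstsq(X0, weight_kg, rcond=None)

print(m, b)   # fit with intercept
print(m0)     # fit through the origin (a different slope)
```

Forcing the line through the origin changes the slope, because the best zero-intercept line is generally not the best free line.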
LINEST() additional statistics
Coefficients of the independent variables and the y-intercept
The standard error for each coefficient and the intercept
The coefficient of determination (\(R^2\))
Standard error for the dependent variable values
The F statistic (for ANOVA)
The degrees of freedom
The regression sum of squares
The residual sum of squares
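All of these statistics follow from the fitted values and residuals. A sketch of how each one is computed, using numpy and made-up data (LINEST's verbose output reports these same quantities):

```python
import numpy as np

# Hypothetical sample data
x = np.array([150, 160, 170, 180, 190], dtype=float)
y = np.array([52, 57, 67, 73, 81], dtype=float)

X = np.column_stack([x, np.ones_like(x)])      # slope column + intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta                                # fitted values

n, p = X.shape                                  # p = number of coefficients
df = n - p                                      # degrees of freedom
ss_resid = np.sum((y - y_hat) ** 2)             # residual sum of squares
ss_reg = np.sum((y_hat - y.mean()) ** 2)        # regression sum of squares
r2 = ss_reg / (ss_reg + ss_resid)               # coefficient of determination
se_y = np.sqrt(ss_resid / df)                   # std. error of the y estimate
f_stat = (ss_reg / (p - 1)) / (ss_resid / df)   # F statistic (for ANOVA)

# Standard error for each coefficient and the intercept
cov = ss_resid / df * np.linalg.inv(X.T @ X)
se_beta = np.sqrt(np.diag(cov))
```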
We have two ways to predict weight_kg
We want to combine them
\[\text{weight_kg} = β_0 + β_1⋅\text{hand_span} + β_2⋅\text{height_cm}\]
Now we have to find three parameters
What are the conditions for these parameters?
How can we find them?
The formula for \((β_0, β_1, β_2)\) is complex. We will not teach it
Instead, we will use the LINEST() function
This is the main advantage of LINEST()
LINEST() can model using several independent variables
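The multi-variable fit is the same least-squares problem with more columns in the design matrix. A sketch with numpy; the hand spans, heights, and weights below are invented for illustration:

```python
import numpy as np

# Hypothetical data: two predictors for weight_kg
height_cm = np.array([150, 160, 170, 180, 190], dtype=float)
hand_span = np.array([18, 21, 19, 22, 20], dtype=float)
weight_kg = np.array([57, 65.5, 66.5, 75, 76])

# Same problem LINEST solves with several x ranges:
# weight = b0 + b1*hand_span + b2*height
X = np.column_stack([np.ones_like(height_cm), hand_span, height_cm])
(b0, b1, b2), *_ = np.linalg.lstsq(X, weight_kg, rcond=None)
print(b0, b1, b2)
```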
With more time we could show that the regression coefficients are obtained by summing and scaling the data
The sum of many independent terms eventually follows a Normal distribution (by the Central Limit Theorem)
Moreover, we often assume that the noise of each observation is also Normal
Therefore the coefficients follow a Normal distribution
As in the previous class, we must estimate \(σ\) with the sample standard deviation
(In this case, LINEST() provides the sample standard error)
Therefore, we can study the coefficients using the Student’s t-distribution
The first question that people ask is “which coefficients are non-zero?”
We answer that question with a hypothesis test
We define what we want to test, and what is the alternative \[\begin{aligned} H_0:& β_i = 0\\ H_a:& β_i ≠ 0\\ \end{aligned}\]
To evaluate this, we remember that \[T_i = \frac{β_i}{\text{stderr}(β_i)}\] follows a t-distribution with the degrees of freedom given by \[df = N - \text{number of coefficients}\]
We lose one degree of freedom for each coefficient
(remember that the mean is a linear model with one coefficient)
We do the regression and get a specific value \(t_i\)
Now we can evaluate \[ℙ(\text{abs}(T_i) ≥ \text{abs}(t_i) |H_0)\] using 2*(1-T.DIST(ABS(t), df, TRUE))
(Alternative: we can use T.DIST.2T(ABS(t), df))
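The same two-sided p-value can be computed with scipy, whose t-distribution CDF mirrors Excel's cumulative T.DIST. The coefficient, standard error, and degrees of freedom below are made-up values for illustration:

```python
from scipy import stats

# Hypothetical values: a fitted coefficient, its standard error,
# and the residual degrees of freedom
beta_i, se_i, df = 0.74, 0.12, 3

t_i = beta_i / se_i                            # observed t statistic
# Two-sided p-value under H0: β_i = 0
p_value = 2 * (1 - stats.t.cdf(abs(t_i), df))  # = 2*(1-T.DIST(ABS(t),df,TRUE))
print(t_i, p_value)
```

A small p-value means the observed coefficient would be very unlikely if \(β_i\) were really zero, so we reject \(H_0\).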
For a confidence level \((1-α)\) we need to find \(k\) such that \[ℙ(\text{abs}(t)≤k⋅s) = (1-α)\] thus \(ℙ(\text{abs}(t)≥k⋅s) = α\) and, by symmetry, \[ℙ(t≥k⋅s) = α/2\] Since T.INV returns the left tail, we evaluate \(k\) as -T.INV(α/2, DF), i.e. T.INV(1-α/2, DF); the confidence interval is then \(β_i ± k⋅s\)
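A sketch of the confidence-interval computation with scipy, whose `stats.t.ppf` plays the role of Excel's T.INV (left-tail inverse). The coefficient, standard error, and degrees of freedom are again made-up values:

```python
from scipy import stats

# Hypothetical values: coefficient, its standard error, degrees of freedom
beta_i, se_i, df = 0.74, 0.12, 3
alpha = 0.05

# k such that P(abs(T) <= k) = 1 - alpha for a t-distribution with df
# (equals -T.INV(alpha/2, df) since the distribution is symmetric)
k = stats.t.ppf(1 - alpha / 2, df)
ci = (beta_i - k * se_i, beta_i + k * se_i)
print(ci)
```

With only 3 degrees of freedom, \(k\) is noticeably larger than the Normal value 1.96, so the interval is wider than a Normal approximation would suggest.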
Calculate \(p\)-values and confidence intervals for all coefficients in the example regressions