Homework 2

To improve our understanding of the t-test and ANOVA in linear models, we can use simulation.

First, we need a function to create a data frame of random values. We can call it create_random_data(n). The input n is the number of rows. The function must return a data frame with 3 columns called x1, x2, and y. All values should be drawn at random from a Normal distribution with mean 0 and variance 1.
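
One possible sketch of create_random_data(n) is shown below. It assumes that the three columns are drawn independently with rnorm(), whose defaults are mean 0 and standard deviation 1. The seed and the value n = 10 are arbitrary choices for illustration only.

```r
# Minimal sketch: a data frame with n rows and columns x1, x2, y,
# each drawn from a standard Normal distribution (rnorm defaults).
create_random_data <- function(n) {
  data.frame(
    x1 = rnorm(n),
    x2 = rnorm(n),
    y  = rnorm(n)
  )
}

set.seed(1)                    # arbitrary seed, only for reproducibility
d <- create_random_data(10)    # n = 10 is an illustrative value
str(d)                         # inspect the structure of the result
```

Since y is generated independently of x1 and x2, any "significant" relationship found later by the linear model is a false positive.
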
Then we need a function that takes a data frame as input and returns a vector with values taken from summary(lm(y ~ x1 + x2, data)). In particular, we want to get:
- The coefficients estimated by the linear model. This is the first column of the field coefficients of the output of summary. They will be (Intercept), x1 and x2. I would like to call them \(β_0, β_1, β_2\), or at least B0, B1, B2.
- The t-values calculated by the linear model. This is the third column of the field coefficients of the output of summary. Let’s call them t0, t1, t2.
- The p-values calculated by the linear model. This is the fourth column of the field coefficients of the output of summary. Let’s call them p0, p1, p2.
- The F statistic and its degrees of freedom, taken from the field fstatistic of the output of summary. We call them f, df1, df2.
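
The extraction described in the list above could look roughly like the following sketch. The function name get_model_values is hypothetical (the assignment does not name it), and the example data frame with 20 rows is only there to make the block runnable on its own.

```r
# Sketch: fit y ~ x1 + x2 and return the 12 values as a named vector.
# The name get_model_values is an assumption, not given in the assignment.
get_model_values <- function(data) {
  s  <- summary(lm(y ~ x1 + x2, data = data))
  co <- s$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
  fs <- s$fstatistic     # named vector: value, numdf, dendf
  values <- c(co[, 1], co[, 3], co[, 4], fs)
  names(values) <- c("B0", "B1", "B2",
                     "t0", "t1", "t2",
                     "p0", "p1", "p2",
                     "f", "df1", "df2")
  values
}

set.seed(2)   # arbitrary seed
example_data <- data.frame(x1 = rnorm(20), x2 = rnorm(20), y = rnorm(20))
vals <- get_model_values(example_data)
print(round(vals, 3))
```

Assigning the names in one step keeps the output independent of the row names that summary() uses for the coefficient matrix.
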
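Now we want to make several hundred replicates of the full process: generating a random data frame, building a linear model on it, and getting the relevant values from the model. We can use n=3 initially. Keep in mind that with n=3 the model has as many coefficients as observations, which leaves zero residual degrees of freedom, so the t and F statistics cannot be computed meaningfully; it is worth trying larger values of n as well. We collect all results in a data frame, with one row for each simulation and one column for each of the 12 values.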
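
The replication step might be sketched as follows. The helper run_simulations and its arguments are hypothetical names, the two functions are repeated here in compact form so that the block runs on its own, and n_sim = 500 with n = 20 are arbitrary illustrative values (the assignment starts with n=3).

```r
# Compact versions of the two functions described earlier in the assignment.
create_random_data <- function(n) {
  data.frame(x1 = rnorm(n), x2 = rnorm(n), y = rnorm(n))
}
get_model_values <- function(data) {   # hypothetical name
  s <- summary(lm(y ~ x1 + x2, data = data))
  v <- c(s$coefficients[, 1], s$coefficients[, 3],
         s$coefficients[, 4], s$fstatistic)
  names(v) <- c("B0", "B1", "B2", "t0", "t1", "t2",
                "p0", "p1", "p2", "f", "df1", "df2")
  v
}

# Sketch: repeat "simulate, fit, extract" n_sim times, one row per run.
run_simulations <- function(n_sim, n) {
  rows <- replicate(n_sim,
                    get_model_values(create_random_data(n)),
                    simplify = FALSE)
  as.data.frame(do.call(rbind, rows))
}

set.seed(42)                                   # arbitrary seed
results <- run_simulations(n_sim = 500, n = 20)
head(results)
```

Collecting the vectors in a list and binding them once at the end avoids growing a data frame row by row inside a loop.
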

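We add an extra column pval, with the p-value of the F statistic. We need to calculate it ourselves because the object returned by summary() does not store it. It is the upper-tail probability of the F distribution with df1 and df2 degrees of freedom, evaluated at f.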
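
As a sketch of this step, the upper-tail probability can be obtained with pf() and lower.tail = FALSE. To keep the block runnable on its own, it builds a small results data frame containing only f, df1 and df2; the 200 replicates and n = 20 are illustrative assumptions.

```r
set.seed(7)   # arbitrary seed

# Small stand-in for the simulation results: only the F statistic columns.
results <- as.data.frame(t(replicate(200, {
  d  <- data.frame(x1 = rnorm(20), x2 = rnorm(20), y = rnorm(20))
  fs <- summary(lm(y ~ x1 + x2, data = d))$fstatistic
  c(f = fs[["value"]], df1 = fs[["numdf"]], df2 = fs[["dendf"]])
})))

# p-value of the F test: P(F >= f) under the F(df1, df2) distribution.
results$pval <- pf(results$f, results$df1, results$df2, lower.tail = FALSE)

summary(results$pval)
```

Because y does not depend on x1 or x2 in this simulation, the values of pval should be spread roughly evenly between 0 and 1.
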
Finally, we plot B1 versus B2 three times, coloring the points according to the significance of p1, p2, and pval, respectively. We can draw similar plots using t1 and t2 for the \(x\) and \(y\) positions.
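
The plots could be drawn with base R graphics along the lines of the sketch below. The significance threshold alpha = 0.05, the colors, the helper names sim_one and sig_col, and the simulation size (300 replicates, n = 20) are all assumptions made for illustration; the assignment does not fix them.

```r
set.seed(123)   # arbitrary seed

# One simulation run, keeping only the values needed for the plots.
sim_one <- function(n) {               # hypothetical helper
  d  <- data.frame(x1 = rnorm(n), x2 = rnorm(n), y = rnorm(n))
  s  <- summary(lm(y ~ x1 + x2, data = d))
  co <- s$coefficients
  fs <- s$fstatistic
  v  <- c(co[2:3, 1], co[2:3, 3], co[2:3, 4],
          pf(fs[["value"]], fs[["numdf"]], fs[["dendf"]], lower.tail = FALSE))
  names(v) <- c("B1", "B2", "t1", "t2", "p1", "p2", "pval")
  v
}
results <- as.data.frame(t(replicate(300, sim_one(20))))

# Red if the p-value is below alpha (assumed 0.05), grey otherwise.
sig_col <- function(p, alpha = 0.05) ifelse(p < alpha, "red", "grey60")

op <- par(mfrow = c(1, 3))             # three panels side by side
for (v in c("p1", "p2", "pval")) {
  plot(results$B1, results$B2, col = sig_col(results[[v]]), pch = 16,
       xlab = "B1", ylab = "B2", main = paste("colored by", v))
}
par(op)                                # restore graphics settings
```

Replacing results$B1 and results$B2 with results$t1 and results$t2 in the plot() call gives the corresponding plots for the t-values.
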

We will discuss the results in class.