To improve our understanding of t-test and ANOVA in linear models, we can use simulation.
First, we need a function to create a data frame of random values. We can call it `create_random_data(n)`. The input `n` indicates the number of rows. It must return a data frame with 3 columns called `x1`, `x2`, and `y`. The values should be chosen randomly following a Normal distribution with mean zero and variance 1.

Then we need a function that takes a data frame as input and returns a vector with values taken from `summary(lm(y ~ x1 + x2, data))`. In particular we want to get:

- The coefficients estimated by the linear model. This is the first column of the field `coefficients` of the output of `summary`. They will be `(Intercept)`, `x1` and `x2`. I would like to call them \(β_0, β_1, β_2\), or at least `B0`, `B1`, `B2`.
- The t-values reported by the linear model. This is the third column of the field `coefficients` of the output of `summary`. Let's call them `t0`, `t1`, `t2`.
- The p-values reported by the linear model. This is the fourth column of the field `coefficients` of the output of `summary`. Let's call them `p0`, `p1`, `p2`.
- The F statistic and the degrees of freedom, taken from the field `fstatistic` of the output of `summary`. We call them `f`, `df1`, `df2`.
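
The two functions described above could be sketched as follows. This is only one possible version, not a reference solution. The name `model_values` is hypothetical, since the text does not name the second function. The sketch also assumes `n` is larger than 3, so that the model has residual degrees of freedom left.

```r
# Data frame with n rows and three independent standard Normal columns.
# rnorm() defaults to mean = 0 and sd = 1, that is, variance 1.
create_random_data <- function(n) {
  data.frame(x1 = rnorm(n), x2 = rnorm(n), y = rnorm(n))
}

# Fit y ~ x1 + x2 and return the 12 requested values as a named vector.
# `model_values` is a hypothetical name chosen for this sketch.
model_values <- function(data) {
  s <- summary(lm(y ~ x1 + x2, data = data))
  co <- s$coefficients  # rows: (Intercept), x1, x2
  ans <- c(co[, "Estimate"],   # first column: coefficients
           co[, "t value"],    # third column: t-values
           co[, "Pr(>|t|)"],   # fourth column: p-values
           s$fstatistic)       # F value, numerator df, denominator df
  names(ans) <- c("B0", "B1", "B2", "t0", "t1", "t2",
                  "p0", "p1", "p2", "f", "df1", "df2")
  ans
}

# Usage: n = 10 is an arbitrary choice for illustration.
set.seed(1)
d <- create_random_data(10)
v <- model_values(d)
print(v)
```

Selecting the columns of `coefficients` by name rather than by position makes the code easier to read, at the cost of depending on the exact column labels used by `summary()`.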
Now we want to make several hundred replicas of the full process: generating a random data frame, building a linear model on it, and getting the relevant values from the model. We can use `n=3` initially. We collect all results in a data frame, with one row for each simulation and one column for each of the 12 values.

We add an extra column `pval`, with the p-value for the F statistic. We need this calculation because `summary()` does not provide it for us.

Finally we plot `B1` versus `B2` three times, coloring the points by the significance of `p1`, `p2`, and `pval`, respectively. We can draw similar plots using `t1` versus `t2` for the \(x\) and \(y\) position.
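
A minimal sketch of the simulation, the extra `pval` column, and the plots is shown here. Several things in it are assumptions made for illustration: the names `model_values`, `simulate_models`, and `plot_significance`, the choice of 500 replicas, and the 0.05 significance threshold. The sketch uses `n = 10` rather than `n = 3`, because with three rows and three coefficients the model has zero residual degrees of freedom, so the t-values, p-values, and F statistic are not meaningfully defined. The block defines compact versions of the two helper functions so that it runs on its own.

```r
# Compact helpers (hypothetical names), matching the description in the text.
create_random_data <- function(n) {
  data.frame(x1 = rnorm(n), x2 = rnorm(n), y = rnorm(n))
}
model_values <- function(data) {
  s <- summary(lm(y ~ x1 + x2, data = data))
  # Columns 1, 3, 4 of the coefficient matrix, flattened column by column,
  # followed by the F statistic and its two degrees of freedom.
  setNames(c(s$coefficients[, c(1, 3, 4)], s$fstatistic),
           c("B0", "B1", "B2", "t0", "t1", "t2",
             "p0", "p1", "p2", "f", "df1", "df2"))
}

# Repeat the whole process n_sim times: one row per simulation.
simulate_models <- function(n_sim, n) {
  res <- replicate(n_sim, model_values(create_random_data(n)))
  sims <- as.data.frame(t(res))  # replicate() returns one column per run
  # p-value of the F statistic: upper tail of the F distribution.
  sims$pval <- pf(sims$f, sims$df1, sims$df2, lower.tail = FALSE)
  sims
}

# Scatter plot of column x against column y, colored by whether the
# p-value in column p is below alpha (assumed threshold: 0.05).
plot_significance <- function(sims, x, y, p, alpha = 0.05) {
  plot(sims[[x]], sims[[y]], xlab = x, ylab = y, pch = 16,
       col = ifelse(sims[[p]] < alpha, "red", "grey60"),
       main = paste("Colored by", p, "<", alpha))
}

set.seed(1)
sims <- simulate_models(n_sim = 500, n = 10)

# Three plots of B1 versus B2, then three of t1 versus t2.
for (p in c("p1", "p2", "pval")) plot_significance(sims, "B1", "B2", p)
for (p in c("p1", "p2", "pval")) plot_significance(sims, "t1", "t2", p)
```

Changing the `n` argument of `simulate_models()` makes it easy to compare how the significant regions change with the sample size.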
We will discuss the results in class.