# Model matrix for many factors

## So far the independent variables have been

• Numeric values (e.g. height)
• Factors (e.g. sex, age, diet, stress, tissue)
• Sum of numeric and factor (e.g. height + sex)
• Often we have an Intercept, but it is optional

## They are coded as matrices

The R command model.matrix transforms a formula and a data frame into a numeric matrix

Internally R uses it to prepare the linear model

Linear models only work with numbers, so anything else (such as a factor) must first be encoded as numbers

## Example with Intercept

We will use this example data

     sex height weight  hand
1   Male    179     67 Right
2 Female    168     55 Right
4   Male    170     74  Left
5 Female    162     68  Left

which can be encoded as the following model matrix

model.matrix(~ sex + height, data=students) |> as.data.frame()
  (Intercept) sexMale height
1           1       1    179
2           1       0    168
4           1       1    170
5           1       0    162
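As a minimal sketch, the example can be reproduced by rebuilding the data by hand. The `students` data frame below is typed in from the table above, and the response `weight` in the `lm()` call is an assumption made only to show that `lm()` builds the same matrix internally.

```r
# Rebuild the example data; row names 1, 2, 4, 5 match the printed table.
students <- data.frame(
  sex    = c("Male", "Female", "Male", "Female"),
  height = c(179, 168, 170, 162),
  weight = c(67, 55, 74, 68),
  hand   = c("Right", "Right", "Left", "Left"),
  row.names = c("1", "2", "4", "5")
)

# Design matrix built directly from the formula.
X <- model.matrix(~ sex + height, data = students)

# Design matrix that lm() builds for the same right-hand side.
fit  <- lm(weight ~ sex + height, data = students)
X_lm <- model.matrix(fit)

# Compare the values of both matrices.
same_matrix <- isTRUE(all.equal(X, X_lm, check.attributes = FALSE))

print(X)
print(same_matrix)
```

The factor `sex` becomes a single 0/1 column named `sexMale`, because Female (the first level alphabetically) is absorbed into the intercept.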

## Interpretation (reminder)

If the model is $y_i = β_0 + β_1 s_i + e_i$, where $s_i$ is 1 for Male and 0 for Female, then

$$\begin{aligned} β_0 &= \text{mean}(\text{Female})\\ β_1 &= \text{mean}(\text{Male})-\text{mean}(\text{Female}) \end{aligned}$$
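This reminder can be checked numerically. The sketch below assumes that the response $y$ is `weight` and retypes the example data by hand; it compares the fitted coefficients with the group means.

```r
# Example data, typed in from the table shown earlier.
students <- data.frame(
  sex    = c("Male", "Female", "Male", "Female"),
  height = c(179, 168, 170, 162),
  weight = c(67, 55, 74, 68),
  hand   = c("Right", "Right", "Left", "Left"),
  row.names = c("1", "2", "4", "5")
)

# Model with an intercept and a single indicator for Male.
fit <- lm(weight ~ sex, data = students)

# Group means computed directly.
female_mean <- mean(students$weight[students$sex == "Female"])
male_mean   <- mean(students$weight[students$sex == "Male"])

print(coef(fit))
print(c(female_mean, male_mean - female_mean))
```

Both printed vectors should agree: the intercept is the Female mean and the `sexMale` coefficient is the difference between the two means.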

## Interpretation

If the model is $y_i = β_0 + β_1 s_i + β_2 h_i + e_i$, where $h_i$ is the height of person $i$, then

$$\begin{aligned} β_0 &= \text{baseline}(\text{Female})\\ β_1 &= \text{baseline}(\text{Male})-\text{baseline}(\text{Female})\\ β_2 &= \text{slope}(\text{Height}) \end{aligned}$$

Here baseline means the expected value of $y$ at height 0.

## Example without Intercept

Now Male and Female each get their own indicator column

model.matrix(~ sex + height + 0, data=students) |> as.data.frame()
  sexFemale sexMale height
1         0       1    179
2         1       0    168
4         0       1    170
5         1       0    162

## Interpretation

Now the model is $y_i = β_1 f_i + β_2 m_i + β_3 h_i + e_i$, where $m_i$ is 1 for Male and 0 for Female, $f_i$ is 1 for Female and 0 for Male, and $h_i$ is the height. Then

$$\begin{aligned} β_1 &= \text{baseline}(\text{Female})\\ β_2 &= \text{baseline}(\text{Male})\\ β_3 &= \text{slope}(\text{Height}) \end{aligned}$$
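The versions with and without intercept describe the same fit with different coefficients. The following sketch, which assumes `weight` as the response and retypes the example data, fits both and checks how the coefficients relate.

```r
# Example data, typed in from the table shown earlier.
students <- data.frame(
  sex    = c("Male", "Female", "Male", "Female"),
  height = c(179, 168, 170, 162),
  weight = c(67, 55, 74, 68),
  hand   = c("Right", "Right", "Left", "Left"),
  row.names = c("1", "2", "4", "5")
)

with_int <- lm(weight ~ sex + height,     data = students)
no_int   <- lm(weight ~ sex + height + 0, data = students)

b <- coef(with_int)  # (Intercept), sexMale, height
g <- coef(no_int)    # sexFemale, sexMale, height

# Both parameterizations give the same predictions.
same_fit <- isTRUE(all.equal(fitted(with_int), fitted(no_int)))

# Female baseline is the old intercept; Male baseline is intercept + difference.
female_ok <- isTRUE(all.equal(g[["sexFemale"]], b[["(Intercept)"]]))
male_ok   <- isTRUE(all.equal(g[["sexMale"]], b[["(Intercept)"]] + b[["sexMale"]]))
slope_ok  <- isTRUE(all.equal(g[["height"]], b[["height"]]))

print(c(same_fit, female_ok, male_ok, slope_ok))
```

Choosing one form or the other changes what each coefficient means, not the quality of the fit.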

## Not all combinations at the same time

Notice that we have either

• An intercept and an indicator for Male
• An indicator for Male and another for Female

But we cannot have the three at the same time

In that case the columns are perfectly collinear: the intercept column is exactly the sum of the Male and Female indicators

If the model were $y_i = β_0 + β_1 f_i + β_2 m_i + e_i$, then for any constant $c$ the coefficients

$$\begin{aligned} β_0 &= c\\ β_1 &= \text{baseline}(\text{Female}) - c\\ β_2 &= \text{baseline}(\text{Male}) - c \end{aligned}$$

give exactly the same predictions. In other words, the coefficients are not uniquely determined, so we cannot interpret them
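The collinearity can be seen directly in R. In this sketch the columns `f` and `m` are hypothetical names mirroring $f_i$ and $m_i$, the data are retyped by hand, and `weight` is assumed to be the response.

```r
# Example data, typed in from the table shown earlier.
students <- data.frame(
  sex    = c("Male", "Female", "Male", "Female"),
  height = c(179, 168, 170, 162),
  weight = c(67, 55, 74, 68),
  hand   = c("Right", "Right", "Left", "Left"),
  row.names = c("1", "2", "4", "5")
)

# Add explicit indicator columns for Female and Male.
d <- transform(students,
               f = as.numeric(sex == "Female"),
               m = as.numeric(sex == "Male"))

# Intercept plus both indicators: three columns.
X <- cbind(intercept = 1, f = d$f, m = d$m)

# The intercept column equals f + m, so the rank is 2, not 3.
intercept_is_sum <- all(X[, "intercept"] == X[, "f"] + X[, "m"])
X_rank <- qr(X)$rank

# lm() cannot estimate all three coefficients and reports NA for one of them.
fit <- lm(weight ~ f + m, data = d)

print(intercept_is_sum)
print(X_rank)
print(coef(fit))
```

R resolves the ambiguity by dropping the last redundant column, which amounts to picking one particular value of the free constant.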

## There are other combinations

Maybe the weight depends also on handedness

model.matrix(~ sex + hand + 0, data=students) |> as.data.frame()
  sexFemale sexMale handRight
1         0       1         1
2         1       0         1
4         0       1         0
5         1       0         0

Here we assume that the effect of hand is the same for both sexes (the two effects simply add up)

But what if they interact?

## Interactions

Maybe left-handed males are heavier

model.matrix(~ sex:hand + 0, data=students) |> as.data.frame()
  sexFemale:handLeft sexMale:handLeft sexFemale:handRight sexMale:handRight
1                  0                0                   0                 1
2                  0                0                   1                 0
4                  0                1                   0                 0
5                  1                0                   0                 0

## : means interaction

As we saw, the expression sex:hand creates four columns

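• sexFemale:handLeft, which is 1 when sex is Female and hand is Left, and 0 otherwise
• sexMale:handLeft, which is 1 when sex is Male and hand is Left
• sexFemale:handRight, which is 1 when sex is Female and hand is Right
• sexMale:handRight, which is 1 when sex is Male and hand is Right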
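Each of these columns is the product of two indicators. The sketch below retypes the example data and checks this for one column; `male` and `left` are hypothetical helper names.

```r
# Example data, typed in from the table shown earlier.
students <- data.frame(
  sex    = c("Male", "Female", "Male", "Female"),
  height = c(179, 168, 170, 162),
  weight = c(67, 55, 74, 68),
  hand   = c("Right", "Right", "Left", "Left"),
  row.names = c("1", "2", "4", "5")
)

X <- model.matrix(~ sex:hand + 0, data = students)

# Plain 0/1 indicators built by hand.
male <- as.numeric(students$sex == "Male")
left <- as.numeric(students$hand == "Left")

# The interaction column is the element-wise product of the indicators.
is_product <- all(X[, "sexMale:handLeft"] == male * left)

# Every student falls in exactly one sex/hand combination.
one_per_row <- all(rowSums(X) == 1)

print(is_product)
print(one_per_row)
```

With one column per combination, each coefficient of this model describes one group (for example, left-handed males) on its own.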

## Interaction and sum

A common case is

model.matrix(~ sex:hand + sex + hand + 0, data=students) |> as.data.frame()
  sexFemale sexMale handRight sexMale:handRight
1         0       1         1                 1
2         1       0         1                 0
4         0       1         0                 0
5         1       0         0                 0

which can also be written as

model.matrix(~ sex*hand + 0, data=students) |> as.data.frame()
  sexFemale sexMale handRight sexMale:handRight
1         0       1         1                 1
2         1       0         1                 0
4         0       1         0                 0
5         1       0         0                 0
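The equivalence of the two formulas can be checked rather than read off the printed output. This is a small sketch on the retyped example data; the names `explicit` and `starred` are only illustrative.

```r
# Example data, typed in from the table shown earlier.
students <- data.frame(
  sex    = c("Male", "Female", "Male", "Female"),
  height = c(179, 168, 170, 162),
  weight = c(67, 55, 74, 68),
  hand   = c("Right", "Right", "Left", "Left"),
  row.names = c("1", "2", "4", "5")
)

# Main effects and interaction written out term by term.
explicit <- model.matrix(~ sex:hand + sex + hand + 0, data = students)

# The same model using the * shorthand.
starred <- model.matrix(~ sex*hand + 0, data = students)

same_columns <- identical(colnames(explicit), colnames(starred))
same_values  <- isTRUE(all.equal(explicit, starred, check.attributes = FALSE))

print(same_columns)
print(same_values)
```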

## Exercise

In this model, what is the interpretation of

• sexFemale
• sexMale
• handRight
• sexMale:handRight

## Combining factors and numeric

What about the interaction between sex and height?

model.matrix(~ sex:height + 0, data=students) |> as.data.frame()
  sexFemale:height sexMale:height
1                0            179
2              168              0
4                0            170
5              162              0

What is the interpretation here?

## Interpretation

Now the model is $y_i = β_1 f_i h_i + β_2 m_i h_i + e_i$, where $m_i$ is 1 for Male and 0 for Female, $f_i$ is 1 for Female and 0 for Male, and $h_i$ is the height. Then

$$\begin{aligned} β_1 &= \text{slope}(\text{Height}|\text{Female})\\ β_2 &= \text{slope}(\text{Height}|\text{Male}) \end{aligned}$$

This model has no intercept, so both lines are forced through the origin. That rarely makes sense (unless we center the data)

## Adding intercept for each sex

model.matrix(~ sex:height + sex + 0, data=students) |> as.data.frame()
  sexFemale sexMale sexFemale:height sexMale:height
1         0       1                0            179
2         1       0              168              0
4         0       1                0            170
5         1       0              162              0

What is the interpretation here?
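One way to explore this question is to compare the combined model with two separate regressions, one per sex. The sketch below assumes `weight` as the response and retypes the example data. With only two students per sex, each line passes exactly through its two points, so this illustrates the coding only, not a realistic analysis.

```r
# Example data, typed in from the table shown earlier.
students <- data.frame(
  sex    = c("Male", "Female", "Male", "Female"),
  height = c(179, 168, 170, 162),
  weight = c(67, 55, 74, 68),
  hand   = c("Right", "Right", "Left", "Left"),
  row.names = c("1", "2", "4", "5")
)

# One model with a separate intercept and slope for each sex.
both <- lm(weight ~ sex:height + sex + 0, data = students)

# Two ordinary regressions, each fitted on one sex only.
female <- lm(weight ~ height, data = students, subset = sex == "Female")
male   <- lm(weight ~ height, data = students, subset = sex == "Male")

b <- coef(both)

# Compare intercepts and slopes of the combined model with the separate fits.
female_ok <- isTRUE(all.equal(
  c(b[["sexFemale"]], b[["sexFemale:height"]]), unname(coef(female))))
male_ok <- isTRUE(all.equal(
  c(b[["sexMale"]], b[["sexMale:height"]]), unname(coef(male))))

print(b)
print(c(female_ok, male_ok))
```

If both checks are true, the four coefficients can be read as two independent lines: an intercept and a height slope for Female, and another pair for Male.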

## Summary

• Interactions create new variables in the linear model
• Chosen wisely, they will tell you what you want to know
• You can compare them later, using contrasts