Blog of Andrés Aravena
Methodology of Scientific Research:

Linear models with categorical factors

24 April 2020

As usual, margin notes contain code and comments that are not essential to the discussion, but that can still be interesting.

We have seen before that linear models allow us to interpolate and extrapolate from experimental data, and define confidence intervals for these predictions. Moreover, we have seen that the coefficients of the linear model can reveal useful information1, with their corresponding confidence intervals.

All these models take one or more independent variables —the parameters that we can control— and combine them to approximate one dependent variable —the result of the experiment. In all the cases we have seen before, the independent variables were numerical values in a ratio scale2. The essential assumption of a linear model is that the dependent variable is proportional to the independent variables, except for some additive constant. Thus, twice the input must result in twice the output. For example, when we measured the speed of sound, twice the distance resulted in twice the echo delay.

In this article we will explore the case when some of the independent variables are in a nominal scale and not in a ratio scale. Remember that a nominal variable can only take one of a few possible values, called levels. Examples of nominal variables are: color, country, sex, or any other variable where the possible answers are known a priori.

In the language R these nominal variables are called factors; there is nothing equivalent in Excel. One advantage of using factors instead of plain text variables is that factors use a controlled vocabulary, where only the allowed words can be used. Therefore, there is a lower risk of data-entry errors.
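
As a small illustration with a toy example (not part of the internet data), a factor rejects any value outside its declared levels:

color <- factor(c("red", "green", "red"), levels=c("red", "green", "blue"))
levels(color)
color[2] <- "purple"   # not an allowed level: R warns and stores NA instead
color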

Modeling the price of Internet plans using linear models

Somehow, I have been using a lot of internet in the last month. So I’m exploring alternatives of service plans. I found a table on the Turk Telekom website that can be tidied up3 into an Excel spreadsheet, which you can download. We can analyze the data directly in Excel, or with R, or with any other tool. Interestingly, we can read the spreadsheet directly into R using the command read_excel() from the library readxl, as in the following code.

library(readxl)
internet <- read_excel("internet_turk_telekom.xlsx")

The data frame internet has 29 rows. There are not too many, so we can see them all in the following table

internet
Price of Domestic Internet Plans, on April 2020
package speed limit price
fibernet 100 Limitsiz 330
fibernet 100 300GB 200
fibernet 100 200GB 176
fibernet 100 100GB 157
fibernet 50 Limitsiz 310
fibernet 50 300GB 190
fibernet 50 200GB 166
fibernet 50 100GB 147
fibernet 35 Limitsiz 290
fibernet 35 300GB 187
fibernet 35 200GB 163
fibernet 35 100GB 142
fibernet 24 Limitsiz 270
fibernet 24 300GB 185
fibernet 24 200GB 161
fibernet 24 100GB 137
ultranet 16 Limitsiz 255
ultranet 16 300GB 183
ultranet 16 200GB 159
ultranet 16 100GB 135
net 12 Limitsiz 254
net 8 Limitsiz 252
net 8 200GB 155
net 8 100GB 132
net 6 Limitsiz 200
net 6 60GB 100
net 4 60GB 95
net 4 40GB 90
net 4 20GB 85

To visualize the data, it is convenient to define a new column with the symbol to plot in each case. We take the first letter of the limit variable. In other words, “L” means Limitsiz (Turkish for “limitless”), “3” means “300GB”, and so on.

internet$symbol <- substr(internet$limit, 1, 1)

The full data set looks like the following plot.

plot(price~speed, internet, pch=symbol,
     main="Raw Data")
Figure 1. Price of Internet access plans for different download speeds. Symbols represent the limit of each plan.

We can see that, for a given limit, the price seems to follow a straight line. That will be our model.

Tidying up the data

For the purpose of this article, we are going to focus only on the cases where the speed is at least 8 Mbps, since the smaller cases seem to have a different behavior. Thus, we only keep the subset of cases where speed≥8.

internet <- subset(internet, speed>=8)

We will drop the column package, since we will not use it for modeling in this article4. Finally, we need to transform the column limit from text type into a factor type, that is, a categorical variable. This will allow us to use limit as part of the linear model5.

internet <- transform(internet, limit=as.factor(limit), package=NULL)

The tidied-up table looks like this

internet
Filtered data used for modeling.
speed limit price symbol
100 Limitsiz 330 L
100 300GB 200 3
100 200GB 176 2
100 100GB 157 1
50 Limitsiz 310 L
50 300GB 190 3
50 200GB 166 2
50 100GB 147 1
35 Limitsiz 290 L
35 300GB 187 3
35 200GB 163 2
35 100GB 142 1
24 Limitsiz 270 L
24 300GB 185 3
24 200GB 161 2
24 100GB 137 1
16 Limitsiz 255 L
16 300GB 183 3
16 200GB 159 2
16 100GB 135 1
12 Limitsiz 254 L
8 Limitsiz 252 L
8 200GB 155 2
8 100GB 132 1

and the new plot looks like this

plot(price~speed, internet, pch=symbol,
     main="Filtered Data")
Figure 2. Price for different speeds and limits, for speeds of at least 8 Mbps.

Basic linear model with a numeric independent variable

Since the only numeric independent variable is speed, it is natural to try a first model using speed to predict price. In R, this relationship is written as price ~ speed, which we read as “price depends on speed”. We fit6 our first model with this code

model0 <- lm(price ~ speed, data=internet)

The model can also be fit in Excel or Google Sheets, using the =LINEST() function. You can see an example on the sheet model 0, in the online spreadsheet. Remember that =LINEST() has an option to return advanced statistical results, such as the standard error and the coefficient of determination R².

The fitted model has the following parameters

summary(model0)
Call:
lm(formula = price ~ speed, data = internet)

Residuals:
   Min     1Q Median     3Q    Max 
-59.03 -43.45 -23.66  64.30 113.97 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 185.3823    19.8484   9.340 4.12e-09 ***
speed         0.3064     0.4018   0.763    0.454    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.68 on 22 degrees of freedom
Multiple R-squared:  0.02576,   Adjusted R-squared:  -0.01853 
F-statistic: 0.5816 on 1 and 22 DF,  p-value: 0.4538

We can see that the main component is (Intercept), with a very small 𝑝-value. On the other hand, the 𝑝-value of speed is large, so we do not have any strong evidence that speed has an effect on price. Moreover, the R² value is small7, which means that most of the variance cannot be explained by model 0. We can see this graphically here

plot(price~speed, internet, pch=symbol,
     main="Model 0")
red <- "#CC5450"
points(predict(model0)~speed, internet,
        col=red, type="o", lwd=2)
Figure 3. Price predicted by model 0, compared to real data. The only independent variable is speed. The fit is not good.

This initial model shows a small tendency to increase the price when the speed increases. But the real values are far away from the values predicted by the model.
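
If we prefer to extract these quantities programmatically instead of reading them from the printed summary, we can do so directly; this is only a convenience, not a new analysis:

summary(model0)$r.squared     # coefficient of determination of model 0
confint(model0)               # confidence intervals for the coefficients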

Model with a factor

The model using only speed did not fit the data well. On the other hand, the plot shows a clear effect of limit on price. Thus, we can build a new model expressing this relationship as price ~ limit. We call this model 1, and read it as “price depends on limit”.

Since limit is a factor, we can fit the new model in R using the same command as before

model1 <- lm(price ~ limit, data=internet)

This can also be done in a spreadsheet, as we will discuss later. The result now has several coefficients, as we can see below

summary(model1)
Call:
lm(formula = price ~ limit, data = internet)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.143  -7.083  -2.167   6.464  49.857 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    141.667      7.387  19.179 2.40e-14 ***
limit200GB      21.667     10.446   2.074 0.051196 .  
limit300GB      47.333     10.956   4.320 0.000333 ***
limitLimitsiz  138.476     10.066  13.756 1.17e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.09 on 20 degrees of freedom
Multiple R-squared:  0.9186,    Adjusted R-squared:  0.9064 
F-statistic: 75.22 on 3 and 20 DF,  p-value: 4.554e-11

There is an intercept value and values for 200GB, 300GB, and Limitsiz. There is nothing labeled 100GB. Most of these coefficients have small 𝑝-values, suggesting that limit has a significant effect on price; the exception is the 200GB coefficient, whose 𝑝-value is just above 5% (we will come back to this).

The R-squared value shows that model 1 can explain about 91% of the variance8, much better than model 0.

Graphically, fitted model 1 looks like this

plot(price~speed, internet, pch=symbol,
     main="Model 1")
par(col=red, lwd=2)
points(predict(model1)~speed, internet,
  type="o", subset=limit=="100GB")
points(predict(model1)~speed, internet,
 type="o", subset=limit=="200GB")
points(predict(model1)~speed, internet,
  type="o", subset=limit=="300GB")
points(predict(model1)~speed, internet,
  type="o", subset=limit=="Limitsiz")
Figure 4. Price predicted by model 1, compared to real data. The only independent variable is limit. Model 1 fits better than model 0.

Interpreting the coefficients

Let us look again at the coefficients found using limit as the independent variable in model1. The coefficients are

coef(model1)
(Intercept) limit200GB limit300GB limitLimitsiz
141.7 21.67 47.33 138.5

As we said before, there is an intercept value and values for 200GB, 300GB, and Limitsiz, but nothing for 100GB. To understand these values we need to compare them to the averages of price for each level of limit.

averages <- tapply(internet$price,
                   internet$limit, mean)
averages
100GB 200GB 300GB Limitsiz
141.7 163.3 189 280.1

We can immediately see that the intercept value is equal to the average for 100GB, that is, the missing level. With a little arithmetic, we can notice that the other coefficients are the differences between each level’s average and the intercept.

\[ \begin{aligned} 163.3333333 & = 141.6666667 + 21.6666667 \\ 189 & = 141.6666667 + 47.3333333 \\ 280.1428571 & = 141.6666667 + 138.4761905 \\ \end{aligned} \]

In other words, the model coefficients are

  • Intercept: average of the “first” category
  • Other coefficients: difference between each of the other categories and the first one, as the following checks show.
averages["200GB"] - averages["100GB"]        # equals coef(model1)["limit200GB"]
averages["300GB"] - averages["100GB"]        # equals coef(model1)["limit300GB"]
averages["Limitsiz"] - averages["100GB"]     # equals coef(model1)["limitLimitsiz"]

We have to keep in mind that the first level plays a special role in this kind of model, different from the role of the other levels.
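
We can check which level is “first”: R sorts the levels alphabetically, so here the reference category is 100GB. If we preferred another reference, we could change it with relevel(); the following is only an illustration, and the rest of the article keeps 100GB as the reference.

levels(internet$limit)   # the first level is the reference category
# illustrative variant with "Limitsiz" as reference (not used below)
internet_alt <- transform(internet, limit=relevel(limit, ref="Limitsiz"))
coef(lm(price ~ limit, data=internet_alt))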

How to model factors in a linear model

We know from previous articles that linear models can take several independent variables and find the best linear combination to approximate a dependent variable. This idea can be extended to factors. Each factor level is represented by an independent variable that can only take the values 0 or 1.

This set of new variables is called the model matrix or design matrix. R builds it automatically. We can see how it is done in R using the function model.matrix() and the same formula used to describe the linear model. Indeed, this is done internally when we fit the model, but it will have other applications later. The R code is this

design <- model.matrix( ~ limit, data=internet)

and the resulting model matrix is

design
(Intercept) limit200GB limit300GB limitLimitsiz
1 0 0 1
1 0 1 0
1 1 0 0
1 0 0 0
1 0 0 1
1 0 1 0
1 1 0 0
1 0 0 0
1 0 0 1
1 0 1 0
1 1 0 0
1 0 0 0
1 0 0 1
1 0 1 0
1 1 0 0
1 0 0 0
1 0 0 1
1 0 1 0
1 1 0 0
1 0 0 0
1 0 0 1
1 0 0 1
1 1 0 0
1 0 0 0

The first column represents the intercept, and it is 1 in all cases. The second column is 1 when limit is equal to 200GB and 0 otherwise. The third column is 1 only when limit is equal to 300GB, and the fourth column is 1 only when limit is Limitsiz.

We can see that there is no column for 100GB. Instead, the rows of the model matrix corresponding to 100GB have 0 in all columns except, of course, the intercept column, which is always 1.
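
As an optional check that this matrix is really what the model uses, we can multiply it by the coefficients of model 1 and compare the result with the model predictions:

fitted_by_hand <- as.vector(design %*% coef(model1))
all.equal(fitted_by_hand, unname(predict(model1)))   # should be TRUE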

In Excel we can build the design matrix by first writing the level names in the columns’ headers —let’s say, in cells C1 to E1—, and then using a formula like

=IF($B2 = C$1, 1, 0)

We write this formula in cell C2 and then copy it to all the cells of the matrix (the $ signs keep column B and row 1 fixed while the formula is copied). Here we assume that column B contains the factor values. This approach can be seen online, in the sheet model 1.
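
For comparison, the same indicator columns can be built “by hand” in R; this is only an illustration, since model.matrix() already does it for us:

manual <- sapply(levels(internet$limit),
                 function(lv) as.integer(internet$limit == lv))
head(manual)   # one 0/1 column per level, like the Excel formula above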

Model with all categories

There is a technical reason9 that prevents us from having all the factor levels and the intercept at the same time. If we want to include an explicit value for the 100GB category, then we need to omit the intercept. Thus, we have a new model, called model 2.
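
One way to glimpse the problem is to build the matrix with one indicator column per level: the four columns always add up to a column of ones, which is exactly the intercept column, so keeping the intercept as well would make the columns linearly dependent. A quick sketch:

full <- model.matrix(~ limit + 0, data=internet)
all(rowSums(full) == 1)   # TRUE: the level columns sum to the intercept column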

In Excel and Google Sheets, the function =LINEST() has an option “make B=0” that really means “do not use intercept”. If we use that option, we can include a new column for 100GB. You can see this model in the sheet model 2 in the online spreadsheet.

In R we omit the intercept by writing +0 in the linear model formula. Then, model 2 is described by the formula price ~ limit + 0. We fit model 2 to the data with the same R function, as in the following code

model2 <- lm(price ~ limit + 0, data=internet)

which gives us these new coefficients

summary(model2)
Call:
lm(formula = price ~ limit + 0, data = internet)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.143  -7.083  -2.167   6.464  49.857 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
limit100GB     141.667      7.387   19.18 2.40e-14 ***
limit200GB     163.333      7.387   22.11 1.58e-15 ***
limit300GB     189.000      8.092   23.36 5.48e-16 ***
limitLimitsiz  280.143      6.839   40.96  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.09 on 20 degrees of freedom
Multiple R-squared:  0.9935,    Adjusted R-squared:  0.9923 
F-statistic: 770.1 on 4 and 20 DF,  p-value: < 2.2e-16

This time the linear model’s coefficients correspond one-to-one to each category’s average. Nevertheless, the fit of model 2 is exactly the same as that of model 1 (notice the identical residuals and residual standard error), so they have the same explanatory power. The predictions will be the same. The real difference is in the meaning of the coefficients10.
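
As a quick sanity check, we can verify that the two models make identical predictions:

all.equal(unname(predict(model1)), unname(predict(model2)))   # should be TRUE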

The model matrix is easy to find: just one column for each level of limit, with cells equal to 1 when the value matches the column name and 0 otherwise. In case of doubt, it is easy to get this matrix in R with the command model.matrix(~limit+0, data=internet).

Advantages

A reasonable question here is “why don’t we just take the averages and skip the linear model altogether?” Using a linear model has some advantages.

The first advantage of using a linear model is that we can use all the data to estimate the standard error. If we assume that all the dependent values are affected by the same kind of noise, then the linear model pools all the residuals into a single noise estimate, which is better than estimating the noise separately for each case.
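
To see the contrast, we can compare the pooled noise estimate of model 1 with a standard deviation computed separately for each group (using sigma(), available in recent versions of R); this comparison is only illustrative:

sigma(model1)                                # pooled residual standard error
tapply(internet$price, internet$limit, sd)   # one estimate per group, each from few points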

The second advantage is that we can evaluate the statistical significance of the coefficients representing differences. For example, in model 1 we saw that the average for 200GB was about 21.7₺ higher than the average for 100GB, but that difference is not statistically significant: the 𝑝-value was more than 5%.
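
For instance, we can look at the confidence interval of that coefficient; consistently with the 𝑝-value above, it should include zero:

confint(model1)["limit200GB", ]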

Another advantage of using a linear model for factors is that they can be easily combined with regular numeric independent variables.

Combined model

So far we have model 0, with a small increasing slope for speed, and models 1 and 2, with a different constant for each level of limit. We can combine them by declaring that price depends on speed plus limit, without an intercept. Thus, model 3 is represented by the formula price ~ speed + limit + 0.

The design matrix will contain one column for speed and several columns for limit. You can see how to do it with Excel, on the sheet model 3 online. In R we just write the formula and we get the fitted values

model3 <- lm(price ~ speed + limit + 0, data=internet)
summary(model3)
Call:
lm(formula = price ~ speed + limit + 0, data = internet)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.079  -6.763   1.795   4.793  23.491 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
speed           0.42444    0.07969   5.326 3.85e-05 ***
limit100GB    125.18437    5.71069  21.921 5.97e-15 ***
limit200GB    146.85104    5.71069  25.715 3.16e-16 ***
limit300GB    169.90034    6.36411  26.697  < 2e-16 ***
limitLimitsiz 265.28757    5.24632  50.566  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.76 on 19 degrees of freedom
Multiple R-squared:  0.9974,    Adjusted R-squared:  0.9967 
F-statistic:  1465 on 5 and 19 DF,  p-value: < 2.2e-16

This time the coefficient for speed is 0.42 ₺/Mbps, a value not too different from the 0.31 ₺/Mbps found in model 0. The difference is in the 𝑝-value: in model 0 the coefficient was not significantly different from 0, while in model 3 it is statistically significant. Moreover, the confidence interval for the speed coefficient is narrower in model 3, and does not include 0.

confint(model0)["speed",]
     2.5 %     97.5 % 
-0.5268568  1.1397308 
confint(model3)["speed",]
    2.5 %    97.5 % 
0.2576540 0.5912197 

The coefficients for each level of limit are smaller than in model 2, since we now discount the contribution of speed. Their standard errors are smaller, so we get narrower confidence intervals. It is a win-win situation. We can see this in the following plot.

plot(price~speed, data=internet, pch=symbol,
     main="Model 3")
par(col=red, lwd=2)
points(predict(model3)~speed, data=internet,
  type="o", subset=limit=="100GB")
points(predict(model3)~speed, data=internet,
  type="o", subset=limit=="200GB")
points(predict(model3)~speed, data=internet,
  type="o", subset=limit=="300GB")
points(predict(model3)~speed, data=internet,
  type="o", subset=limit=="Limitsiz")
Figure 5. Price predicted by model 3, compared to real data. There are two independent variables: speed and limit. The fit is better than in all previous models.

The speed coefficient shows the average slope, discounting any influence from the limit variable. It is clear that, in this model, the main factor affecting the price is the limit, since the speed contribution is only 0.42₺ for each Mbps.

This model assumes that the increase of price when speed increases is the same for each level of limit. This hypothesis should be tested.

Model with independent slopes

Finally, we can build a model with a different speed slope for each level of limit. In other words, we need to specify that the speed coefficient interacts with limit. We do that by including an interaction term. In R we specify this interaction with the formula price ~ speed:limit + limit + 0, which literally means “price depends on speed combined with limit, plus something that depends on limit, and there is no intercept”. This will be model 4.

In Excel we need to insert more columns, one for each level of limit. Each cell contains the value of speed only in the column corresponding to the value of limit; this is easily done by multiplying each speed value by the columns representing limit. You can see how this is done in sheet model 4 of the online document. Alternatively, you can use the model.matrix() function of R with the corresponding formula, as shown below.
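
For instance, something like the following should reproduce the design matrix of model 4 (shown here only for the first rows):

head(model.matrix(~ speed:limit + limit + 0, data=internet))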

We fit model 4 to the data as usual,

model4 <- lm(price ~ speed:limit + limit + 0, data=internet)

and we get the following results

summary(model4)
Call:
lm(formula = price ~ speed:limit + limit + 0, data = internet)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3569 -1.2862 -0.0709  0.5093 16.1924 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
limit100GB          131.10500    3.92435  33.408 3.15e-16 ***
limit200GB          154.99689    3.92435  39.496  < 2e-16 ***
limit300GB          179.98375    4.80998  37.419  < 2e-16 ***
limitLimitsiz       248.25837    3.46430  71.662  < 2e-16 ***
speed:limit100GB      0.27197    0.07950   3.421   0.0035 ** 
speed:limit200GB      0.21467    0.07950   2.700   0.0158 *  
speed:limit300GB      0.20036    0.08914   2.248   0.0391 *  
speed:limitLimitsiz   0.91099    0.07543  12.077 1.87e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.935 on 16 degrees of freedom
Multiple R-squared:  0.9994,    Adjusted R-squared:  0.9992 
F-statistic:  3600 on 8 and 16 DF,  p-value: < 2.2e-16

As expected, R² is better than in the previous models. The coefficients for each level of limit have narrower intervals. Most interestingly, the speed slope for “Limitsiz” is much higher than the slopes for the other levels of limit. Therefore the hypothesis used in model 3 was not valid.
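
If we want a formal check instead of comparing coefficients by eye, we can compare the two nested models with an F-test; a small 𝑝-value here confirms that the separate slopes of model 4 are worth keeping:

anova(model3, model4)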

The plot of this model is this.

plot(price~speed, internet, pch=symbol,
     main="Model 4")
par(col=red, lwd=2)
points(predict(model4)~speed, data=internet,
  type="o", subset=limit=="100GB")
points(predict(model4)~speed, data=internet,
  type="o", subset=limit=="200GB")
points(predict(model4)~speed, data=internet,
  type="o", subset=limit=="300GB")
points(predict(model4)~speed, data=internet,
  type="o", subset=limit=="Limitsiz")
Figure 6. Price predicted by model 4, compared to real data. Each category of limit has its own slope, given by the interaction element speed:limit. The fit is better than in all previous models.

In theory we could find the same results by fitting independent models for each level of limit, as sketched below. The advantage of the combined model is that it uses more samples to estimate the standard error, and therefore gives narrower confidence intervals for each coefficient and prediction. The big assumption here is that the “noise” source is the same in every case.
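
For the curious, this is what the “independent models” alternative would look like, with one small model per level of limit; each of these models estimates its own noise from only a handful of points:

separate <- lapply(split(internet, internet$limit),
                   function(d) lm(price ~ speed, data=d))
sapply(separate, coef)   # intercept and slope for each level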

Exercise

To simplify the explanation, here we modeled the raw values. Nevertheless, it seems that using a log-log scale produces better alignments. Indeed, speed grows exponentially in our data, therefore it may be better to use log(speed) as the independent variable. Naturally, we cannot take the logarithm of a factor, but we can consider log(price).

Your challenge is to repeat this analysis but using log(speed) and limit to predict log(price). You can choose to skip model 1 or model 2, but not both.

Then —and this is the important part— you should write a report explaining the meaning of the results. The mathematics of this exercise is easy, and you can copy-and-paste the R code. The real value —the science— is in the interpretation of the results.
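
If you need a starting point, one possible formula (only a sketch; the model choice and its interpretation are up to you) is the following:

model_log <- lm(log(price) ~ log(speed) + limit + 0, data=internet)
summary(model_log)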