As usual, margin notes contain code and comments that are not essential to the discussion, but that can still be interesting.

We have seen before that linear models allow us to interpolate and
extrapolate from experimental data, and define confidence intervals for
these predictions. Moreover, we have seen that the coefficients of the
linear model can reveal useful information^{1},
with their corresponding confidence intervals.

All these models take one or more *independent* variables —the
parameters that we can control— and combine them to approximate one
*dependent* variable —the result of the experiment. In all the
cases we have seen before, the independent variables were numerical
values in a ratio scale^{2}. The essential
assumption of a linear model is that the dependent variable is
proportional to the independent variables, except for some additive
constant. Thus, twice the input must result in twice the output. For
example, when we measured the sound speed, twice the distance resulted
in twice the echo delay.

In this article we will explore the case when some of the independent
variables are in a nominal scale and not in a ratio scale. Remember that
a nominal variable can only take one of a few possible values, called
*levels*. Examples of nominal variables are: color, country, sex,
or any other variable where the possible answers are known *a
priori*.

In the R language, these nominal variables are called *factors*.
One advantage of using *factors* instead of *text* variables is
that factors use a *controlled vocabulary*, where only the allowed
words can be used. Therefore, there is a lower risk of input-data
errors. There is nothing equivalent in Excel.
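As a small illustration (the `color` variable here is invented for the example), a factor declares its allowed levels up front, and any value outside that controlled vocabulary becomes `NA` instead of silently entering the data:

```r
# A factor declares its allowed levels; anything else becomes NA.
color <- factor(c("red", "blue", "red", "yellow"),
                levels = c("red", "green", "blue"))
print(levels(color))  # "red" "green" "blue"
print(is.na(color))   # FALSE FALSE FALSE TRUE: "yellow" was rejected
```

This is exactly the kind of protection against typos that a plain text column cannot give us.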

# Modeling the price of Internet plans using linear models

Somehow, I have been using a lot of internet in the last month, so
I’m exploring alternative service plans. I found a table on the Turk
Telekom website that can be tidied up^{3} into an Excel
spreadsheet that you can download. We can analyze the data directly
in Excel, with R, or with any other tool. Interestingly, we can read
the spreadsheet directly into R using the function `read_excel()`
from the library `readxl`, as in the following code.

```
library(readxl)
internet <- read_excel("internet_turk_telekom.xlsx")
```

The data frame `internet` has 29 rows. They are not too many, so we
can see all of them in the following table

`internet`

package | speed | limit | price |
---|---|---|---|
fibernet | 100 | Limitsiz | 330 |
fibernet | 100 | 300GB | 200 |
fibernet | 100 | 200GB | 176 |
fibernet | 100 | 100GB | 157 |
fibernet | 50 | Limitsiz | 310 |
fibernet | 50 | 300GB | 190 |
fibernet | 50 | 200GB | 166 |
fibernet | 50 | 100GB | 147 |
fibernet | 35 | Limitsiz | 290 |
fibernet | 35 | 300GB | 187 |
fibernet | 35 | 200GB | 163 |
fibernet | 35 | 100GB | 142 |
fibernet | 24 | Limitsiz | 270 |
fibernet | 24 | 300GB | 185 |
fibernet | 24 | 200GB | 161 |
fibernet | 24 | 100GB | 137 |
ultranet | 16 | Limitsiz | 255 |
ultranet | 16 | 300GB | 183 |
ultranet | 16 | 200GB | 159 |
ultranet | 16 | 100GB | 135 |
net | 12 | Limitsiz | 254 |
net | 8 | Limitsiz | 252 |
net | 8 | 200GB | 155 |
net | 8 | 100GB | 132 |
net | 6 | Limitsiz | 200 |
net | 6 | 60GB | 100 |
net | 4 | 60GB | 95 |
net | 4 | 40GB | 90 |
net | 4 | 20GB | 85 |

To visualize the data, it is convenient to define a new column with
the symbol to plot in each case. We take the first letter of the
*limit* variable. In other words, *“L”* means
*Limitsiz* (Turkish for *“limitless”*), *“3”* means
*“300GB”*, and so on.

`internet$symbol <- substr(internet$limit, 1, 1)`

The full data set looks like the following plot.

```
plot(price~speed, internet, pch=symbol,
main="Raw Data")
```

We can see that, for a given limit, the price seems to follow a straight line. That will be our model.

## Tidying up the data

For the purpose of this article, we are going to focus only on the
cases where the speed is at least 8 Mbps, since the slower plans seem
to behave differently. Thus, we only keep the subset of cases
where `speed` ≥ 8.

`internet <- subset(internet, speed>=8)`

We will drop the column `package`, since we will not use
it for modeling in this article^{4}. Finally, we need to
transform the column *limit* from text type into a factor type,
that is, a categorical variable. This will allow us to use
*limit* as part of the linear model^{5}.

`internet <- transform(internet, limit=as.factor(limit), package=NULL)`

The new tidied-up table looks like this

`internet`

speed | limit | price | symbol |
---|---|---|---|
100 | Limitsiz | 330 | L |
100 | 300GB | 200 | 3 |
100 | 200GB | 176 | 2 |
100 | 100GB | 157 | 1 |
50 | Limitsiz | 310 | L |
50 | 300GB | 190 | 3 |
50 | 200GB | 166 | 2 |
50 | 100GB | 147 | 1 |
35 | Limitsiz | 290 | L |
35 | 300GB | 187 | 3 |
35 | 200GB | 163 | 2 |
35 | 100GB | 142 | 1 |
24 | Limitsiz | 270 | L |
24 | 300GB | 185 | 3 |
24 | 200GB | 161 | 2 |
24 | 100GB | 137 | 1 |
16 | Limitsiz | 255 | L |
16 | 300GB | 183 | 3 |
16 | 200GB | 159 | 2 |
16 | 100GB | 135 | 1 |
12 | Limitsiz | 254 | L |
8 | Limitsiz | 252 | L |
8 | 200GB | 155 | 2 |
8 | 100GB | 132 | 1 |

and the new plot looks like this

```
plot(price~speed, internet, pch=symbol,
main="Filtered Data")
```

# Basic linear model with a numeric independent variable

Since the only numeric independent variable is *speed*, it is
natural to try a first model using *speed* to predict
*price*. In R, these relationships are written as
`price ~ speed`, which we read as *“price depends on
speed”*. We fit^{6} our first model with this code

`model0 <- lm(price ~ speed, data=internet)`

The model can also be fit in Excel or Google Sheets, using the
`=LINEST()` function. You can see an example on the sheet
*model 0* of the online spreadsheet. Remember that `=LINEST()`
has an option to return advanced statistical results, such as the
standard error and the coefficient of determination *R²*.

The fitted model has the following parameters

`summary(model0)`

```
Call:
lm(formula = price ~ speed, data = internet)

Residuals:
   Min     1Q Median     3Q    Max 
-59.03 -43.45 -23.66  64.30 113.97 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 185.3823    19.8484   9.340 4.12e-09 ***
speed         0.3064     0.4018   0.763    0.454    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.68 on 22 degrees of freedom
Multiple R-squared:  0.02576,  Adjusted R-squared:  -0.01853
F-statistic: 0.5816 on 1 and 22 DF,  p-value: 0.4538
```

We can see that the main component is `(Intercept)`, with
a very small 𝑝-value. On the other hand, the 𝑝-value of
`speed` is large, so we do not have any strong evidence that
*speed* has any effect on *price*. Moreover, the
*R²* value is small^{7}, which means that most
of the variance cannot be explained by *model 0*. We can see it
graphically here

```
red <- "#CC5450"
plot(price~speed, internet, pch=symbol,
     main="Model 0")
points(predict(model0)~speed, internet,
       col=red, type="o", lwd=2)
```

This initial model shows a small tendency to increase the price when the speed increases. But the real values are far away from the values predicted by the model.

# Model with a factor

The model using only *speed* did not fit the data well. On
the other hand, we have a clear effect of *limit* on
*price*. Thus, we can build a new model expressing this
relationship as `price ~ limit`. We can call this *model
1*, and read it like “*price* depends on *limit*”.

Since *limit* is a factor, we can fit the new model in R using
the same command as before

`model1 <- lm(price ~ limit, data=internet)`

This can also be done in a spreadsheet, as we will discuss later. The result now has several coefficients, as we can see below

`summary(model1)`

```
Call:
lm(formula = price ~ limit, data = internet)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.143  -7.083  -2.167   6.464  49.857 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    141.667      7.387  19.179 2.40e-14 ***
limit200GB      21.667     10.446   2.074 0.051196 .  
limit300GB      47.333     10.956   4.320 0.000333 ***
limitLimitsiz  138.476     10.066  13.756 1.17e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.09 on 20 degrees of freedom
Multiple R-squared:  0.9186,  Adjusted R-squared:  0.9064
F-statistic: 75.22 on 3 and 20 DF,  p-value: 4.554e-11
```

There is an *intercept* value and values for *200GB*,
*300GB*, and *Limitsiz*. There is nothing labeled
*100GB*. Most of these coefficients have small 𝑝-values (the
one for *200GB* is borderline), suggesting that there is a
significant effect of *limit* on *price*.

The `R-squared` value shows that *model 1* can
explain almost 92% of the variance^{8}. Much better than in
*model 0*.

Graphically, the fitted *model 1* looks like this

```
plot(price~speed, internet, pch=symbol,
main="Model 1")
par(col=red, lwd=2)
points(predict(model1)~speed, internet,
type="o", subset=limit=="100GB")
points(predict(model1)~speed, internet,
type="o", subset=limit=="200GB")
points(predict(model1)~speed, internet,
type="o", subset=limit=="300GB")
points(predict(model1)~speed, internet,
type="o", subset=limit=="Limitsiz")
```

## Interpreting the coefficients

Let us look again at the coefficients found using *limit* as the
independent variable in `model1`. The coefficients are

`coef(model1)`

(Intercept) | limit200GB | limit300GB | limitLimitsiz |
---|---|---|---|
141.7 | 21.67 | 47.33 | 138.5 |

As we said before, there is an *intercept* value and values
for *200GB*, *300GB*, and *Limitsiz*, but nothing
for *100GB*. To understand these values we need to compare them
to the averages of *price* for each level of *limit*.

```
averages <- tapply(internet$price,
                   internet$limit, mean)
averages
```

100GB | 200GB | 300GB | Limitsiz |
---|---|---|---|
141.7 | 163.3 | 189 | 280.1 |

We can immediately see that the *intercept* value is equal to
the average for *100GB*, that is, the missing level. With a little
arithmetic, we can notice that the other coefficients are the
*difference* between each *level average* and the
*intercept*.

\[ \begin{aligned} 163.3333333 & = 141.6666667 + 21.6666667 \\ 189 & = 141.6666667 + 47.3333333 \\ 280.1428571 & = 141.6666667 + 138.4761905 \\ \end{aligned} \]

In other words, the model coefficients are

- Intercept: average of the “first” category
- Other coefficients: difference between each of the other categories and the first one.

```
averages[2] - averages[1] == coef(model1)[2]
averages[3] - averages[1] == coef(model1)[3]
averages[4] - averages[1] == coef(model1)[4]
```

We have to keep in mind that the *first* level plays a special
role in this kind of model, different from the role of the other
levels.
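Since the first level acts as the reference, R offers `relevel()` to choose which level plays that role. A minimal sketch with an invented three-level factor:

```r
# The reference level is simply the first level of the factor.
limit <- factor(c("100GB", "200GB", "Limitsiz"))
print(levels(limit)[1])   # "100GB" is the reference by default

# relevel() moves another level to the front, changing which
# category the other coefficients are compared against.
limit2 <- relevel(limit, ref = "Limitsiz")
print(levels(limit2)[1])  # now "Limitsiz" is the reference
```

Refitting a model with the releveled factor keeps the same predictions, but expresses the coefficients as differences against the new reference level.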

## How to model *factors* in a linear model

We know from previous articles that linear models can take several independent variables and find the best linear combination to approximate a dependent variable. This idea can be extended to factors. Each factor level is represented by an independent variable that can take only the values 0 and 1.

This set of new variables is called the *model matrix* or
*design matrix*. R builds it automatically. We can see how it is
done in R using the function `model.matrix()` and the same
formula used to describe the linear model. Indeed, this is done
internally when we fit the model, but it will have other applications
later. The R code is this

`design <- model.matrix( ~ limit, data=internet)`

and the resulting model matrix is

`design`

(Intercept) | limit200GB | limit300GB | limitLimitsiz |
---|---|---|---|
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
1 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |

The first column represents the *intercept*, and it is 1 in
all cases. The second column is 1 when *limit* is equal to
*200GB* and 0 otherwise. The third column is 1 only when
*limit* is equal to *300GB*, and the fourth column is 1
only when *limit* is *Limitsiz*.

We can see that there is no column for *100GB*. Instead, the
rows in the model matrix corresponding to *100GB* have 0 in all
columns, except, of course, in the *intercept* column, which
always has 1.
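To make this construction concrete, here is a small sketch (using one invented observation per level, in the same order as the first four rows above) that builds the dummy columns by hand and checks them against `model.matrix()`:

```r
# One observation per level, as in the first four rows of the table.
limit <- factor(c("Limitsiz", "300GB", "200GB", "100GB"))

# One 0/1 column per non-reference level, plus a constant intercept.
design_by_hand <- cbind("(Intercept)" = 1L,
                        sapply(levels(limit)[-1],
                               function(lv) as.integer(limit == lv)))
print(design_by_hand)

# Same 0/1 entries as R's automatic construction.
print(all(design_by_hand == model.matrix(~ limit)))  # TRUE
```

The first (alphabetical) level, *100GB*, gets no column of its own, exactly as described above.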

In Excel we can build the design matrix by first writing the level
names in the columns’ headers (let’s say, in cells `C1` to
`E1`), and then using a formula like

`=IF($B2 = C$1, 1, 0)`

We write this formula in cell `C2` and then copy it to
all cell positions in the matrix. Here we assume that column
`B` has the factor values; the `$` anchors keep the formula
pointing at column `B` and at the header row when it is copied.
This approach can be seen online, in the sheet *model 1*.

# Model with all categories

There is a technical reason^{9} that prevents us from having
all factor levels *and* the intercept at the same time. If we
want to include an explicit value for the *100GB* category, then we
need to omit the *intercept*. Thus, we have a new model, called
*model 2*.

In Excel and Google Sheets, the function `=LINEST()` has
an option “make `B=0`” that really means “do not use an
intercept”. If we use that option, we can include a new column for
the *100GB* category. You can see this model in the sheet
*model 2* of the online spreadsheet.

In R we omit the intercept by writing `+0` in the
linear model formula. Then, *model 2* is described by the formula
`price ~ limit + 0`. We fit *model 2* to the data with
the same R function, as in the following code

`model2 <- lm(price ~ limit + 0, data=internet)`

which gives us these new coefficients

`summary(model2)`

```
Call:
lm(formula = price ~ limit + 0, data = internet)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.143  -7.083  -2.167   6.464  49.857 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
limit100GB     141.667      7.387   19.18 2.40e-14 ***
limit200GB     163.333      7.387   22.11 1.58e-15 ***
limit300GB     189.000      8.092   23.36 5.48e-16 ***
limitLimitsiz  280.143      6.839   40.96  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.09 on 20 degrees of freedom
Multiple R-squared:  0.9935,  Adjusted R-squared:  0.9923
F-statistic: 770.1 on 4 and 20 DF,  p-value: < 2.2e-16
```

This time the linear model’s coefficients correspond one-to-one to
each category’s average. Nevertheless, *model 2* makes exactly the
same predictions as *model 1*, and has the same residual standard
error, so they have the same *explanatory power*. (The printed
*R²* is larger only because R computes it against an uncentered
total sum of squares when there is no intercept.) The real
difference is in the **meaning** of the coefficients^{10}.

The model matrix is easy to find: just one column for each level of
*limit*, with cells filled with 1 when the value is equal to the
column name, and 0 otherwise. In case of doubt, it is easy to get this
matrix in R using the command
`model.matrix(~limit+0, data=internet)`.

## Advantages

A reasonable question here is: *“why don’t we just take the averages
and skip the linear model altogether?”* Using a linear model has some
advantages.

The first advantage of using a linear model is that we can use all
the data to estimate the standard error. If we assume that all
dependent values are affected by the same kind of *noise*, then
the linear model pools the information from every case, instead of
estimating the noise level case by case. If this hypothesis of
*the same noise for all cases* is valid, a linear model is a good
way to estimate it.

The second advantage is that we can evaluate the statistical
significance of coefficients representing *differences*. For
example, in *model 1* we saw that the average for *200GB*
was about 21.7₺ higher than the average for *100GB*, but that
difference is not statistically significant: the 𝑝-value was more than
5%.
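We can verify this with `confint()`. The sketch below is self-contained: it rebuilds the filtered data from the table above into a data frame called `plans` (an invented name, mirroring `internet`) and refits *model 1* as `model1b`:

```r
# Rebuild the filtered data (speed >= 8) from the table above.
plans <- data.frame(
  speed = c(rep(c(100, 50, 35, 24, 16), each = 4), 12, 8, 8, 8),
  limit = factor(c(rep(c("Limitsiz", "300GB", "200GB", "100GB"), 5),
                   "Limitsiz", "Limitsiz", "200GB", "100GB")),
  price = c(330, 200, 176, 157, 310, 190, 166, 147,
            290, 187, 163, 142, 270, 185, 161, 137,
            255, 183, 159, 135, 254, 252, 155, 132))

model1b <- lm(price ~ limit, data = plans)

# The 95% interval for limit200GB (barely) straddles zero, matching
# its p-value of about 0.051: the 200GB-vs-100GB difference is not
# significant at the 5% level.
print(confint(model1b)["limit200GB", ])
```

A simple table of averages cannot give us this kind of statement about the differences.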

Another advantage of using a linear model for factors is that they can be easily combined with regular numeric independent variables.

# Combined model

So far we have *model 0*, with a small increasing slope for
*speed*, and *models 1* and *2*, with a different
constant for each level of *limit*. We can combine them by
declaring that the *price* depends on *speed* plus
*limit*, without *(Intercept)*. Thus, *model 3*
is represented by the formula
`price ~ speed + limit + 0`.

The design matrix will contain one column for *speed* and
several columns for *limit*. You can see how to do it with Excel
on the sheet *model 3* online. In R we just write the formula and
we get the fitted values

`model3 <- lm(price ~ speed + limit + 0, data=internet)`

`summary(model3)`

```
Call:
lm(formula = price ~ speed + limit + 0, data = internet)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.079  -6.763   1.795   4.793  23.491 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
speed           0.42444    0.07969   5.326 3.85e-05 ***
limit100GB    125.18437    5.71069  21.921 5.97e-15 ***
limit200GB    146.85104    5.71069  25.715 3.16e-16 ***
limit300GB    169.90034    6.36411  26.697  < 2e-16 ***
limitLimitsiz 265.28757    5.24632  50.566  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.76 on 19 degrees of freedom
Multiple R-squared:  0.9974,  Adjusted R-squared:  0.9967
F-statistic: 1465 on 5 and 19 DF,  p-value: < 2.2e-16
```

This time the coefficient for *speed* is 0.42 ₺/Mbps, a value not
too different from the 0.31 ₺/Mbps found in *model 0*. The
difference is in the 𝑝-value. In *model 0* the coefficient was
not significantly different from 0. In *model 3* the difference
is statistically significant. Moreover, the confidence interval for
the `speed` coefficient is narrower in *model 3*, and does
not include 0.

`confint(model0)["speed",]`

```
2.5 % 97.5 %
-0.5268568 1.1397308
```

`confint(model3)["speed",]`

```
2.5 % 97.5 %
0.2576540 0.5912197
```

The coefficients for each level of *limit* are smaller than in
*model 2*, since we discount the contribution of *speed*.
Their standard error is smaller, so we get narrower confidence
intervals. It is a win-win situation. We can see this in the following
plot.

```
plot(price~speed, data=internet, pch=symbol,
     main="Model 3")
par(col=red, lwd=2)
points(predict(model3)~speed, data=internet,
       type="o", subset=limit=="100GB")
points(predict(model3)~speed, data=internet,
       type="o", subset=limit=="200GB")
points(predict(model3)~speed, data=internet,
       type="o", subset=limit=="300GB")
points(predict(model3)~speed, data=internet,
       type="o", subset=limit=="Limitsiz")
```

The *speed* coefficient shows the average slope, discounting
any influence from the *limit* variable. It is clear that, in
this model, the main factor affecting the *price* is the
*limit*, since the *speed* contribution is only 0.42₺ for
each Mbps.

This model assumes that the increase of *price* when
*speed* increases is the same for each level of *limit*.
This hypothesis should be tested.

# Model with independent slopes

Finally, we can build a model with a different *speed* slope
for each level of *limit*. In other words, we need to specify
that the *speed* coefficient should **interact**
with *limit*. We can do that by including an interaction term. In
R we specify this interaction with the formula
`price ~ speed:limit + limit + 0`. This literally means
“*price* depends on *speed* combined with *limit*,
plus something depending on the *limit*, and no *intercept*”.
This will be *model 4*.

In Excel we need to insert more columns, one for each level of
*limit*. Then the cells contain the value of *speed*, but
only in the column corresponding to the value of *limit*. This
can be easily done by multiplying each *speed* value with the
columns representing *limit*. You can see how this is done in
sheet *model 4* of the online document. Alternatively, you can
use the `model.matrix()` function of R, with the corresponding
formula.

We fit *model 4* to the data as usual,

`model4 <- lm(price ~ speed:limit + limit + 0, data=internet)`

and we get the following results

`summary(model4)`

```
Call:
lm(formula = price ~ speed:limit + limit + 0, data = internet)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3569 -1.2862 -0.0709  0.5093 16.1924 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
limit100GB          131.10500    3.92435  33.408 3.15e-16 ***
limit200GB          154.99689    3.92435  39.496  < 2e-16 ***
limit300GB          179.98375    4.80998  37.419  < 2e-16 ***
limitLimitsiz       248.25837    3.46430  71.662  < 2e-16 ***
speed:limit100GB      0.27197    0.07950   3.421   0.0035 ** 
speed:limit200GB      0.21467    0.07950   2.700   0.0158 *  
speed:limit300GB      0.20036    0.08914   2.248   0.0391 *  
speed:limitLimitsiz   0.91099    0.07543  12.077 1.87e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.935 on 16 degrees of freedom
Multiple R-squared:  0.9994,  Adjusted R-squared:  0.9992
F-statistic: 3600 on 8 and 16 DF,  p-value: < 2.2e-16
```

As expected, *R²* is better than in the previous models. The
coefficients for each level of *limit* have narrower intervals.
Most interestingly, the *speed* slope for *“Limitsiz”* is
much higher than the slopes for the other levels of *limit*.
Therefore the hypothesis used in *model 3* was not valid.
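This conclusion can also be checked with a formal test: *model 3* is nested inside *model 4*, so `anova()` compares them with an F test. A self-contained sketch, rebuilding the filtered data from the table above into an invented `plans` data frame:

```r
# Rebuild the filtered data (speed >= 8) from the table above.
plans <- data.frame(
  speed = c(rep(c(100, 50, 35, 24, 16), each = 4), 12, 8, 8, 8),
  limit = factor(c(rep(c("Limitsiz", "300GB", "200GB", "100GB"), 5),
                   "Limitsiz", "Limitsiz", "200GB", "100GB")),
  price = c(330, 200, 176, 157, 310, 190, 166, 147,
            290, 187, 163, 142, 270, 185, 161, 137,
            255, 183, 159, 135, 254, 252, 155, 132))

m3 <- lm(price ~ speed + limit + 0, data = plans)
m4 <- lm(price ~ speed:limit + limit + 0, data = plans)

# A small p-value rejects the equal-slopes hypothesis of model 3.
print(anova(m3, m4))
</imports>
```

The extra per-level slopes of *model 4* are well worth their three additional degrees of freedom.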

The plot of this model is this.

```
plot(price~speed, internet, pch=symbol,
main="Model 4")
par(col=red, lwd=2)
points(predict(model4)~speed, data=internet,
type="o", subset=limit=="100GB")
points(predict(model4)~speed, data=internet,
type="o", subset=limit=="200GB")
points(predict(model4)~speed, data=internet,
type="o", subset=limit=="300GB")
points(predict(model4)~speed, data=internet,
type="o", subset=limit=="Limitsiz")
```

In theory we can find the same results if we do independent models
for each level of *limit*. The advantage of the combined model is
to use more samples to estimate the standard error, and therefore find
narrower confidence intervals for each coefficient and prediction. The
big assumption here is that the “noise” source is the same for each
case.
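As a quick check of that claim, fitting only the *Limitsiz* rows (copied from the table above into an invented `limitsiz` data frame) reproduces the corresponding *model 4* coefficients:

```r
# Only the Limitsiz rows from the table above.
limitsiz <- data.frame(speed = c(100, 50, 35, 24, 16, 12, 8),
                       price = c(330, 310, 290, 270, 255, 254, 252))

solo <- lm(price ~ speed, data = limitsiz)

# Same estimates as limitLimitsiz (248.258) and speed:limitLimitsiz
# (0.911) in model 4, but with only 5 residual degrees of freedom
# instead of 16, hence wider confidence intervals.
print(round(coef(solo), 3))
```

The point estimates agree exactly; what the combined model buys us is a better (pooled) noise estimate.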

## Exercise

To simplify the explanation, here we modeled the raw values.
Nevertheless, it seems that using a log-log scale produces better
alignments. Indeed, speed grows exponentially in our data, therefore
it may be better to use `log(speed)` as the independent variable.
Naturally, we cannot take the logarithm of a factor, but we can
consider `log(price)`.

Your challenge is to repeat this analysis using `log(speed)` and
`limit` to predict `log(price)`. You can choose to skip
*model 1* or *model 2*, but not both. Then, and this is the
important part, you should write a report explaining the
*meaning* of the results. The mathematics of this exercise is
easy, and you can copy-and-paste the R code. The real value, the
*science*, is in the interpretation of the results.

1. Raw data is an ingredient, like salt and oil. Knowledge is the prepared dish that combines all relevant data into a delicious/relevant dish.
2. Remember that in a *ratio scale* we can say *a* is twice *b*, or *b* is half of *a*. In contrast, other scales can be *interval* based, *ordinal*, or *nominal*.
3. The tidying up process is interesting. I may make a video later.
4. You may decide to use `package` as an independent variable in other models.
5. As we will see, R linear models can handle numeric or factor variables, but not text variables.
6. The model is given by the formula, since it describes the relationship among the variables. Then we *fit* the model to the data, and we find the *coefficients*. You can find more details at Regression Analysis Tutorial and Examples (The Minitab Blog).
7. A good explanation of *R²* and *Adjusted R²* can be found at Wikipedia.
8. If we think a little, we can see that increasing the number of independent variables should always increase the coefficient of determination *R²*, since we have more *degrees of freedom*. That is why it is better to look at the *Adjusted R²*, which balances this effect.
9. The reason is that the columns must be linearly independent.
10. **This is important.** The value of the model is not always in the *numeric prediction*. The value can be in the *qualitative interpretation* of the coefficients.