Today we are going to analyze the relationship between the *population* of each country (the number of people living there), the *gross domestic product per capita* (GDP, the average income of each person in US dollars) and the *life expectancy* (the average length of life, in years). The data was produced by the well-known research institute *Gapminder*, and is in a data frame called `world2007`.

In this document the plots are side by side to save paper. If possible, try to do the same in your answer, but it is not mandatory. All plots have the same contribution to the total grade. Each question’s value is shown in parentheses. There are 15 points in total.

If you do not remember *logarithmic scales*, you can use linear scales instead (for a lower grade), or you can read the manual.

Different countries have very different populations, so they are hard to visualize on a linear scale. It is better to analyze the logarithm (in base 10) of the population. Please draw two histograms in *grey*, each with a main title. The first histogram shows the population; the second one shows `log10()` of the population.

```
par(mfrow=c(1,2)) # this is 1 point
# histogram 3 points, +1 color, +1 main title
hist(world2007$population, col="grey", main="World Pop")
# log 3 points, +1 log10, +1 color, +1 main title
hist(log10(world2007$population), col="grey", main="log10(World Pop)")
```
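To see why the logarithm helps, here is a minimal sketch using made-up populations. The vector `pop` is hypothetical and is not taken from `world2007`. On the linear scale the largest value dwarfs the rest, while `log10()` turns each factor of ten into a difference of one unit.

```r
# Hypothetical populations spanning several orders of magnitude
# (illustrative values, not taken from world2007)
pop <- c(2e5, 3e6, 4e7, 1.3e9)

# Linear scale: the range is dominated by the largest value
range(pop)

# log10 scale: each factor of 10 becomes a difference of 1,
# so the values are spread much more evenly
log_pop <- log10(pop)
round(log_pop, 2)
```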

Please draw three separate boxplots using data from `world2007`. The first must show the *life expectancy* depending on the *continent*, the second shows the *GDP per capita* and the third shows the *population*. Notice that the first two boxplots use linear scale but the last one uses a logarithmic scale.

```
par(mfrow=c(1,3)) # 1 point
# plot 3 points, +2 main title
plot(life_exp ~ cont, data=world2007, main="Life Expectancy")
# plot 3 points, +2 main title
plot(gdp_cap ~ cont, data=world2007, main="GDP per capita")
# plot 3 points, +2 main title +2 log scale
plot(population ~ cont, data=world2007, main="Population", log="y") # log scale
```

Please plot the following graphic, showing the relationship between *continent* and *life expectancy*.

`plot(cont ~ life_exp, data=world2007, main="Continent v/s Life Expectancy")`

Please prepare three plots showing *life expectancy* depending on *GDP per capita*. The first plot uses linear scale. The second and third plots use log-log scale. All use the option `pch=21` for the plot character. For the last plot, the symbol size (`cex=`) should change with the population. Please create a new column called `world2007$lpop` with the value `log10(world2007$population)-5` and use it to change the symbol size on the third plot.

```
par(mfrow=c(1,3)) # 1 point
# plot 1 point, +1 pch=21
plot(life_exp ~ gdp_cap, data=world2007, pch=21) # plot:4 pch:1
# plot 2 points, +3 log scale, +1 pch=21
plot(life_exp ~ gdp_cap, data=world2007, pch=21, log="xy")
# assignment 2 points
world2007$lpop <- log10(world2007$population)-5
# plot 1 point, +1 pch=21, +1 cex=lpop, +2 log scale
plot(life_exp ~ gdp_cap, data=world2007, pch=21, log="xy", cex=lpop)
```
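As a rough illustration of what `lpop` does to the symbol size, the sketch below applies the same transformation to a few made-up populations. The vector `pop` is hypothetical. The size grows by one unit for every factor of ten in population, starting from zero at 100,000 inhabitants.

```r
# Hypothetical populations (illustrative values, not taken from world2007)
pop <- c(1e5, 1e6, 1e7, 1e8, 1e9)

# Same transformation used for world2007$lpop
lpop <- log10(pop) - 5

# One row per population, with the value that would be passed to cex=
data.frame(population = pop, cex = lpop)
```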

Now we will do the final graphic including a linear model. Please create a linear model, called `model`, to find the relationship between the *life expectancy* and the *GDP per capita* in the logarithmic scale. Then plot *life expectancy* depending on *GDP per capita* with logarithmic scales, using character size equal to `lpop` and *background color* (`bg=`) depending on the continent. After the plot add points with the prediction of the model using `pch=23` and background color “yellow”. Do not forget the legend, title and labels.

```
# any model 2 points, +1 log(life_exp), +1 log(gdp_cap)
model <- lm(log(life_exp) ~ log(gdp_cap), data=world2007)
# +1 bg=cont, +1 pch=21, +1 main, +1 xlab, +1 ylab
plot(life_exp ~ gdp_cap, data=world2007, log="xy", cex=lpop,
pch=21, bg=cont, main="The world at 2007", xlab="GDP per capita",
ylab="Life Expectancy")
# +1 points, +1 predict(...), +1 exp(...), +1 bg="yellow", +1 pch=23
points(exp(predict(model))~gdp_cap, data=world2007, bg="yellow", pch=23)
# +1 legend(...), +1 "topleft", +1 legend=, +1 fill
legend("topleft", legend=levels(world2007$cont), fill=1:5)
```

Since we have a model for the average *life expectancy* depending on *GDP per capita*, we can use it to predict. Please create a data frame called `yeni_data` with a column called `gdp_cap`. The values of `gdp_cap` are 1000, 2000, 5000, 10000, 20000. Use `model` and `yeni_data` to predict the life expectancy. Put the predicted value in `yeni_data$life_exp` and show the table. **Hint:** sometimes you need to use `exp()`, sometimes you do not.

```
# +1 yeni_data <- data.frame(...), +1 gdp_cap=c(...)
yeni_data <- data.frame(gdp_cap=c(1000, 2000, 5000, 10000, 20000))
# +1 assignment, +1 predict(...) +1 newdata=, +1 exp(...)
yeni_data$life_exp <- exp(predict(model, newdata=yeni_data))
# +1 showing yeni_data, +1 knitr::kable()
knitr::kable(yeni_data)
```

gdp_cap | life_exp |
---|---|
1000 | 54.21237 |
2000 | 58.64676 |
5000 | 65.07019 |
10000 | 70.39271 |
20000 | 76.15059 |

`# max score=6 `
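Because `model` is fitted on the logarithm of both variables, its prediction on the original scale has the form `exp(a) * gdp_cap^b`, so doubling the GDP per capita should multiply the predicted life expectancy by the same factor at any income level. The sketch below checks this against the predicted values in the table above. The numbers are copied from that table and the variable names are only for illustration.

```r
# Predicted values copied from the table above
gdp_cap  <- c(1000, 2000, 5000, 10000, 20000)
life_exp <- c(54.21237, 58.64676, 65.07019, 70.39271, 76.15059)

# Ratio of predictions when GDP per capita doubles
r1 <- life_exp[2] / life_exp[1]   # 1000  -> 2000
r2 <- life_exp[5] / life_exp[4]   # 10000 -> 20000
c(r1, r2)

# Slope on the log-log scale, recovered from the first and last rows
b <- diff(log(life_exp[c(1, 5)])) / diff(log(gdp_cap[c(1, 5)]))
b
```

Both ratios come out at about 1.08, which is what a straight line on the log-log scale implies: each doubling of GDP per capita raises the predicted life expectancy by roughly 8%.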

Since continents are so different, we can create a new linear model, called `model2`, with a formula where the logarithm of life expectancy depends on the continent, the interaction of the continent and the logarithm of the GDP per capita, and without a constant “intercept”. Please write the code to create `model2`. Then you can draw again the same plot as in question 5 (copy and paste the code) and add points with the prediction of `model2`.

```
# +2 log(gdp_cap):cont, +2 "+cont", +1 "+0"
model2 <- lm(log(life_exp) ~ log(gdp_cap):cont + cont + 0, data=world2007)
# plot is the same as in question 5, does not give points
plot(life_exp~gdp_cap, data=world2007, log="xy", cex=lpop, pch=21, bg=cont,
main="The world at 2007", xlab="GDP per capita", ylab="Life Expectancy")
points(exp(predict(model2))~gdp_cap, data=world2007, bg="yellow", pch=23)
legend("topleft", legend=levels(world2007$cont), fill=1:5)
```
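The formula `log(gdp_cap):cont + cont + 0` can be read as "a separate straight line for each continent": the `cont` term gives one intercept per level, the interaction gives one slope per level, and `+ 0` removes the shared intercept. A minimal sketch on toy data shows the idea. The data frame `toy` and the names `x`, `y` and `g` are hypothetical and not part of `world2007`.

```r
# Hypothetical toy data: two groups lying exactly on different lines
# (illustrative only; x, y and g are not columns of world2007)
toy <- data.frame(
  x = c(1, 2, 3, 4, 1, 2, 3, 4),
  g = factor(rep(c("A", "B"), each = 4))
)
# Group A follows y = 1 + 2x, group B follows y = 5 - x
toy$y <- ifelse(toy$g == "A", 1 + 2 * toy$x, 5 - 1 * toy$x)

# Same formula shape as model2: one intercept and one slope per group,
# with no shared intercept
fit <- lm(y ~ x:g + g + 0, data = toy)
round(coef(fit), 3)
```

The fit returns four coefficients, two per group, and recovers the intercepts 1 and 5 and the slopes 2 and -1 used to build the toy data.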

```
# Initially this question had 12 points, but half the question is the same
# as the previous one, so we only evaluated the first half. This was better for
# most of the students
```