Please download the answer file and edit it in RStudio. Write your student number in the correct place at the beginning of the answer file. You should be able to Knit HTML and get the same results as in the printed document. Knit often and verify that your document has no errors. If your document does not Knit, you will not get full marks.

When you finish, send the answers.Rmd file to my mailbox (andres.aravena+cmb@istanbul.edu.tr). Be sure to use the correct email address and send only one file.

IMPORTANT: Write your student number in the correct place at the beginning of the answer file.

Tidy up raw data

This week we will continue our work with the student data. Let’s start by downloading the data file from http://www.dry-lab.org/static/2020/cmb1/students2018-2020.tsv and storing it in our project folder.

Then we load the data into our R session using the following command.

library(readr)
students <- read_tsv("students2018-2020.tsv")

In class 18 we saw that the same city is written in different ways. This is a problem because each spelling is counted separately, so the totals per city are hard to gather. After some practice, we found that taking the first 3 letters of each city is enough to solve most of the cases. We also learned to use the function toupper() to change the letter case, so that lower-case and upper-case spellings compare as equal.
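As a small illustration of this idea, the sketch below applies substr() and toupper() to a made-up vector (`toy_cities` is not the survey data). Three different spellings collapse to the same prefix, and missing values stay missing. Letters such as "i" and "İ" may not fold to the same upper-case letter on every system, so some spellings may still need their own rule.

```r
# Toy vector, invented for illustration only
toy_cities <- c("van", "Van", "VAN/Turkey", NA)

# First three letters, converted to upper case
toy_prefix <- toupper(substr(toy_cities, start = 1, stop = 3))
print(toy_prefix)
```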

Following this strategy, we create two auxiliary vectors to simplify our work.

valid_value <- !is.na(students$birthplace)
first_3_letters <- toupper(substr(students$birthplace, start = 1, stop = 3))

Now we can correct each city one by one, with a command like this:

students$birthplace[valid_value & first_3_letters == "VAN"] <- "Van/Turkey"
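To see what this kind of command does, here is a minimal sketch on a made-up vector. The names `toy_cities`, `toy_valid` and `toy_prefix` are hypothetical and are used only so the example does not overwrite the vectors of the real session.

```r
# Toy vector, invented for illustration only
toy_cities <- c("van", "Van ", "VAN", NA, "Ankara")

# Same two auxiliary vectors as in the text, built for the toy data
toy_valid <- !is.na(toy_cities)
toy_prefix <- toupper(substr(toy_cities, start = 1, stop = 3))

# Replace every spelling that starts with "VAN"; NA and other cities are untouched
toy_cities[toy_valid & toy_prefix == "VAN"] <- "Van/Turkey"
print(toy_cities)
```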

We can check the partial result using table():

table(substr(students$birthplace, start = 1, stop = 3))

Ada Adı Afy Ale Alm Ank Ant Ayd Aze Bal Bat Bod Bur Cit Çor Edi Edr gaz Han Hat 
  1   1   3   2   2   3   3   1   1   1   1   1   3   1   1   1   1   1   1   2 
ist Ist İst izm İzm Kah Kır Kon Mal Man Mar Mer Muğ Nak Ord OSM Saf Sam Siv Süm 
  6  11  17   1   2   1   1   1   2   1   1   1   1   1   2   1   1   2   3   1 
Sur tek Tek Tun tur Tur Tür TÜr Van Yal YAL Yıl 
  2   1   3   1   2   4   2   1   3   1   2   2 
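When checking partial results, it may help to remember that table() leaves out missing values by default. The sketch below, on a made-up vector, compares the default behaviour with the `useNA = "ifany"` option, which adds a column for NA when there are any.

```r
# Toy vector, invented for illustration only
toy_cities <- c("Van/Turkey", "Van/Turkey", "ank", NA)

# Default: NA is not counted
counts_default <- table(toy_cities)

# With useNA = "ifany", missing values get their own column
counts_with_na <- table(toy_cities, useNA = "ifany")

print(counts_default)
print(counts_with_na)
```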

Complete the data tidying up

Write the code to clean up the survey data. Not all cases can be solved with this strategy since some values start with "Turkey". We solve these in the next part.

All values should have the form "City/Country". Use no spaces and no commas, and write the names in English. If the city is not specified, write "-/Country". If the country is not specified, find it from the city name.
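One possible way to check your progress is to test each value against the target format. The sketch below defines a hypothetical helper, `is_tidy()`, that accepts exactly one "/" with no spaces or commas on either side. It is only a format check under these assumptions: it cannot tell whether a name is in English or whether the country is right.

```r
# Hypothetical helper: TRUE when a value looks like "City/Country"
is_tidy <- function(x) {
  # one or more characters that are not "/", space or comma,
  # then a single "/", then the same kind of characters again
  !is.na(x) & grepl("^[^/ ,]+/[^/ ,]+$", x)
}

# Toy values, invented for illustration only
toy_values <- c("Van/Turkey", "-/Turkey", "Istanbul, Turkey", "Ankara", NA)
toy_check <- is_tidy(toy_values)
print(toy_check)
```

Values where the check is FALSE are the ones that still need attention.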

# Write here

Solve the remaining cases

When you finish the previous question, you will still have a few cases that cannot be solved by only looking at the first three letters. Please solve these cases now.

# Write here

Your code should work for the current data and also for new data that may come in the future.
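One way to think about code that keeps working on new data is to write each rule in terms of a pattern instead of a list of exact spellings. The sketch below wraps the prefix rule from the first part in a hypothetical function, `fix_prefix()`, and applies it to made-up values. Applying the rule a second time changes nothing, which is a useful property when the data grows.

```r
# Hypothetical helper: replace every value whose upper-cased start equals `prefix`
fix_prefix <- function(x, prefix, replacement) {
  hit <- !is.na(x) & toupper(substr(x, 1, nchar(prefix))) == prefix
  x[hit] <- replacement
  x
}

# Toy "new data" with spellings not seen before, invented for illustration
toy_new <- c("VAN", "van turkey", NA, "Ankara")

once <- fix_prefix(toy_new, "VAN", "Van/Turkey")
twice <- fix_prefix(once, "VAN", "Van/Turkey")
print(once)
```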

How many people from each city?

Tell the computer to count how many students come from each city.

# Write here
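As a reminder of how counting works once the values are tidy, here is a sketch on a made-up vector (not the survey data). It uses table() to count each distinct value and sort() to put the largest group first.

```r
# Toy vector of already tidy values, invented for illustration only
toy_tidy <- c("Istanbul/Turkey", "Van/Turkey", "Istanbul/Turkey",
              "Ankara/Turkey", "Istanbul/Turkey", "Ankara/Turkey")

# Count each distinct value, largest count first
toy_counts <- sort(table(toy_tidy), decreasing = TRUE)
print(toy_counts)
```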

(bonus) How many people from each country?

# Write here