---
author: "Write your name here"
number: STUDENT_NUMBER
title: "Homework 4"
subtitle: "Computing in Molecular Biology 1 – Molecular Biology and Genetics Department"
description: "Rehearsal for Midterm Exam"
date: "November 25, 2020"
output:
  html_document:
    number_sections: false
    self_contained: false
editor_options:
  chunk_output_type: inline
---
# Tidy up raw data
This week we continue our work with the student data. Let's start by downloading the data file `students2018-2020.tsv`
and storing it in our project folder.
Then we load the data into our R session using the following command.
```{r message=FALSE}
library(readr)
students <- read_tsv("students2018-2020.tsv")
```
In class 18 we saw that the same city is written in different ways. This is a problem, because the count for one city gets split across several spellings.
After some practice, we found that taking the first 3 letters of each city is enough to solve most of the cases.
We also learned to use the function `toupper()` to convert everything to upper case, so that comparisons do not depend on how each student capitalized the name.
Following this strategy, we create two auxiliary vectors to simplify our work.
```{r}
valid_value <- !is.na(students$birthplace)
first_3_letters <- toupper(substr(students$birthplace, start = 1, stop=3))
```
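To see why these two vectors help, here is a small sketch on a made-up vector. The name `example_cities` and its values are assumptions for illustration only; they are not taken from the survey file.

```{r}
# Hypothetical spellings of one city, plus a missing value
example_cities <- c("Van", "van", "VAN/Turkey", "Van, Turkey", NA)

# Same recipe as above: first three letters, in upper case
example_prefix <- toupper(substr(example_cities, start = 1, stop = 3))
example_prefix

# The comparison gives NA for the missing value...
example_prefix == "VAN"

# ...and combining it with !is.na() turns that NA into FALSE
!is.na(example_cities) & example_prefix == "VAN"
```

All four spellings collapse to the same prefix, and the missing value is never selected.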
Now we can correct each city one by one, with a command like this:
```{r}
students$birthplace[ valid_value & first_3_letters =="VAN"] <- "Van/Turkey"
```
We can check the partial result using `table()`:
```{r}
table(substr(students$birthplace, start = 1, stop=3))
```
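When checking, it may also help to convert the prefixes to upper case and to count the missing values. The following is a sketch on made-up data (`example_birthplace` is a hypothetical vector); `useNA = "ifany"` is a standard argument of `table()` that adds a column for `NA` when there are any.

```{r}
# Hypothetical mix of corrected values, raw values and a missing value
example_birthplace <- c("Van/Turkey", "Van/Turkey", "ist", "Istanbul", NA)

example_counts <- table(
  toupper(substr(example_birthplace, start = 1, stop = 3)),
  useNA = "ifany"
)
example_counts
```

Prefixes that appear more than once under different spellings, or a large `NA` count, show where more cleaning is needed.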
## Complete the data tidying up
Write the code to clean up the survey data.
Not all cases can be solved with this strategy since some values start with `"Turkey"`. We solve these in the next part.
All values should have the form `"City/Country"`, with no spaces and no commas, and with names written in English.
If the city is not specified, write `"-/Country"`. If the country is not specified, find it from the city name.
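One possible way to check your progress is to test every value against the target format. This is only a sketch: the vector `example_values` is made up, and the regular expression is one assumed reading of the rules above (exactly one `/`, with no spaces or commas on either side).

```{r}
# Hypothetical values, some of them not yet in "City/Country" form
example_values <- c("Van/Turkey", "-/Turkey", "Van, Turkey",
                    "Van / Turkey", "Van", NA)

# TRUE when the value looks like City/Country; grepl() gives FALSE for NA
format_ok <- grepl("^[^ ,/]+/[^ ,/]+$", example_values)

# Values that still need attention (missing values are listed too)
example_values[!format_ok]
```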
```{r q1}
# Write here
```
## Solve the remaining cases
When you finish the previous question, you will still have a few cases that cannot be solved by only looking at the first three letters. Please solve these cases now.
```{r q2}
# Write here
```
Your code should work for the current data and also for new data that may come in the future.
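To make code robust to new data, select the values by a condition on their content, not by their row position. The sketch below uses made-up vectors and a hypothetical helper `fix_by_condition()`; the value `"Turkey/Van"` is invented for illustration and may not occur in the survey.

```{r}
# Two hypothetical data sets with the same problem in different rows
example_old <- c("Turkey/Van", "Van/Turkey")
example_new <- c("Van/Turkey", "Turkey/Van", "Turkey/Van", NA)

# Hypothetical helper: the condition looks at the value itself,
# so it works wherever the problem appears
fix_by_condition <- function(x) {
  x[!is.na(x) & x == "Turkey/Van"] <- "Van/Turkey"
  x
}

fix_by_condition(example_old)
fix_by_condition(example_new)
```

A fix written as `x[1] <- "Van/Turkey"` would work for `example_old` but would change the wrong row in `example_new`.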
## How many people from each city?
Tell the computer to count how many students come from each city.
```{r q3}
# Write here
```
## (bonus) How many people from each country?
```{r q4}
# Write here
```