Class 18: Practice on tidying up data

Computing in Molecular Biology and Genetics 1

Andrés Aravena, PhD

23 November 2020

Right now you should be able to

  • Write a structured document in Markdown
  • Combine R and Markdown
  • Use vectors as indices of vectors
  • Read data from text files
    • Using readr
  • Select single columns as vectors
  • Install new R packages
  • Use filter(), select(), and %>%

Today’s Goal

Tidy up survey data

Answer some interesting questions

We load the data

library(readr)
students <- read_tsv("students2018-2020.tsv")
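
A minimal sketch of what read_tsv() does, assuming a small hypothetical file written to a temporary path (the real students2018-2020.tsv is not reproduced here). The column names mimic the survey, but the values are invented for illustration.

```r
library(readr)

# Hypothetical toy file, written to a temporary path for illustration only
toy_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "answer_date\thandness\theight_cm",
  "2020-10-19\tLeft\t162",
  "2019-10-01\tRight\t175"
), toy_path)

# Declare the date column explicitly; the other column types are guessed
toy <- read_tsv(toy_path, col_types = cols(answer_date = col_date()))

# Show the class of each column
sapply(toy, class)
```

Declaring answer_date as a date is a choice made for this sketch; it makes the later comparison with a cutoff date unambiguous.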

Filtering and selecting

Who are the left-handed students this semester?

To answer this question we need to check the answer date

There is an old way, which you may know if you took this course before. This method uses indices.

students[students$handness=="Left" & students$answer_date > "2020-01-01", ]
# A tibble: 5 x 10
  answer_date id    english_level sex   birthdate  birthplace height_cm
  <date>      <chr> <chr>         <chr> <date>     <chr>          <dbl>
1 2020-10-19  242b… I can unders… Fema… 2001-11-01 İstanbul,…    162   
2 2020-10-19  5012… I can read a… Male  1999-10-29 Bodrum/Mu…    180   
3 2020-10-19  52b1… I can unders… Fema… 2000-12-06 Ordu/Turk…      1.63
4 2020-10-22  412e… I can unders… Fema… 1999-05-02 Turkey        168   
5 2020-11-05  242b… I can unders… Fema… 2001-11-01 İstanbul/…    162   
# … with 3 more variables: weight_kg <dbl>, handness <chr>, hand_span <dbl>

Avoid repeating the name

In the “old” way we had to write students many times

With filter() we do not need to repeat the name

library(dplyr)
filter(students, handness=="Left" & answer_date > "2020-01-01")
# A tibble: 5 x 10
  answer_date id    english_level sex   birthdate  birthplace height_cm
  <date>      <chr> <chr>         <chr> <date>     <chr>          <dbl>
1 2020-10-19  242b… I can unders… Fema… 2001-11-01 İstanbul,…    162   
2 2020-10-19  5012… I can read a… Male  1999-10-29 Bodrum/Mu…    180   
3 2020-10-19  52b1… I can unders… Fema… 2000-12-06 Ordu/Turk…      1.63
4 2020-10-22  412e… I can unders… Fema… 1999-05-02 Turkey        168   
5 2020-11-05  242b… I can unders… Fema… 2001-11-01 İstanbul/…    162   
# … with 3 more variables: weight_kg <dbl>, handness <chr>, hand_span <dbl>
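
The two approaches can be compared side by side. This is a sketch on a hypothetical toy table (toy, with invented values), not the real survey, and it converts the cutoff with as.Date() to make the date comparison explicit.

```r
library(dplyr)

# Hypothetical toy data, not the real survey file
toy <- tibble(
  id          = c("a1", "b2", "c3", "d4"),
  handness    = c("Left", "Right", "Left", "Left"),
  answer_date = as.Date(c("2020-10-19", "2020-10-19", "2019-10-01", "2020-11-05"))
)

cutoff <- as.Date("2020-01-01")

# Old way: the table name is written three times
old_way <- toy[toy$handness == "Left" & toy$answer_date > cutoff, ]

# With filter(): column names are looked up inside the table
new_way <- filter(toy, handness == "Left" & answer_date > cutoff)

# Both keep the same rows
old_way$id
new_way$id
```

One difference to keep in mind: when a condition evaluates to NA, indexing with brackets returns a row full of NA, while filter() drops that row.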

Comma means AND

Now we can use several conditions, separated by commas

filter(students, handness=="Left", answer_date > "2020-01-01")
# A tibble: 5 x 10
  answer_date id    english_level sex   birthdate  birthplace height_cm
  <date>      <chr> <chr>         <chr> <date>     <chr>          <dbl>
1 2020-10-19  242b… I can unders… Fema… 2001-11-01 İstanbul,…    162   
2 2020-10-19  5012… I can read a… Male  1999-10-29 Bodrum/Mu…    180   
3 2020-10-19  52b1… I can unders… Fema… 2000-12-06 Ordu/Turk…      1.63
4 2020-10-22  412e… I can unders… Fema… 1999-05-02 Turkey        168   
5 2020-11-05  242b… I can unders… Fema… 2001-11-01 İstanbul/…    162   
# … with 3 more variables: weight_kg <dbl>, handness <chr>, hand_span <dbl>
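
A small check that a comma and & select the same rows, using a hypothetical toy table with invented values. It also shows that OR has no such shortcut and must be written with | inside a single condition.

```r
library(dplyr)

# Hypothetical toy data, not the real survey file
toy <- tibble(
  id          = c("a1", "b2", "c3", "d4", "e5"),
  handness    = c("Left", "Right", "Left", "Left", "Right"),
  answer_date = as.Date(c("2020-10-19", "2020-10-19", "2019-10-01",
                          "2020-11-05", "2018-10-01"))
)

cutoff <- as.Date("2020-01-01")

# Two ways of writing AND
with_and   <- filter(toy, handness == "Left" & answer_date > cutoff)
with_comma <- filter(toy, handness == "Left", answer_date > cutoff)

# OR must be written explicitly with |
either <- filter(toy, handness == "Left" | answer_date > cutoff)

with_and$id
with_comma$id
either$id
```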

Using pipes

students %>% filter(handness=="Left", answer_date > "2020-01-01")
# A tibble: 5 x 10
  answer_date id    english_level sex   birthdate  birthplace height_cm
  <date>      <chr> <chr>         <chr> <date>     <chr>          <dbl>
1 2020-10-19  242b… I can unders… Fema… 2001-11-01 İstanbul,…    162   
2 2020-10-19  5012… I can read a… Male  1999-10-29 Bodrum/Mu…    180   
3 2020-10-19  52b1… I can unders… Fema… 2000-12-06 Ordu/Turk…      1.63
4 2020-10-22  412e… I can unders… Fema… 1999-05-02 Turkey        168   
5 2020-11-05  242b… I can unders… Fema… 2001-11-01 İstanbul/…    162   
# … with 3 more variables: weight_kg <dbl>, handness <chr>, hand_span <dbl>
students %>% filter(handness=="Left") %>% filter(answer_date > "2020-01-01")
# A tibble: 5 x 10
  answer_date id    english_level sex   birthdate  birthplace height_cm
  <date>      <chr> <chr>         <chr> <date>     <chr>          <dbl>
1 2020-10-19  242b… I can unders… Fema… 2001-11-01 İstanbul,…    162   
2 2020-10-19  5012… I can read a… Male  1999-10-29 Bodrum/Mu…    180   
3 2020-10-19  52b1… I can unders… Fema… 2000-12-06 Ordu/Turk…      1.63
4 2020-10-22  412e… I can unders… Fema… 1999-05-02 Turkey        168   
5 2020-11-05  242b… I can unders… Fema… 2001-11-01 İstanbul/…    162   
# … with 3 more variables: weight_kg <dbl>, handness <chr>, hand_span <dbl>
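
The pipe passes its left-hand side as the first argument of the function on its right, so x %>% f(y) means f(x, y). The sketch below, on a hypothetical toy table with invented values, compares a nested call with the piped version of the same steps.

```r
library(dplyr)

# Hypothetical toy data, not the real survey file
toy <- tibble(
  id          = c("a1", "b2", "c3", "d4"),
  handness    = c("Left", "Right", "Left", "Left"),
  answer_date = as.Date(c("2020-10-19", "2020-10-19", "2019-10-01", "2020-11-05"))
)

cutoff <- as.Date("2020-01-01")

# Nested calls are read from the inside out
nested <- filter(filter(toy, handness == "Left"), answer_date > cutoff)

# Piped calls are read from top to bottom, one step per line
piped <- toy %>%
  filter(handness == "Left") %>%
  filter(answer_date > cutoff)

nested$id
piped$id
```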

Choose columns

students[c("weight_kg", "height_cm")]
select(students, weight_kg, height_cm)
students %>% select(weight_kg, height_cm)
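
The three forms above return the same two columns. A sketch on a hypothetical toy table (invented values) that checks this, and that also shows select() dropping a column with a minus sign:

```r
library(dplyr)

# Hypothetical toy data, not the real survey file
toy <- tibble(
  id        = c("a1", "b2", "c3"),
  weight_kg = c(55, 80, 60),
  height_cm = c(162, 180, 168)
)

# Three ways to keep the same two columns
by_index  <- toy[c("weight_kg", "height_cm")]
by_select <- select(toy, weight_kg, height_cm)
by_pipe   <- toy %>% select(weight_kg, height_cm)

names(by_index)
names(by_select)
names(by_pipe)

# A minus sign keeps everything except the named column
no_id <- toy %>% select(-id)
names(no_id)
```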

Did you attend “Introduction to Computer Science”?

In that course we used the UNIX command line to process data

We did something like

cat survey.txt | grep "Left" | cut -f 6,7 | sort > answer.txt

Here we follow the same philosophy
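
As a rough sketch, the UNIX pipeline can be mirrored step by step with dplyr. The table below is hypothetical toy data, and the choice of birthplace and height_cm as "fields 6 and 7" is an assumption based on the column order printed earlier, not on the actual survey.txt.

```r
library(dplyr)

# Hypothetical toy data, not the real survey file
toy <- tibble(
  id         = c("a1", "b2", "c3", "d4"),
  birthplace = c("Ordu", "Bodrum", "Izmir", "Ankara"),
  height_cm  = c(163, 180, 170, 168),
  handness   = c("Left", "Left", "Right", "Left")
)

answer <- toy %>%                     # cat survey.txt
  filter(handness == "Left") %>%      # grep "Left"
  select(birthplace, height_cm) %>%   # cut -f 6,7 (assumed columns)
  arrange(birthplace)                 # sort

answer
```

The match is not exact: grep finds "Left" anywhere in a line, while filter() tests only the handness column.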

Combine different tools with pipes

  • arrange() (sort)
  • mutate()
  • summarize()
  • group_by()
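
A short sketch of these four tools on a hypothetical toy table with invented values. The rule that a height below 3 was typed in metres is an assumption made for this example, suggested by the 1.63 that appears in the height_cm column of the survey output.

```r
library(dplyr)

# Hypothetical toy data, not the real survey file
toy <- tibble(
  sex       = c("Female", "Male", "Female", "Female"),
  handness  = c("Left", "Left", "Left", "Right"),
  height_cm = c(162, 180, 1.63, 168)
)

# mutate(): create or overwrite a column
# Assumption: values below 3 are metres and are converted to centimetres
fixed <- toy %>%
  mutate(height_cm = if_else(height_cm < 3, height_cm * 100, height_cm))

# arrange(): sort the rows by a column
sorted <- fixed %>% arrange(height_cm)

# group_by() and summarize(): one result row per group
by_sex <- fixed %>%
  group_by(sex) %>%
  summarize(n = n(), mean_height = mean(height_cm))

sorted
by_sex
```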