Class 13: Tibbles and data frames

Computing in Molecular Biology and Genetics 1

Andrés Aravena, PhD

16 November 2020

Interacting with the real world

Science

Experiments produce data

either numeric, logic, or text

Data from experiments

  • The variables are decided before doing the experiment

    • (for example, when we write the quiz questions)
    • The number of variables is fixed during the experiment
  • The observations are found during the experiment

    • We get more and more observations
    • Limited only by time and money

It is hard to add new columns in a text file

But it is very easy to add rows

Therefore we write observations as rows,

and variables as columns

Data is organized in tables

(at least, most of the time)

One observation on each row

One variable on each column
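
A minimal sketch of this idea in R. The table `lab`, its column names and all its values are invented for illustration. A new observation has the same variables as the old ones, so it becomes a new row.

```r
# A small invented table: one observation per row, one variable per column
lab <- data.frame(
  sample = c("a1", "a2", "a3"),      # text
  temp_c = c(36.5, 37.0, 38.2),      # numeric
  grew   = c(TRUE, FALSE, TRUE)      # logic
)

# A new observation arrives: same variables, so it is a new row
new_obs <- data.frame(sample = "a4", temp_c = 36.9, grew = TRUE)
lab <- rbind(lab, new_obs)

nrow(lab)   # number of observations
ncol(lab)   # number of variables
```

The number of rows grows with each observation. The number of columns stays the same.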

Data comes from other programs

  • Data enters the computer from instruments

  • Most modern instruments have digital output

  • In some cases it has to be entered manually

  • This is dangerous: humans make many mistakes

For us, data always comes from another program

Typical data formats

There are several file formats used to store data tables

The most common are

  • Text file with Tab-separated values
  • Text file with Comma-separated values
  • Spreadsheets, like Excel® or Google Sheets®

For now, we work with tab- and comma-separated values
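
To see the difference between the two text formats, here is a small sketch that writes the same invented table in both and reads it back. The data frame `d` and the temporary file names are assumptions made only for this example.

```r
# Invented data, only to have something to write
d <- data.frame(id = c("x1", "x2"), height_cm = c(170, 165))

tsv_file <- tempfile(fileext = ".tsv")
csv_file <- tempfile(fileext = ".csv")

# Same table, two separators
write.table(d, tsv_file, sep = "\t", row.names = FALSE, quote = FALSE)
write.csv(d, csv_file, row.names = FALSE)

readLines(tsv_file)   # columns separated by tab characters
readLines(csv_file)   # columns separated by commas

# read.delim() expects tabs, read.csv() expects commas
from_tsv <- read.delim(tsv_file)
from_csv <- read.csv(csv_file)
```

Both commands give the same data frame. Only the separator in the file is different.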

Example data

Today we will use data from

http://www.dry-lab.org/static/2020/cmb1/students2018-2020.tsv

Take a look at it

What can you say about it?

It is a text file with tab-separated values

Reading students’ data

The classical way to read this data is using

Environment → Import Dataset → From text (base)

which corresponds to the command

survey <- read.delim("students2018-2020.tsv")

(you can load data with the menu or the keyboard)

The result is a data frame

survey
    answer_date     id                              english_level    sex
1    2018-09-17 3e501d                       I can speak fluently   Male
2    2018-09-17 479d88  I can understand movies without subtitles Female
3    2018-09-17 39df0d I can read and understand technical papers Female
4    2018-09-17 d2b091 I can read and understand technical papers   Male
5    2018-09-17 f22b12 I can read and understand technical papers Female
6    2018-09-17 849c75                       İngilizce bilmiyorum Female
7    2018-09-17 83812b                       I can speak fluently Female
8    2018-09-17 b0dde9 I can read and understand technical papers   Male
9    2018-09-17 297223 I can read and understand technical papers Female
10   2018-09-17 72c073 I can read and understand technical papers Female
11   2018-09-17 d29251 I can read and understand technical papers   Male
12   2018-09-17 6f0831 I can read and understand technical papers Female
13   2018-09-17 75b355 I can read and understand technical papers Female
14   2018-09-17 0b0da7 I can read and understand technical papers Female
15   2018-09-17 352b9f I can read and understand technical papers Female
16   2018-09-17 6f28ac I can read and understand technical papers Female
17   2018-09-17 ee5ef4 I can read and understand technical papers Female
18   2018-09-17 ba52ec I can read and understand technical papers   Male
19   2018-09-17 9d98b6 I can read and understand technical papers Female
20   2018-09-17 f92274                       I can speak fluently Female
21   2018-09-17 1c7531 I can read and understand technical papers Female
22   2018-09-17 8c9730  I can understand movies without subtitles   Male
23   2018-09-18 371f15 I can read and understand technical papers Female
24   2018-09-18 52766e I can read and understand technical papers Female
25   2018-09-18 644c22 I can read and understand technical papers Female
26   2018-09-18 df8cf1 I can read and understand technical papers Female
27   2018-09-18 c0bd32  I can understand movies without subtitles Female
28   2018-09-19 ddbc78                       İngilizce bilmiyorum Female
29   2018-09-19 6c394f  I can understand movies without subtitles   Male
30   2018-09-19 9fb139                       İngilizce bilmiyorum Female
31   2018-09-20 70bd4d I can write poetry better than Shakespeare   Male
32   2018-09-20 567104 I can read and understand technical papers Female
33   2018-09-20 b2571a I can read and understand technical papers Female
34   2018-09-20 dcc268 I can read and understand technical papers   Male
35   2018-09-20 ac1b6f  I can understand movies without subtitles   Male
36   2018-09-20 89cd86                       I can speak fluently   Male
37   2018-09-20 ba5f4b I can read and understand technical papers Female
38   2018-09-20 ba5f4b I can read and understand technical papers Female
39   2018-09-21 b45951                       İngilizce bilmiyorum   Male
40   2018-09-21 c6208d I can read and understand technical papers   Male
41   2018-09-23 412ea2  I can understand movies without subtitles Female
42   2018-09-24 b741bc I can read and understand technical papers Female
43   2018-09-24 715173 I can read and understand technical papers Female
44   2018-09-24 bc23db I can read and understand technical papers   Male
45   2018-09-24 e9d1f5 I can read and understand technical papers   Male
46   2018-09-24 08d7a1              English is my native language Female
47   2018-09-24 08d7a1              English is my native language Female
48   2018-09-24 219959  I can understand movies without subtitles Female
49   2018-09-24 383ce5                       İngilizce bilmiyorum Female
50   2018-09-24 7b5198                       I can speak fluently Female
51   2018-09-24 68efdf I can read and understand technical papers Female
52   2018-09-24 7afb3f                       İngilizce bilmiyorum   Male
53   2018-09-24 cbda9b I can read and understand technical papers   Male
54   2018-09-24 3a597c                       I can speak fluently   Male
55   2018-09-24 cd7205 I can read and understand technical papers   Male
56   2018-09-24 dcaf3d  I can understand movies without subtitles   Male
57   2018-09-24 dcaf3d  I can understand movies without subtitles   Male
58   2018-09-29 70de11 I can read and understand technical papers Female
59   2018-10-04 b43e2b I can read and understand technical papers   Male
60   2018-10-06 3b85c4  I can understand movies without subtitles Female
61   2018-10-08 6961a2  I can understand movies without subtitles   Male
62   2018-10-09 0dd83b I can read and understand technical papers   <NA>
63   2018-10-11 213231                       I can speak fluently Female
64   2018-10-11 998d64                       İngilizce bilmiyorum   Male
65   2018-10-15 008c4d  I can understand movies without subtitles   Male
66   2018-11-07 7955ff                       I can speak fluently   Male
67   2018-11-09 a896b2 I can read and understand technical papers Female
68   2019-09-25 b2571a I can read and understand technical papers Female
69   2019-09-27 68a1cf                       İngilizce bilmiyorum Female
70   2019-09-27 dbf5bc I can read and understand technical papers Female
71   2019-09-29 a7ff02                       İngilizce bilmiyorum Female
72   2019-10-01 cbda9b I can read and understand technical papers   Male
73   2019-10-07 3a597c                       I can speak fluently   Male
74   2019-10-09 213231                       I can speak fluently Female
75   2019-10-09 1e2e83  I can understand movies without subtitles   Male
76   2019-10-11 a45fe6                       İngilizce bilmiyorum Female
77   2019-10-14 6961a2  I can understand movies without subtitles   Male
78   2019-10-14 7b5198                       I can speak fluently Female
79   2019-10-14 68efdf I can read and understand technical papers Female
80   2019-10-15 08d7a1              English is my native language Female
81   2020-10-19 70f3de                       I can speak fluently Female
82   2020-10-19 b81bd1  I can understand movies without subtitles Female
83   2020-10-19 692637  I can understand movies without subtitles Female
84   2020-10-19 42c891              English is my native language   Male
85   2020-10-19 242bf7  I can understand movies without subtitles Female
86   2020-10-19 cd7205                       I can speak fluently   Male
87   2020-10-19 f8d60d                       I can speak fluently Female
88   2020-10-19 47e2e0 I can read and understand technical papers Female
89   2020-10-19 50988d I can read and understand technical papers Female
90   2020-10-19 60a92f I can read and understand technical papers Female
91   2020-10-19 432cf7                       I can speak fluently   Male
92   2020-10-19 9bba74 I can read and understand technical papers Female
93   2020-10-19 a7ff02 I can read and understand technical papers Female
94   2020-10-19 5012ed I can read and understand technical papers   Male
95   2020-10-19 91e5e8  I can understand movies without subtitles Female
96   2020-10-19 fe26f8  I can understand movies without subtitles Female
97   2020-10-19 4f5875                       I can speak fluently Female
98   2020-10-19 52b150  I can understand movies without subtitles Female
99   2020-10-21 d29251 I can read and understand technical papers   Male
100  2020-10-21 849c75                       İngilizce bilmiyorum Female
101  2020-10-21 c9a95d I can read and understand technical papers Female
102  2020-10-21 2f4b15 I can read and understand technical papers Female
103  2020-10-22 3fe6b5 I can read and understand technical papers Female
104  2020-10-22 412ea2  I can understand movies without subtitles Female
105  2020-10-23 a45fe6 I can read and understand technical papers Female
106  2020-10-23 287c3a  I can understand movies without subtitles Female
107  2020-10-24 6961a2  I can understand movies without subtitles   Male
108  2020-10-24 6961a2  I can understand movies without subtitles   Male
109  2020-10-26 6e5137                       I can speak fluently Female
110  2020-10-26 3a597c                       I can speak fluently   Male
111  2020-10-26 f5dafd I can read and understand technical papers Female
112  2020-11-05 242bf7  I can understand movies without subtitles Female
113  2020-11-05 91e5e8 I can read and understand technical papers Female
114  2020-11-05 60a92f I can read and understand technical papers Female
115  2020-11-05 b041ba  I can understand movies without subtitles   Male
116  2020-11-06 c9b8b1                       İngilizce bilmiyorum Female
117  2020-11-06 68a1cf I can read and understand technical papers Female
     birthdate             birthplace height_cm weight_kg handness hand_span
1   1993-02-01                 turkey    179.00      67.0    Right      15.0
2   1998-05-21          Kahramanmaraş      1.68      55.0    Right      14.0
3   1998-01-18        Batman, Türkiye        NA        NA    Right      18.0
4   1998-08-29         Antalya,Turkey    170.00      74.0    Right      25.0
5   1998-05-03                  izmir    162.00      68.0    Right      13.0
6   1995-10-09       Türkiye / Yalova    167.00      58.0    Right      18.0
7   1997-09-19        Adıyaman,Turkey    174.00      72.0    Right      16.0
8   1997-11-27                  Bursa    180.00      68.0    Right      19.0
9   1999-01-02       İstanbul/Türkiye    162.00      58.0    Right      19.0
10  1998-10-02        İstanbul,Turkey    172.00      55.0    Right      20.0
11  1997-05-18             VAN/TURKEY    181.00      81.0    Right      20.0
12  1997-12-08                   <NA>        NA        NA    Right      20.0
13  1997-10-13           Sümeyye Onat    155.00      42.5    Right      20.0
14  1998-02-03               Istanbul        NA        NA    Right      30.0
15  1998-06-10               İstanbul      1.59      69.0    Right      18.0
16  1998-05-17        Samsun, Türkiye    165.00      58.0    Right      19.0
17  1997-07-07          Mardin,Turkey    166.00      47.0    Right      20.0
18  1998-10-13       gaziantep turkey    182.00      78.0    Right      21.0
19  1998-06-09        İstanbul,Turkey    158.00      57.0    Right      19.0
20  2018-09-03        Yıldırım, BURSA      1.64      55.0    Right      20.0
21  1998-09-17        Istanbul/Turkey    173.00      55.0    Right       8.0
22  1998-07-28         Bursa / TURKEY    185.00      65.0     Left      22.0
23  1998-08-17                 Yalova    163.00      60.0    Right      15.0
24  1998-03-24            Ordu Turkey    167.00      50.0    Right      30.0
25  2018-04-24       Istanbul, Turkey        NA        NA    Right      19.0
26  1997-10-13               İstanbul    171.00      52.0    Right      25.0
27  1997-05-18        Edirne, Türkiye    165.00      54.0    Right      18.0
28  1997-01-14       Malatya, Türkiye    162.00      75.0     Left      18.0
29  1997-06-25                   <NA>    188.00     105.0    Right      20.0
30  1995-01-28  Türkiye/Hatay/Antakya      1.70      56.0     Left      18.0
31  2018-12-08               istanbul        NA        NA    Right      20.0
32  1997-07-03                  Çorum    160.00      50.0    Right      15.0
33  1996-01-04               İstanbul        NA        NA     Left      15.0
34  1997-01-05           Muğla/Turkey    178.00      67.0    Right      24.0
35  1997-12-26                   City    176.00      59.0    Right      24.0
36  1998-10-31       Istanbul, TURKEY    184.00      75.0    Right      22.0
37  1991-01-01                 Suriye    160.00      60.0    Right      19.0
38  1991-01-01                 Suriye    160.00      60.0    Right      19.0
39  1998-01-10        Yıldırım, Bursa    175.00     106.0    Right      15.0
40  1992-08-11          Malatya/Turky      1.80      94.0    Right      25.0
41  1999-05-02              Balıkesir    165.00      63.0     Left      17.0
42  1997-07-29       Istanbul/Türkiye      1.60      54.0    Right      20.0
43  1998-02-05  Nakhchivan/Azerbaijan      1.57      53.0    Right      20.0
44  1998-11-19             Azerbaijan    175.00      75.0    Right      20.0
45  1997-02-09           Sivas,Turkey    183.00      70.0    Right      20.0
46  1997-06-30                 Ankara    158.00      65.0    Right       8.0
47  1997-06-30                 Ankara    158.00      65.0    Right       8.0
48  1998-09-03                 Samsun    174.00      55.0    Right      22.0
49  1998-11-16          Adana,türkiye    163.00      68.0    Right      13.0
50  1999-05-23     Almaty, Kazakhstan    178.00      55.0    Right      12.0
51  1998-04-07               istanbul    165.00        NA    Right       9.0
52  1997-05-01        Antalya/Türkiye    173.00      80.0    Right      16.0
53  1996-09-26           Hatay/Turkey    175.00      77.0    Right      18.0
54  1993-03-14      Tekirdag / Turkey    195.00      85.0    Right      30.0
55  1997-12-06                 turkey    166.00      65.0    Right      15.0
56  1998-11-06           İzmir-Turkey    163.00      64.0    Right      15.0
57  1998-11-06           İzmir-Turkey    163.00      64.0    Right      15.0
58  1998-09-01             Van,Turkey    174.00      60.0    Right      24.0
59  2018-01-15          Bursa,türkiye    175.00      76.0    Right      20.0
60  1996-04-05         Tunceli,Turkey    173.00      56.0    Right      21.0
61  1994-01-01                 Aleppo        NA      78.0    Right      25.0
62        <NA>                   <NA>        NA        NA    Right      22.0
63        <NA>                   <NA>        NA        NA    Right      17.0
64  1996-03-09               İstanbul    177.00      77.0    Right      23.0
65  1996-10-25     Safranbolu/KARABUK    181.00      72.0     Left      26.0
66  1994-01-05                   <NA>        NA        NA    Right      25.0
67  1998-04-18               İstanbul    165.00      58.0    Right      20.5
68  1996-01-04               İstanbul        NA        NA     Left      20.0
69  1995-03-26                 YALOVA    168.00      66.0    Right      18.0
70  1994-08-18    Edremit (Balıkesir)      1.64      52.0    Right      19.0
71  1997-03-23           Turkmenistan    179.00        NA    Right      18.0
72  1996-09-26          Hatay/Antakya    175.00      73.0    Right      20.0
73  1993-03-14        Tekirdağ/Turkey    195.00      82.0    Right      25.0
74  2019-06-06               İstanbul    160.00      55.0    Right      17.0
75  1996-10-25               İstanbul    180.00      86.0    Right      23.0
76  1997-02-03                  Sivas    161.00      63.0    Right      18.0
77  1994-01-01           Aleppo/Syria    183.00      85.0    Right      22.0
78  1999-05-23     Almaty, Kazakhstan    178.00      58.0    Right      21.0
79  1998-04-07               istanbul    165.00      65.0    Right      20.0
80  1997-06-30                 Ankara    158.00      65.0    Right      14.0
81  2000-11-07         Konya, Türkiye    165.00      70.0    Right      18.0
82  2001-12-25         Afyon, Türkiye    169.00        NA    Right      21.0
83  1999-05-23       Antalya, Türkiye    167.00      47.0    Right      20.0
84  1994-01-05               tekirdag      1.80      82.0    Right      21.0
85  2001-11-01      İstanbul, Türkiye    162.00      70.0     Left      16.0
86  1997-06-12             Kırklareli    169.00      75.0    Right      20.0
87  1998-02-20          Aydın, Turkey    165.00      47.0    Right      21.0
88  1997-07-24        İstanbul,Turkey    168.00      72.0    Right      21.0
89  2000-12-28      Hannover, Germany    171.00        NA    Right      18.0
90  1998-12-28       Istanbul/ Turkey    171.00      61.0    Right      21.0
91  2001-07-04         Mersin, Turkey    184.00      79.0    Right      25.0
92  2000-01-22          TÜrkiye/Bursa    165.00      55.0    Right      14.0
93  1997-03-23           Turkmenistan    179.00        NA    Right      21.0
94  1999-10-29           Bodrum/Muğla    180.00      74.0     Left      23.0
95  2000-07-26 Afyonkarahisar, Turkey    164.00      47.0    Right      19.0
96  2000-04-15       Istanbul/ Turkey    156.00      54.0    Right      15.0
97  1998-01-21               Istanbul        NA        NA    Right      19.0
98  2000-12-06            Ordu/Turkey      1.63      60.0     Left      19.0
99  1997-05-18           VAN / TURKEY    183.00      74.0    Right      19.5
100 1995-10-09     OSMANGAZİ, TÜRKİYE    167.00      56.0    Right      17.0
101 1996-08-14         Manisa/ Turkey        NA        NA    Right      18.0
102 1998-08-02       Turkey /İstanbul      1.75      65.0    Right      20.0
103 1999-03-21       Istanbul, Turkey    162.00      49.0    Right      17.0
104 1999-05-02                 Turkey    168.00      63.0     Left      18.0
105 1997-02-03           Sivas,Turkey    161.00      65.0    Right      18.0
106 1999-06-22      İstanbul, Türkiye    165.00      47.0    Right      18.0
107 1994-01-01               istanbul    184.00      90.0    Right      23.0
108 1994-01-01               istanbul    184.00      90.0    Right      23.0
109 2001-08-01       Istanbul/ Turkey    162.00      76.0    Right      24.0
110 1993-03-14       Tekirdağ, Turkey    195.00      88.0    Right      24.0
111 1977-03-08               İstanbul    167.00      80.0    Right      22.0
112 2001-11-01       İstanbul/Türkiye    162.00      72.0     Left      16.0
113 2000-07-26 Afyonkarahisar, Turkey    164.00      47.0    Right      19.0
114 1998-12-28               İstanbul    171.00      61.0    Right      21.0
115 1991-11-15               Istanbul    192.00      95.0    Right      26.0
116 1996-01-18        istanbul,turkey    168.00      67.0    Right      21.0
117 1995-03-26       YALOVA / TÜRKİYE    168.00      80.0    Right      15.0

Data Frames

  • Bidimensional structures

  • Each column can be of a different type

  • All columns have the same length

  • All columns need a name

  • Usually too big to print
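
These properties can be checked directly. A short sketch with an invented data frame `d` (names and values are made up for this example):

```r
# Invented example with three columns of different types
d <- data.frame(
  name   = c("Ali", "Elif", "Deniz"),
  height = c(180, 165, 172),
  left   = c(FALSE, TRUE, FALSE)
)

sapply(d, class)    # one type per column
sapply(d, length)   # every column has the same length
colnames(d)         # every column has a name
dim(d)              # two dimensions: rows and columns
str(d)              # compact summary of the whole data frame
```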

Showing a big data frame

How can we see survey?

In RStudio we can use the command

View(survey)

But this does not work in R Markdown,

so we cannot use it in a paper or report

It’s easy to see the first rows

head(survey)
  answer_date     id                              english_level    sex
1  2018-09-17 3e501d                       I can speak fluently   Male
2  2018-09-17 479d88  I can understand movies without subtitles Female
3  2018-09-17 39df0d I can read and understand technical papers Female
4  2018-09-17 d2b091 I can read and understand technical papers   Male
5  2018-09-17 f22b12 I can read and understand technical papers Female
6  2018-09-17 849c75                       İngilizce bilmiyorum Female
   birthdate       birthplace height_cm weight_kg handness hand_span
1 1993-02-01           turkey    179.00        67    Right        15
2 1998-05-21    Kahramanmaraş      1.68        55    Right        14
3 1998-01-18  Batman, Türkiye        NA        NA    Right        18
4 1998-08-29   Antalya,Turkey    170.00        74    Right        25
5 1998-05-03            izmir    162.00        68    Right        13
6 1995-10-09 Türkiye / Yalova    167.00        58    Right        18

Notice that there are too many columns to fit in one line, so the output is split in two blocks

How many observations?

One basic question we need to answer is how many observations are in our data frame

In other words, we want to know the number of rows

Use the command

nrow(survey)
[1] 117

How many variables?

We also want to know the number of columns

ncol(survey)
[1] 10

Together, the numbers of rows and columns are called the dimensions

dim(survey)
[1] 117  10

What are the variable names?

Each column represents a variable

The column name is the name of the variable

colnames(survey)
 [1] "answer_date"   "id"            "english_level" "sex"          
 [5] "birthdate"     "birthplace"    "height_cm"     "weight_kg"    
 [9] "handness"      "hand_span"    

Accessing single columns

You can use $ to get the vector in each column

survey$weight_kg
  [1]  67.0  55.0    NA  74.0  68.0  58.0  72.0  68.0  58.0  55.0  81.0    NA
 [13]  42.5    NA  69.0  58.0  47.0  78.0  57.0  55.0  55.0  65.0  60.0  50.0
 [25]    NA  52.0  54.0  75.0 105.0  56.0    NA  50.0    NA  67.0  59.0  75.0
 [37]  60.0  60.0 106.0  94.0  63.0  54.0  53.0  75.0  70.0  65.0  65.0  55.0
 [49]  68.0  55.0    NA  80.0  77.0  85.0  65.0  64.0  64.0  60.0  76.0  56.0
 [61]  78.0    NA    NA  77.0  72.0    NA  58.0    NA  66.0  52.0    NA  73.0
 [73]  82.0  55.0  86.0  63.0  85.0  58.0  65.0  65.0  70.0    NA  47.0  82.0
 [85]  70.0  75.0  47.0  72.0    NA  61.0  79.0  55.0    NA  74.0  47.0  54.0
 [97]    NA  60.0  74.0  56.0    NA  65.0  49.0  63.0  65.0  47.0  90.0  90.0
[109]  76.0  88.0  80.0  72.0  47.0  61.0  95.0  67.0  80.0
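
Since each column is a vector, the usual vector commands work on it. A small sketch with invented weights, where NA marks a missing answer as in the survey:

```r
# Invented weights; NA marks a missing answer
d <- data.frame(weight_kg = c(67, 55, NA, 74, 68))

w <- d$weight_kg         # a numeric vector
mean(w)                  # NA, because one value is missing
mean(w, na.rm = TRUE)    # average of the values that are present
sum(is.na(w))            # how many values are missing
```

The option na.rm = TRUE tells mean() to ignore the missing values.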

Data privacy

This data is real, and belongs to you

To use it here, we deleted some of your personal data

It does not show your name, email or student number

Instead, there is an id column, unique to each person

Digital signature

head(survey$id)
[1] "3e501d" "479d88" "39df0d" "d2b091" "f22b12" "849c75"

The id column was created using a digital signature
(we discuss them in class 14)

The same id always means the same person, but privacy is preserved

This is one step to do a blind analysis

This is a way to keep anonymity

It is essential to keep the anonymity of patients’ data

And to avoid researcher bias
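
Because the same id always means the same person, we can find repeated answers without knowing who answered. A sketch with a short hand-made vector of ids (this vector is only an illustration, not the real column):

```r
# Hand-made vector of ids, only for illustration
id <- c("3a597c", "cbda9b", "3a597c", "213231", "cbda9b", "3a597c")

table(id)            # how many answers for each id
duplicated(id)       # TRUE when the id was already seen before
unique(id)           # each person only once
length(unique(id))   # number of different people
```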

Looking only at some rows

As with vectors, we want to choose which parts to see

We can use logic values to filter the rows

For example, we may want to know about left-handed people attending our course this year

survey$handness=="Left" & survey$answer_date >= "2020-01-01"

There are many ways to filter a data frame

For example, we can do this

subset(survey, handness=="Left" & answer_date >= "2020-01-01")
    answer_date     id                              english_level    sex
85   2020-10-19 242bf7  I can understand movies without subtitles Female
94   2020-10-19 5012ed I can read and understand technical papers   Male
98   2020-10-19 52b150  I can understand movies without subtitles Female
104  2020-10-22 412ea2  I can understand movies without subtitles Female
112  2020-11-05 242bf7  I can understand movies without subtitles Female
     birthdate        birthplace height_cm weight_kg handness hand_span
85  2001-11-01 İstanbul, Türkiye    162.00        70     Left        16
94  1999-10-29      Bodrum/Muğla    180.00        74     Left        23
98  2000-12-06       Ordu/Turkey      1.63        60     Left        19
104 1999-05-02            Turkey    168.00        63     Left        18
112 2001-11-01  İstanbul/Türkiye    162.00        72     Left        16

Here we do not need to write survey$
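
For comparison, here is a sketch of two of the "many ways" on an invented mini-survey. The data frame `mini` and its values are made up; only the column names follow the survey. Dates written as year-month-day text can be compared as text, because alphabetical order is the same as time order.

```r
# Invented mini-survey with the same column names as the real one
mini <- data.frame(
  answer_date = c("2018-09-17", "2020-10-19", "2020-10-19", "2020-10-22"),
  handness    = c("Right", "Left", "Right", "Left"),
  height_cm   = c(179, 162, 171, 168),
  stringsAsFactors = FALSE
)

# A logic vector, one value for each row
keep <- mini$handness == "Left" & mini$answer_date >= "2020-01-01"
keep

# Two ways to keep only the rows where the condition is TRUE
by_index  <- mini[keep, ]
by_subset <- subset(mini, handness == "Left" & answer_date >= "2020-01-01")
```

With the square brackets we must write mini$ every time. With subset() the column names are enough.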

Good and bad of Data frames

  • They are a good way to handle experimental data
  • Each column is a vector, that we can handle

but…

  • They are hard to see on the screen
  • Sometimes reading them gives funny results
    • As happened in class 12 when loading COVID-19 data

Modern data frames

Tibbles

In recent years people have improved data frames to make them easier to use

The new version is called tibble

All tibbles are data frames,
but not all data frames are tibbles
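
This can be checked in R. A minimal sketch, assuming the tibble package is installed; `df` and `tb` are invented names:

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

class(df)            # "data.frame"
class(tb)            # "tbl_df" "tbl" "data.frame"

is.data.frame(tb)    # TRUE: a tibble is a data frame
is_tibble(df)        # FALSE: a plain data frame is not a tibble
```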

Loading data into a tibble

The easiest way to load data is to use the menu

Environment → Import Dataset → From Text (readr)…

Or directly in the command line

library(readr)
students <- read_tsv("students2018-2020.tsv")

── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  answer_date = col_date(format = ""),
  id = col_character(),
  english_level = col_character(),
  sex = col_character(),
  birthdate = col_date(format = ""),
  birthplace = col_character(),
  height_cm = col_double(),
  weight_kg = col_double(),
  handness = col_character(),
  hand_span = col_double()
)

(we will explain library(readr) later)
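
The column specification printed by read_tsv() can also be given by hand, so readr does not have to guess. A sketch with a tiny invented file (the file content and the name `mini` are assumptions for this example):

```r
library(readr)

# A tiny invented file with the same kind of columns
path <- tempfile(fileext = ".tsv")
writeLines(c("answer_date\tid\theight_cm",
             "2020-10-19\tabc123\t165",
             "2020-10-21\tdef456\t1.75"), path)

# Declare the type of each column instead of letting readr guess
mini <- read_tsv(path, col_types = cols(
  answer_date = col_date(format = ""),
  id          = col_character(),
  height_cm   = col_double()
))
mini
```

When col_types is given, the column specification message is not shown.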

The tibble looks like this

students
# A tibble: 117 x 10
   answer_date id    english_level sex   birthdate  birthplace height_cm
   <date>      <chr> <chr>         <chr> <date>     <chr>          <dbl>
 1 2018-09-17  3e50… I can speak … Male  1993-02-01 turkey        179   
 2 2018-09-17  479d… I can unders… Fema… 1998-05-21 Kahramanm…      1.68
 3 2018-09-17  39df… I can read a… Fema… 1998-01-18 Batman, T…     NA   
 4 2018-09-17  d2b0… I can read a… Male  1998-08-29 Antalya,T…    170   
 5 2018-09-17  f22b… I can read a… Fema… 1998-05-03 izmir         162   
 6 2018-09-17  849c… İngilizce bi… Fema… 1995-10-09 Türkiye /…    167   
 7 2018-09-17  8381… I can speak … Fema… 1997-09-19 Adıyaman,…    174   
 8 2018-09-17  b0dd… I can read a… Male  1997-11-27 Bursa         180   
 9 2018-09-17  2972… I can read a… Fema… 1999-01-02 İstanbul/…    162   
10 2018-09-17  72c0… I can read a… Fema… 1998-10-02 İstanbul,…    172   
# … with 107 more rows, and 3 more variables: weight_kg <dbl>, handness <chr>,
#   hand_span <dbl>

This is much easier to read

Tibble dimensions

These commands work in tibbles as in data frames

dim(students)
[1] 117  10
nrow(students)
[1] 117
ncol(students)
[1] 10

Selecting columns by name

As before, we can ask for column names

colnames(students)
 [1] "answer_date"   "id"            "english_level" "sex"          
 [5] "birthdate"     "birthplace"    "height_cm"     "weight_kg"    
 [9] "handness"      "hand_span"    

Each column can be accessed by its name

table(students$handness)

 Left Right 
   12   105 

The question for this class

What is the height of left-handed people?

To answer this question, we need new tools

Let’s get new tools for our R

Adding commands to R

What is library(readr)?

Remember how we read data from the file

library(readr)
students <- read_tsv("students2018-2020.tsv")

Now we will explain library(readr):

We use it to enable the read_tsv() command

Base commands and extensions

Out of the box, your R system has many commands

But there are more commands, that you can also use

The new commands are in packages or libraries

To enable a package, we use the command library()

Packages currently installed

Use library() with installed packages

If you click on the package name, you can see its commands

To use them, write library(package name)

You need to do this once in every session

What if you need more packages?

The “App Store”

If the package is not in your computer,
you need to use install.packages()

This command downloads new packages from the web

We install only one time

We load every time we need them
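
A sketch of "install one time, load every time". The helper `is_installed` is a name made up for this example; it only wraps the base command requireNamespace(). The install line is left as a comment so nothing is downloaded by accident.

```r
# TRUE if the package is already in this computer
# (is_installed is a helper name invented for this example)
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")    # comes with R, so it is always there

# Install only when missing (one time), then load (every session)
if (!is_installed("readr")) {
  # install.packages("readr")   # uncomment to download it from the web
  message("readr is not installed yet")
} else {
  library(readr)
}
```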

Install a package in RStudio

You can use the menu Packages → Install

We need several new packages

To work with tibbles we need to install several packages

  • dplyr,
  • ggplot2,
  • magrittr,
  • readr,
  • readxl,
  • tidyr,
  • and others

This set of packages is called tidyverse

Let’s install tidyverse

It is easier in the command line

In the command line, you write

install.packages("tidyverse")

This command will download all the packages
and store them in your computer

You only need to do this one time.

Tidyverse has several parts

We will use several packages from tidyverse

There is a lot of free material online

Read it. Watch it

Today we use only the dplyr package

Load the dplyr package

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Do not worry about these messages

We will deal with them later

Choosing some rows

We can easily choose the relevant rows

filter(students, handness=="Left" & answer_date > "2020-01-01")
# A tibble: 5 x 10
  answer_date id    english_level sex   birthdate  birthplace height_cm
  <date>      <chr> <chr>         <chr> <date>     <chr>          <dbl>
1 2020-10-19  242b… I can unders… Fema… 2001-11-01 İstanbul,…    162   
2 2020-10-19  5012… I can read a… Male  1999-10-29 Bodrum/Mu…    180   
3 2020-10-19  52b1… I can unders… Fema… 2000-12-06 Ordu/Turk…      1.63
4 2020-10-22  412e… I can unders… Fema… 1999-05-02 Turkey        168   
5 2020-11-05  242b… I can unders… Fema… 2001-11-01 İstanbul/…    162   
# … with 3 more variables: weight_kg <dbl>, handness <chr>, hand_span <dbl>

(notice that we use == for comparisons)
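
A small sketch of filter() on invented rows, assuming dplyr is installed. The tibble `mini` and its values are made up. It also shows what happens with missing values: rows where the condition is NA are not kept.

```r
library(dplyr)

# Invented rows; one person did not answer the handness question
mini <- tibble(
  id        = c("a", "b", "c", "d"),
  handness  = c("Left", "Right", NA, "Left"),
  height_cm = c(162, 180, 171, 168)
)

# Keep the rows where the condition is TRUE
lefties <- filter(mini, handness == "Left")
lefties

# Several conditions can be combined with &
tall_lefties <- filter(mini, handness == "Left" & height_cm > 165)
tall_lefties
```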

Choosing some columns

select(students, weight_kg, height_cm)
# A tibble: 117 x 2
   weight_kg height_cm
       <dbl>     <dbl>
 1        67    179   
 2        55      1.68
 3        NA     NA   
 4        74    170   
 5        68    162   
 6        58    167   
 7        72    174   
 8        68    180   
 9        58    162   
10        55    172   
# … with 107 more rows

Selecting rows

We can store the filtered rows in a variable, and then select columns from it

left_handed <- filter(students, handness=="Left" & answer_date > "2020-01-01")
select(left_handed, answer_date, weight_kg, height_cm)
# A tibble: 5 x 3
  answer_date weight_kg height_cm
  <date>          <dbl>     <dbl>
1 2020-10-19         70    162   
2 2020-10-19         74    180   
3 2020-10-19         60      1.63
4 2020-10-22         63    168   
5 2020-11-05         72    162   

Another way to do assignment

Normally we use <- for assignment

x <- 2

There is another way, that is sometimes nicer

2 -> x

The -> arrow goes from the value to the variable

We can rewrite the combination

filter(students, handness=="Left" & answer_date > "2020-01-01") -> left_handed 
select(left_handed, answer_date, weight_kg, height_cm)
# A tibble: 5 x 3
  answer_date weight_kg height_cm
  <date>          <dbl>     <dbl>
1 2020-10-19         70    162   
2 2020-10-19         74    180   
3 2020-10-19         60      1.63
4 2020-10-22         63    168   
5 2020-11-05         72    162   

left_handed is an intermediate variable

We use it only for one step. We don’t need it at the end

Skip the intermediate variable

filter(students, handness=="Left" & answer_date > "2020-01-01") %>% select(answer_date, weight_kg, height_cm)
# A tibble: 5 x 3
  answer_date weight_kg height_cm
  <date>          <dbl>     <dbl>
1 2020-10-19         70    162   
2 2020-10-19         74    180   
3 2020-10-19         60      1.63
4 2020-10-22         63    168   
5 2020-11-05         72    162   

The key thing is %>%, called pipe

Write it in several lines

filter(students, handness=="Left" & answer_date > "2020-01-01") %>% 
    select(answer_date, weight_kg, height_cm)
# A tibble: 5 x 3
  answer_date weight_kg height_cm
  <date>          <dbl>     <dbl>
1 2020-10-19         70    162   
2 2020-10-19         74    180   
3 2020-10-19         60      1.63
4 2020-10-22         63    168   
5 2020-11-05         72    162   

If you write %>% at the end of the line, you can continue in the next line

Pipe magic

The %>% symbol helps us to write clear code.

Instead of

y <- min(x, z)

we write

x %>% min(z) -> y

The first function input is taken from the pipe

Longer pipelines

Instead of

y <- min(sqrt(sin(x)), z)

we write

x %>% sin() %>% sqrt() %>% min(z) -> y

We can read %>% as “then”

“Take x, then calculate sine, then square root, then take the smallest of the result and z, and store it in y”
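
To check that both ways give the same answer, here is a sketch with invented values for x and z, assuming the magrittr package is installed:

```r
library(magrittr)

x <- c(0.5, 1, 2)   # invented values
z <- 0.8

# Nested version: read from the inside out
y_nested <- min(sqrt(sin(x)), z)

# Pipe version: read from left to right
x %>% sin() %>% sqrt() %>% min(z) -> y_piped

identical(y_nested, y_piped)   # TRUE
```

The pipe does not change the result. It only changes the order in which we read the code.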

Homework

Cultural Research

The package providing pipes is called magrittr

Why?

Tell me in the next class

(no writing necessary)