Handling Lists and Data Frames

November 26, 2019

Sorting

Sorting a vector

We have our own data

survey <- read.table("survey1-tidy.txt")
survey$weight

 [1]  67.0  58.0  56.0  94.0  60.0  77.0  56.0  75.0
 [9]  80.0 105.0  59.0  70.0  57.0  50.0  78.0  55.0
[17] 106.0  68.0  68.0  65.0  76.0  42.5  55.0  69.0
[25]  60.0  58.0  52.0  47.0  65.0  67.0  68.0  74.0
[33]  55.0  55.0  60.0  50.0  55.0  58.0  75.0  53.0
[41]  81.0  54.0  55.0  72.0  65.0  64.0  54.0  85.0
[49]  63.0  75.0  77.0

How can we sort `weight`?

sort(survey$weight)

 [1]  42.5  47.0  50.0  50.0  52.0  53.0  54.0  54.0
 [9]  55.0  55.0  55.0  55.0  55.0  55.0  56.0  56.0
[17]  57.0  58.0  58.0  58.0  59.0  60.0  60.0  60.0
[25]  63.0  64.0  65.0  65.0  65.0  67.0  67.0  68.0
[33]  68.0  68.0  69.0  70.0  72.0  74.0  75.0  75.0
[41]  75.0  76.0  77.0  77.0  78.0  80.0  81.0  85.0
[49]  94.0 105.0 106.0

How to sort from large to small?

sort(survey$weight, decreasing=TRUE)

 [1] 106.0 105.0  94.0  85.0  81.0  80.0  78.0  77.0
 [9]  77.0  76.0  75.0  75.0  75.0  74.0  72.0  70.0
[17]  69.0  68.0  68.0  68.0  67.0  67.0  65.0  65.0
[25]  65.0  64.0  63.0  60.0  60.0  60.0  59.0  58.0
[33]  58.0  58.0  57.0  56.0  56.0  55.0  55.0  55.0
[41]  55.0  55.0  55.0  54.0  54.0  53.0  52.0  50.0
[49]  50.0  47.0  42.5

How to sort the complete data frame?

The command sort() works only for vectors

To sort a data frame, we first need to choose which column to order by

We already know how to find the positions of the smallest and the largest values

Position of the smallest and largest

which.min(survey$weight)

[1] 22

which.max(survey$weight)

[1] 17

We need the positions in between

For that we use the order() command

Let’s verify that these positions are correct

survey$weight

 [1]  67.0  58.0  56.0  94.0  60.0  77.0  56.0  75.0
 [9]  80.0 105.0  59.0  70.0  57.0  50.0  78.0  55.0
[17] 106.0  68.0  68.0  65.0  76.0  42.5  55.0  69.0
[25]  60.0  58.0  52.0  47.0  65.0  67.0  68.0  74.0
[33]  55.0  55.0  60.0  50.0  55.0  58.0  75.0  53.0
[41]  81.0  54.0  55.0  72.0  65.0  64.0  54.0  85.0
[49]  63.0  75.0  77.0

Using `order()` to sort a data frame

order(survey$weight)

 [1] 22 28 14 36 27 40 42 47 16 23 33 34 37 43  3  7 13
[18]  2 26 38 11  5 25 35 49 46 20 29 45  1 30 18 19 31
[35] 24 12 44 32  8 39 50 21  6 51 15  9 41 48  4 10 17

This gives us the position of the smallest, the second smallest, and so on up to the largest
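
To make the link between `sort()` and `order()` concrete, here is a minimal sketch on a small vector. The values in `w` are invented for this illustration and are not part of the survey. Indexing a vector with its own `order()` gives the same values as `sort()`, and the first and last positions agree with `which.min()` and `which.max()` when there are no ties.

```r
# Hypothetical weights, invented for this illustration
w <- c(67, 58, 94, 42.5, 60)

pos <- order(w)      # positions from smallest to largest
pos                  # 4 2 5 1 3

w[pos]               # same values as sort(w)
sort(w)

pos[1] == which.min(w)           # TRUE: position of the smallest
pos[length(w)] == which.max(w)   # TRUE: position of the largest
```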

Then we do `survey[order(survey$weight),]`

     Gender birth_day birth_month birth_year height_cm weight_kg handness
st22 Female        13          10       1997       155      42.5    Right
st28 Female         7           7       1997       166      47.0    Right
st14 Female         3           7       1997       160      50.0    Right
st36 Female        24           3       1998       167      50.0    Right
st27 Female        13          10       1997       171      52.0    Right
st40 Female         5           2       1998       157      53.0    Right
st42 Female        18           5       1997       165      54.0    Right
st47 Female        29           7       1997       160      54.0    Right
st16 Female         3           9       2018       164      55.0    Right
st23 Female         2          10       1998       172      55.0    Right
st33 Female        21           5       1998       168      55.0    Right
st34 Female         3           9       1998       174      55.0    Right
st37 Female        17           9       1998       173      55.0    Right
st43 Female        23           5       1999       178      55.0    Right
st3  Female        28           1       1995       170      56.0     Left
st7  Female         5           4       1996       173      56.0    Right
st13 Female         9           6       1998       158      57.0    Right
st2  Female         9          10       1995       167      58.0    Right
st26 Female        17           5       1998       165      58.0    Right
st38 Female         2           1       1999       162      58.0    Right
st11   Male        26          12       1997       176      59.0    Right
st5  Female         1           1       1991       160      60.0    Right
st25 Female        17           8       1998       163      60.0    Right
st35 Female         1           9       1998       174      60.0    Right
st49 Female         2           5       1999       165      63.0     Left
st46   Male         6          11       1998       163      64.0    Right
st20 Female        30           6       1997       158      65.0    Right
st29   Male        28           7       1998       185      65.0     Left
st45   Male         6          12       1997       166      65.0    Right
st1    Male         1           2       1993       179      67.0    Right
st30   Male         5           1       1997       178      67.0    Right
st18 Female        16          11       1998       163      68.0    Right
st19 Female         3           5       1998       162      68.0    Right
st31   Male        27          11       1997       180      68.0    Right
st24 Female        10           6       1998       159      69.0    Right
st12   Male         9           2       1997       183      70.0    Right
st44 Female        19           9       1997       174      72.0    Right
st32   Male        29           8       1998       170      74.0    Right
st8  Female        14           1       1997       162      75.0     Left
st39   Male        19          11       1998       175      75.0    Right
st50   Male        31          10       1998       184      75.0    Right
st21   Male        15           1       2018       175      76.0    Right
st6    Male        26           9       1996       175      77.0    Right
st51   Male         9           3       1996       177      77.0    Right
st15   Male        13          10       1998       182      78.0    Right
st9    Male         1           5       1997       173      80.0    Right
st41   Male        18           5       1997       181      81.0    Right
st48   Male        14           3       1993       195      85.0    Right
st4    Male        11           8       1992       180      94.0    Right
st10   Male        25           6       1997       188     105.0    Right
st17   Male        10           1       1998       175     106.0    Right
     hand_span_cm
st22           20
st28           20
st14           15
st36           30
st27           25
st40           20
st42           18
st47           20
st16           20
st23           20
st33           14
st34           22
st37            8
st43           12
st3            18
st7            21
st13           19
st2            18
st26           19
st38           19
st11           24
st5            19
st25           15
st35           24
st49           17
st46           15
st20            8
st29           22
st45           15
st1            15
st30           24
st18           13
st19           13
st31           19
st24           18
st12           20
st44           16
st32           25
st8            18
st39           20
st50           22
st21           20
st6            18
st51           23
st15           21
st9            16
st41           20
st48           30
st4            25
st10           20
st17           15
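
The same idea extends to other orderings. The sketch below uses a small data frame called `students`, invented for this illustration, to show two variations that `order()` supports: sorting from large to small with `decreasing = TRUE`, and giving a second column to decide the order of rows with equal weight.

```r
# Hypothetical data frame, invented for this illustration
students <- data.frame(
  height_cm = c(179, 167, 170, 160),
  weight_kg = c(67, 58, 58, 94),
  row.names = c("st1", "st2", "st3", "st4")
)

# Heaviest first
by_weight_desc <- students[order(students$weight_kg, decreasing = TRUE), ]
by_weight_desc

# Lightest first; rows with the same weight are ordered by height
by_weight_height <- students[order(students$weight_kg, students$height_cm), ]
by_weight_height
```

In both cases the row names travel with the rows, so it is still possible to see which student each line belongs to.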

The “App Store”

Packages and Libraries

All interesting things in R are done using functions
- We recognize them because they use ()
Functions on the same subject are grouped in a package
To use the functions of a package, we first load it with library()
If the package is not installed on your computer, install it first with install.packages()
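
As a rough sketch of these steps, the code below uses the `stats` package, which ships with R, so that it runs without installing anything. The `install.packages()` line is left as a comment because it needs an internet connection. The `pkg::fun()` form shown at the end is an extra that is not discussed above.

```r
# requireNamespace() reports whether a package is installed, without loading it
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats

# Install only when missing (commented out: it needs an internet connection)
# if (!requireNamespace("knitr", quietly = TRUE)) install.packages("knitr")

# After library(), the functions of the package can be called by name
library(stats)
m1 <- median(c(3, 1, 2))
m1

# pkg::fun() calls one function without loading the whole package
m2 <- stats::median(c(3, 1, 2))
m2
```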

`knitr`: a package for Rmarkdown

Knitr is the system that merges R code and Markdown to produce documents that depend on data

It has many functions. We used two of them:

knitr::kable() is a function to produce nicer tables
- The mandatory input is a data frame (a matrix also works)
- It is similar to the function pander() from the pander package
knitr::opts_chunk$set() to set the default options for each chunk

Without `kable()`

survey[1:5,]

    Gender birth_day birth_month birth_year height_cm weight_kg handness
st1   Male         1           2       1993       179        67    Right
st2 Female         9          10       1995       167        58    Right
st3 Female        28           1       1995       170        56     Left
st4   Male        11           8       1992       180        94    Right
st5 Female         1           1       1991       160        60    Right
    hand_span_cm
st1           15
st2           18
st3           18
st4           25
st5           19

With `kable()`

library(knitr)
kable(survey[1:5,])

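|    |Gender | birth_day| birth_month| birth_year| height_cm| weight_kg|handness | hand_span_cm|
|:---|:------|---------:|-----------:|----------:|---------:|---------:|:--------|------------:|
|st1 |Male   |         1|           2|       1993|       179|        67|Right    |           15|
|st2 |Female |         9|          10|       1995|       167|        58|Right    |           18|
|st3 |Female |        28|           1|       1995|       170|        56|Left     |           18|
|st4 |Male   |        11|           8|       1992|       180|        94|Right    |           25|
|st5 |Female |         1|           1|       1991|       160|        60|Right    |           19|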
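
`kable()` also has optional inputs. The sketch below shows two of them, `digits` and `caption`, together with a typical call to `opts_chunk$set()`. The data frame `small` is invented for this illustration, and `echo = FALSE` is only one example of a default chunk option.

```r
library(knitr)

# Hypothetical data frame, invented for this illustration
small <- data.frame(
  height_cm = c(179, 167),
  weight_kg = c(67.25, 58.5),
  row.names = c("st1", "st2")
)

# digits rounds numeric columns; caption adds a title to the table
tab <- kable(small, digits = 1, caption = "First two students")
tab

# Default chunk options, usually set in the first chunk of the document
opts_chunk$set(echo = FALSE)
```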

Some hints

Use library(knitr) inside the document before using any function of the package
Remember that the Rmarkdown document is independent of the Console: packages loaded in the Console are not loaded in the document
Save your document on the X: drive (when using lab computers)

Data frame shapes

Wide data has a column for each variable

Using the data from the exam

world

  income population    area
1   1810   31700000  653000
2  10500    2920000   28800
3  13300   38300000 2380000
4   6190   26000000 1250000
5  18900      97800     440
6  19500   42500000 2780000

Long data has one `value` column

     variable    value
1      income     1810
2      income    10500
3      income    13300
4      income     6190
5      income    18900
6      income    19500
7  population 31700000
8  population  2920000
9  population 38300000
10 population 26000000
11 population    97800
12 population 42500000
13       area   653000
14       area    28800
15       area  2380000
16       area  1250000
17       area      440
18       area  2780000

Changing the shape

We use the reshape2 package

melt takes wide-format data and melts it into long-format data.
cast takes long-format data and casts it into wide-format data.

Think of working with metal:

if you melt metal, it drips and becomes long
if you cast it into a mould, it becomes wide

Melting

library(reshape2)
melt(world, id=NULL)

     variable    value
1      income     1810
2      income    10500
3      income    13300
4      income     6190
5      income    18900
6      income    19500
7  population 31700000
8  population  2920000
9  population 38300000
10 population 26000000
11 population    97800
12 population 42500000
13       area   653000
14       area    28800
15       area  2380000
16       area  1250000
17       area      440
18       area  2780000

Melting with text columns

Consider this case

countries

              country income population    area
1         Afghanistan   1810   31700000  653000
2             Albania  10500    2920000   28800
3             Algeria  13300   38300000 2380000
4              Angola   6190   26000000 1250000
5 Antigua and Barbuda  18900      97800     440
6           Argentina  19500   42500000 2780000

The `country` column is the identifier

library(reshape2)
melt(countries, id="country")

               country   variable    value
1          Afghanistan     income     1810
2              Albania     income    10500
3              Algeria     income    13300
4               Angola     income     6190
5  Antigua and Barbuda     income    18900
6            Argentina     income    19500
7          Afghanistan population 31700000
8              Albania population  2920000
9              Algeria population 38300000
10              Angola population 26000000
11 Antigua and Barbuda population    97800
12           Argentina population 42500000
13         Afghanistan       area   653000
14             Albania       area    28800
15             Algeria       area  2380000
16              Angola       area  1250000
17 Antigua and Barbuda       area      440
18           Argentina       area  2780000
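
The names of the two new columns can be chosen when melting. This is a minimal sketch with a data frame called `wide`, invented for this illustration. It writes the identifier argument with its full name, `id.vars`, and uses the `variable.name` and `value.name` arguments of `melt()`.

```r
library(reshape2)

# Hypothetical data frame, invented for this illustration
wide <- data.frame(
  country = c("A", "B"),
  income = c(1810, 10500),
  area = c(653000, 28800)
)

# id.vars names the identifier column(s); the other columns are melted.
# variable.name and value.name rename the two new columns.
long <- melt(wide, id.vars = "country",
             variable.name = "indicator", value.name = "amount")
long

nrow(long)   # 4: two countries times two measured columns
```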

Long- to wide-format

Going from wide- to long-format data is easy
Going from long- to wide-format data needs more care
reshape2 has several cast functions, for different output structures
For data frames we use dcast
There is also acast, which returns a vector, matrix, or array, but we will not use it in this course

We start with long format

long <- melt(countries, id="country")
long

               country   variable    value
1          Afghanistan     income     1810
2              Albania     income    10500
3              Algeria     income    13300
4               Angola     income     6190
5  Antigua and Barbuda     income    18900
6            Argentina     income    19500
7          Afghanistan population 31700000
8              Albania population  2920000
9              Algeria population 38300000
10              Angola population 26000000
11 Antigua and Barbuda population    97800
12           Argentina population 42500000
13         Afghanistan       area   653000
14             Albania       area    28800
15             Algeria       area  2380000
16              Angola       area  1250000
17 Antigua and Barbuda       area      440
18           Argentina       area  2780000

Now we transform it

dcast(long, country~variable)

              country income population    area
1         Afghanistan   1810   31700000  653000
2             Albania  10500    2920000   28800
3             Algeria  13300   38300000 2380000
4              Angola   6190   26000000 1250000
5 Antigua and Barbuda  18900      97800     440
6           Argentina  19500   42500000 2780000
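
One reason why casting needs more care is that the long data may contain the same identifier and variable more than once. In that case `dcast()` has to combine the repeated values, and if no function is given it counts them instead of keeping them. The sketch below uses a data frame `long2`, invented for this illustration, and asks for the mean of the repeated values. It also names the value column explicitly with `value.var`.

```r
library(reshape2)

# Hypothetical long data with a repeated (country, variable) pair
long2 <- data.frame(
  country = c("A", "A", "A", "B", "B"),
  variable = c("income", "income", "area", "income", "area"),
  value = c(100, 300, 50, 400, 70)
)

# value.var names the column holding the values explicitly;
# fun.aggregate says how to combine repeated pairs (here, their mean)
wide2 <- dcast(long2, country ~ variable,
               value.var = "value", fun.aggregate = mean)
wide2
```

Here country A has two income values, 100 and 300, so its income in `wide2` is their mean.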

Not all data is a data frame

When data has no structure

So far all the files we have used are structured

That is, they have rows and columns

We use read.table and write.table to read and write a data frame

Sometimes the data is not a table

Example

people <- list(Ali=list(age=18, sex='M'), Bahar=list(age=19, sex='F'), valid=c(TRUE,FALSE))
people

$Ali
$Ali$age
[1] 18

$Ali$sex
[1] "M"


$Bahar
$Bahar$age
[1] 19

$Bahar$sex
[1] "F"


$valid
[1]  TRUE FALSE
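
The elements of a list like this one are reached by name, one level at a time. The sketch below repeats the definition of `people` so that it runs on its own. The `sapply()` call at the end is one possible way to collect a value from each person and is not part of the example above.

```r
people <- list(Ali = list(age = 18, sex = "M"),
               Bahar = list(age = 19, sex = "F"),
               valid = c(TRUE, FALSE))

names(people)              # names of the top-level elements
people$Ali$age             # $ follows one name at each level
people[["Bahar"]][["sex"]] # [[ ]] does the same with a name in quotes
people$valid[2]            # vectors inside a list are indexed as usual

# Collect the ages of the two persons into a named vector
ages <- sapply(people[c("Ali", "Bahar")], function(p) p$age)
ages
```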

How can we read and write lists?

YAML: format for lists

There are several options to store lists into files.
A good one is YAML, which looks like this:

Ali:
  age: 18.0
  sex: M
Bahar:
  age: 19.0
  sex: F
valid:
- yes
- no

Rules for YAML files

Each top-level list element starts in the first column, with no spaces before it
The inner list elements are indented with 2 spaces
You can have lists inside lists inside lists…
Names and values are separated by : followed by a space
Vector elements are marked with -
When used inside Rmarkdown, put --- before and after the YAML code
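
To check whether a piece of YAML follows these rules, one option is to let R parse it. This is a small sketch using `yaml.load()` from the yaml package, which reads YAML from a text string instead of a file. The variable names `txt` and `parsed` are chosen for this illustration.

```r
library(yaml)

# YAML text following the rules above (indentation is 2 spaces)
txt <- "
Ali:
  age: 18.0
  sex: M
Bahar:
  age: 19.0
  sex: F
valid:
- yes
- no
"

parsed <- yaml.load(txt)
str(parsed)          # a list with elements Ali, Bahar and valid
parsed$Bahar$age
```

If the indentation is wrong, for example a tab instead of spaces, `yaml.load()` stops with an error instead of returning a list.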

Google “YAML” for more info

You have seen YAML before

We use YAML for the Rmarkdown metadata. For example

---
title: "Midterm Exam"
subtitle: "Computing in Molecular Biology 1"
author: "Put your name here"
number: STUDENT_NUMBER
date: "October 25, 2018"
output: html_document
---

ALWAYS write your name and student number

Reading and writing YAML in R

library(yaml)
write_yaml(people, "datafile.yml")
persons <- read_yaml("datafile.yml")
persons

$Ali
$Ali$age
[1] 18

$Ali$sex
[1] "M"


$Bahar
$Bahar$age
[1] 19

$Bahar$sex
[1] "F"


$valid
[1]  TRUE FALSE
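
A variation of the same round trip, as a sketch: `as.yaml()` shows the YAML text without writing a file, and `tempfile()` gives a temporary file name so that nothing is created in the working folder. The use of a temporary file is a choice made for this illustration. In practice you would use a file name of your own, as above.

```r
library(yaml)

people <- list(Ali = list(age = 18, sex = "M"),
               Bahar = list(age = 19, sex = "F"),
               valid = c(TRUE, FALSE))

# as.yaml() returns the YAML text, useful to preview what will be written
cat(as.yaml(people))

# Write to a temporary file, so that nothing in the working folder changes
path <- tempfile(fileext = ".yml")
write_yaml(people, path)
persons <- read_yaml(path)

persons$Bahar$age == people$Bahar$age   # the value survives the round trip
```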

Use YAML for bibliography

references:
- type: article-journal
  id: WatsonCrick1953
  title: 'Molecular structure of nucleic acids: a structure for
    deoxyribose nucleic acid'
  author:
  - family: Watson
    given: J. D.
  - family: Crick
    given: F. H. C.
  container-title: Nature
  volume: 171
  issue: 4356
  page: 737-738
  issued:
    date-parts:
    - - 1953
      - 4
      - 25

How to use it

Put all the references somewhere in the document, with --- before and after.

`[@WatsonCrick1953]` becomes (Watson and Crick 1953).
`[@WatsonCrick1953, pp. 33-35, 38-39]` becomes (Watson and Crick 1953, 33–35, 38–39).
`[@WatsonCrick1953; @Collado-Vides2009a]` becomes (Watson and Crick 1953; Collado-Vides et al. 2009).
`@WatsonCrick1953 [p. 33] says blah` becomes Watson and Crick (1953, 33) says blah.

External bibliographies

If you have a long list of papers and you use it in several documents, then you should put the references in a separate file

Then you write

bibliography: references.yml

in the document metadata

Bibliography at the end of document

Bibliographies will be placed at the end of the document. Normally, you will want to end your document like this:

last paragraph...

# References

The bibliography will be inserted after this header. More info at

http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html

References

Collado-Vides, J, H Salgado, E Morett, S Gama-Castro, V Jiménez-Jacinto, I Martínez-Flores, A Medina-Rivera, L Muñiz-Rascado, M Peralta-Gil, and A Santos-Zavaleta. 2009. “Bioinformatics Resources for the Study of Gene Regulation in Bacteria.” Journal of Bacteriology 191 (1): 23–31.

Watson, J. D., and F. H. C. Crick. 1953. “Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid.” Nature 171 (4356): 737–38. https://doi.org/10.1038/171737a0.