November 26, 2019
We have our own data
survey <- read.table("survey1-tidy.txt")
survey$weight
 [1]  67.0  58.0  56.0  94.0  60.0  77.0  56.0  75.0
 [9]  80.0 105.0  59.0  70.0  57.0  50.0  78.0  55.0
[17] 106.0  68.0  68.0  65.0  76.0  42.5  55.0  69.0
[25]  60.0  58.0  52.0  47.0  65.0  67.0  68.0  74.0
[33]  55.0  55.0  60.0  50.0  55.0  58.0  75.0  53.0
[41]  81.0  54.0  55.0  72.0  65.0  64.0  54.0  85.0
[49]  63.0  75.0  77.0
How do we sort weight?
sort(survey$weight)
 [1]  42.5  47.0  50.0  50.0  52.0  53.0  54.0  54.0
 [9]  55.0  55.0  55.0  55.0  55.0  55.0  56.0  56.0
[17]  57.0  58.0  58.0  58.0  59.0  60.0  60.0  60.0
[25]  63.0  64.0  65.0  65.0  65.0  67.0  67.0  68.0
[33]  68.0  68.0  69.0  70.0  72.0  74.0  75.0  75.0
[41]  75.0  76.0  77.0  77.0  78.0  80.0  81.0  85.0
[49]  94.0 105.0 106.0
sort(survey$weight, decreasing=TRUE)
 [1] 106.0 105.0  94.0  85.0  81.0  80.0  78.0  77.0
 [9]  77.0  76.0  75.0  75.0  75.0  74.0  72.0  70.0
[17]  69.0  68.0  68.0  68.0  67.0  67.0  65.0  65.0
[25]  65.0  64.0  63.0  60.0  60.0  60.0  59.0  58.0
[33]  58.0  58.0  57.0  56.0  56.0  55.0  55.0  55.0
[41]  55.0  55.0  55.0  54.0  54.0  53.0  52.0  50.0
[49]  50.0  47.0  42.5
The command sort() works only for vectors
To sort a data frame, we first need to choose which column we use to order
We know the position of the smallest and the largest
which.min(survey$weight)
[1] 22
which.max(survey$weight)
[1] 17
We need the positions in between
For that we use the order() command
survey$weight
 [1]  67.0  58.0  56.0  94.0  60.0  77.0  56.0  75.0
 [9]  80.0 105.0  59.0  70.0  57.0  50.0  78.0  55.0
[17] 106.0  68.0  68.0  65.0  76.0  42.5  55.0  69.0
[25]  60.0  58.0  52.0  47.0  65.0  67.0  68.0  74.0
[33]  55.0  55.0  60.0  50.0  55.0  58.0  75.0  53.0
[41]  81.0  54.0  55.0  72.0  65.0  64.0  54.0  85.0
[49]  63.0  75.0  77.0
order() to sort a data frame
order(survey$weight)
 [1] 22 28 14 36 27 40 42 47 16 23 33 34 37 43  3  7 13
[18]  2 26 38 11  5 25 35 49 46 20 29 45  1 30 18 19 31
[35] 24 12 44 32  8 39 50 21  6 51 15  9 41 48  4 10 17
This gives us the position of the smallest, the second smallest, and so on up to the largest
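To see how order() relates to sort(), which.min(), and which.max(), here is a minimal sketch on a short vector. The vector x holds just the first five survey weights, and the names x and idx are chosen only for illustration.

```r
# First five weights from the survey, used as a small illustration
x <- c(67, 58, 56, 94, 60)

# order() returns positions, not values
idx <- order(x)
print(idx)

# Indexing with those positions gives the sorted values
print(x[idx])

# The first and last positions agree with which.min() and which.max()
print(idx[1] == which.min(x))
print(idx[length(idx)] == which.max(x))
```

Indexing a vector with its own order() gives the same result as sort(). The advantage of order() is that the same positions can be reused to rearrange other vectors or the rows of a data frame.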
survey[order(survey$weight),]
     Gender birth_day birth_month birth_year height_cm weight_kg handness
st22 Female 13 10 1997 155 42.5 Right
st28 Female 7 7 1997 166 47.0 Right
st14 Female 3 7 1997 160 50.0 Right
st36 Female 24 3 1998 167 50.0 Right
st27 Female 13 10 1997 171 52.0 Right
st40 Female 5 2 1998 157 53.0 Right
st42 Female 18 5 1997 165 54.0 Right
st47 Female 29 7 1997 160 54.0 Right
st16 Female 3 9 2018 164 55.0 Right
st23 Female 2 10 1998 172 55.0 Right
st33 Female 21 5 1998 168 55.0 Right
st34 Female 3 9 1998 174 55.0 Right
st37 Female 17 9 1998 173 55.0 Right
st43 Female 23 5 1999 178 55.0 Right
st3 Female 28 1 1995 170 56.0 Left
st7 Female 5 4 1996 173 56.0 Right
st13 Female 9 6 1998 158 57.0 Right
st2 Female 9 10 1995 167 58.0 Right
st26 Female 17 5 1998 165 58.0 Right
st38 Female 2 1 1999 162 58.0 Right
st11 Male 26 12 1997 176 59.0 Right
st5 Female 1 1 1991 160 60.0 Right
st25 Female 17 8 1998 163 60.0 Right
st35 Female 1 9 1998 174 60.0 Right
st49 Female 2 5 1999 165 63.0 Left
st46 Male 6 11 1998 163 64.0 Right
st20 Female 30 6 1997 158 65.0 Right
st29 Male 28 7 1998 185 65.0 Left
st45 Male 6 12 1997 166 65.0 Right
st1 Male 1 2 1993 179 67.0 Right
st30 Male 5 1 1997 178 67.0 Right
st18 Female 16 11 1998 163 68.0 Right
st19 Female 3 5 1998 162 68.0 Right
st31 Male 27 11 1997 180 68.0 Right
st24 Female 10 6 1998 159 69.0 Right
st12 Male 9 2 1997 183 70.0 Right
st44 Female 19 9 1997 174 72.0 Right
st32 Male 29 8 1998 170 74.0 Right
st8 Female 14 1 1997 162 75.0 Left
st39 Male 19 11 1998 175 75.0 Right
st50 Male 31 10 1998 184 75.0 Right
st21 Male 15 1 2018 175 76.0 Right
st6 Male 26 9 1996 175 77.0 Right
st51 Male 9 3 1996 177 77.0 Right
st15 Male 13 10 1998 182 78.0 Right
st9 Male 1 5 1997 173 80.0 Right
st41 Male 18 5 1997 181 81.0 Right
st48 Male 14 3 1993 195 85.0 Right
st4 Male 11 8 1992 180 94.0 Right
st10 Male 25 6 1997 188 105.0 Right
st17 Male 10 1 1998 175 106.0 Right
hand_span_cm
st22 20
st28 20
st14 15
st36 30
st27 25
st40 20
st42 18
st47 20
st16 20
st23 20
st33 14
st34 22
st37 8
st43 12
st3 18
st7 21
st13 19
st2 18
st26 19
st38 19
st11 24
st5 19
st25 15
st35 24
st49 17
st46 15
st20 8
st29 22
st45 15
st1 15
st30 24
st18 13
st19 13
st31 19
st24 18
st12 20
st44 16
st32 25
st8 18
st39 20
st50 22
st21 20
st6 18
st51 23
st15 21
st9 16
st41 20
st48 30
st4 25
st10 20
st17 15
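The same pattern can sort from largest to smallest, or break ties with a second column. The sketch below uses a small made-up data frame (toy and its values are hypothetical, not taken from the survey file).

```r
# Hypothetical toy data, not from the survey
toy <- data.frame(
  name   = c("a", "b", "c", "d"),
  weight = c(60, 55, 60, 50),
  height = c(170, 165, 160, 158)
)

# Heaviest first: ask order() for decreasing positions
by_weight_desc <- toy[order(toy$weight, decreasing = TRUE), ]
print(by_weight_desc)

# Ascending weight; rows with equal weight are ordered by height
by_weight_height <- toy[order(toy$weight, toy$height), ]
print(by_weight_height)
```

Passing several vectors to order() uses the later ones only to decide between rows that are tied on the earlier ones.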
library()
install.packages()
knitr: a package for Rmarkdown
Knitr is the system that merges R code and Markdown to produce documents that depend on data
It has many functions. We used two of them:
knitr::kable() is a function to produce nicer tables
(an alternative is pander() from the pander package)
knitr::opts_chunk$set() to set the default options for each chunk
Without kable()
survey[1:5,]
Gender birth_day birth_month birth_year height_cm weight_kg handness
st1 Male 1 2 1993 179 67 Right
st2 Female 9 10 1995 167 58 Right
st3 Female 28 1 1995 170 56 Left
st4 Male 11 8 1992 180 94 Right
st5 Female 1 1 1991 160 60 Right
hand_span_cm
st1 15
st2 18
st3 18
st4 25
st5 19
With kable()
library(knitr)
kable(survey[1:5,])
| | Gender | birth_day | birth_month | birth_year | height_cm | weight_kg | handness | hand_span_cm |
|---|---|---|---|---|---|---|---|---|
| st1 | Male | 1 | 2 | 1993 | 179 | 67 | Right | 15 |
| st2 | Female | 9 | 10 | 1995 | 167 | 58 | Right | 18 |
| st3 | Female | 28 | 1 | 1995 | 170 | 56 | Left | 18 |
| st4 | Male | 11 | 8 | 1992 | 180 | 94 | Right | 25 |
| st5 | Female | 1 | 1 | 1991 | 160 | 60 | Right | 19 |
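The two knitr functions mentioned here can be tried outside a document as well. This is a minimal sketch on a made-up two-row data frame (toy is hypothetical). It assumes the knitr package is installed and recent enough to accept format = "pipe".

```r
library(knitr)

# Hypothetical two-row data frame with row names, like the survey
toy <- data.frame(
  height_cm = c(179, 167),
  weight_kg = c(67, 58),
  row.names = c("st1", "st2")
)

# kable() returns the table as lines of Markdown text
tab <- kable(toy, format = "pipe")
print(tab)

# Set a default chunk option, then read it back
opts_chunk$set(echo = FALSE)
print(opts_chunk$get("echo"))
```

In an Rmarkdown document, the opts_chunk$set() call usually goes in the first chunk so that it applies to all the chunks that follow.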
Load library(knitr) before using any function of the package
X: drive (when using lab computers)
Using the data from the exam
world
  income population    area
1   1810   31700000  653000
2  10500    2920000   28800
3  13300   38300000 2380000
4   6190   26000000 1250000
5  18900      97800     440
6  19500   42500000 2780000
value column
     variable    value
1      income     1810
2      income    10500
3      income    13300
4      income     6190
5      income    18900
6      income    19500
7  population 31700000
8  population  2920000
9  population 38300000
10 population 26000000
11 population    97800
12 population 42500000
13       area   653000
14       area    28800
15       area  2380000
16       area  1250000
17       area      440
18       area  2780000
We use the reshape2 library
melt takes wide-format data and melts it into long-format data.
cast takes long-format data and casts it into wide-format data.
Think of working with metal: you melt it, then cast it into a new shape
library(reshape2)
melt(world, id=NULL)
     variable    value
1      income     1810
2      income    10500
3      income    13300
4      income     6190
5      income    18900
6      income    19500
7  population 31700000
8  population  2920000
9  population 38300000
10 population 26000000
11 population    97800
12 population 42500000
13       area   653000
14       area    28800
15       area  2380000
16       area  1250000
17       area      440
18       area  2780000
Consider this case
countries
              country income population    area
1         Afghanistan   1810   31700000  653000
2             Albania  10500    2920000   28800
3             Algeria  13300   38300000 2380000
4              Angola   6190   26000000 1250000
5 Antigua and Barbuda  18900      97800     440
6           Argentina  19500   42500000 2780000
country is the identifier
library(reshape2)
melt(countries, id="country")
               country   variable    value
1          Afghanistan     income     1810
2              Albania     income    10500
3              Algeria     income    13300
4               Angola     income     6190
5  Antigua and Barbuda     income    18900
6            Argentina     income    19500
7          Afghanistan population 31700000
8              Albania population  2920000
9              Algeria population 38300000
10              Angola population 26000000
11 Antigua and Barbuda population    97800
12           Argentina population 42500000
13         Afghanistan       area   653000
14             Albania       area    28800
15             Algeria       area  2380000
16              Angola       area  1250000
17 Antigua and Barbuda       area      440
18           Argentina       area  2780000
reshape2 has several cast functions, for different structures
dcast for data frames
acast for vector, matrix, or array
long <- melt(countries, id="country")
long
               country   variable    value
1          Afghanistan     income     1810
2              Albania     income    10500
3              Algeria     income    13300
4               Angola     income     6190
5  Antigua and Barbuda     income    18900
6            Argentina     income    19500
7          Afghanistan population 31700000
8              Albania population  2920000
9              Algeria population 38300000
10              Angola population 26000000
11 Antigua and Barbuda population    97800
12           Argentina population 42500000
13         Afghanistan       area   653000
14             Albania       area    28800
15             Algeria       area  2380000
16              Angola       area  1250000
17 Antigua and Barbuda       area      440
18           Argentina       area  2780000
dcast(long, country~variable)
              country income population    area
1         Afghanistan   1810   31700000  653000
2             Albania  10500    2920000   28800
3             Algeria  13300   38300000 2380000
4              Angola   6190   26000000 1250000
5 Antigua and Barbuda  18900      97800     440
6           Argentina  19500   42500000 2780000
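The round trip from wide to long and back can be checked on a small example. This sketch rebuilds the first three rows of the countries table by hand and assumes the reshape2 package is installed. It spells out id.vars and value.var in full, which is optional but makes the intent explicit.

```r
library(reshape2)

# First three rows of the countries table, typed in by hand
countries <- data.frame(
  country    = c("Afghanistan", "Albania", "Algeria"),
  income     = c(1810, 10500, 13300),
  population = c(31700000, 2920000, 38300000),
  area       = c(653000, 28800, 2380000)
)

# Wide to long: one row per (country, variable) pair
long <- melt(countries, id.vars = "country")
print(long)

# Long to wide again, as a data frame
wide <- dcast(long, country ~ variable, value.var = "value")
print(wide)

# Same reshaping, but the result is a matrix with row and column names
mat <- acast(long, country ~ variable, value.var = "value")
print(mat)
```

In the formula country ~ variable, the left side names what identifies each row and the right side names the column whose values become the new column names.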
So far all the files we have used are structured
That is, they have rows and columns
We use read.table and write.table to read and write a data frame
Sometimes the data is not a table
people <- list(Ali=list(age=18, sex='M'),
               Bahar=list(age=19, sex='F'),
               valid=c(TRUE,FALSE))
people
$Ali
$Ali$age
[1] 18

$Ali$sex
[1] "M"

$Bahar
$Bahar$age
[1] 19

$Bahar$sex
[1] "F"

$valid
[1]  TRUE FALSE
How can we read and write lists?
There are several options to store lists into files.
A good one is YAML, which looks like this:
Ali:
  age: 18.0
  sex: M
Bahar:
  age: 19.0
  sex: F
valid:
- yes
- no
Write --- before and after the YAML code
Google “YAML” for more info
We use YAML for the Rmarkdown metadata. For example
---
title: "Midterm Exam"
subtitle: "Computing in Molecular Biology 1"
author: "Put your name here"
number: STUDENT_NUMBER
date: "October 25, 2018"
output: html_document
---
library(yaml)
write_yaml(people, "datafile.yml")
persons <- read_yaml("datafile.yml")
persons
$Ali
$Ali$age
[1] 18

$Ali$sex
[1] "M"

$Bahar
$Bahar$age
[1] 19

$Bahar$sex
[1] "F"

$valid
[1]  TRUE FALSE
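To look at the YAML text without opening the file, the yaml package also converts to and from character strings. This is a minimal sketch assuming the yaml package is installed. It writes to a temporary file (the path variable is only for illustration) instead of a fixed file name.

```r
library(yaml)

people <- list(Ali   = list(age = 18, sex = "M"),
               Bahar = list(age = 19, sex = "F"),
               valid = c(TRUE, FALSE))

# Convert the list to YAML text and show it
text <- as.yaml(people)
cat(text)

# Parse the text back into a list
back <- yaml.load(text)

# The same round trip through a file
path <- tempfile(fileext = ".yml")
write_yaml(people, path)
from_file <- read_yaml(path)
print(from_file$Bahar$sex)
```

Both routes should give back a list with the same names and values as the original.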
references:
- type: article-journal
  id: WatsonCrick1953
  title: 'Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid'
  author:
  - family: Watson
    given: J. D.
  - family: Crick
    given: F. H. C.
  container-title: Nature
  volume: 171
  issue: 4356
  page: 737-738
  issued:
    date-parts:
    - - 1953
      - 4
      - 25
Put all the references somewhere in the document, with --- before and after.
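Since the references are ordinary YAML, they can be read into R like any other list. The sketch below parses a shortened copy of the entry (only some fields are kept) and assumes the yaml package is installed. The names refs_text, refs, first, and families are chosen only for illustration.

```r
library(yaml)

# Shortened copy of the reference entry, as YAML text
refs_text <- "
references:
- type: article-journal
  id: WatsonCrick1953
  author:
  - family: Watson
    given: J. D.
  - family: Crick
    given: F. H. C.
  container-title: Nature
  volume: 171
"

refs <- yaml.load(refs_text)

# Each reference is one element of the list
first <- refs$references[[1]]
print(first$id)

# Collect the family names of all the authors
families <- sapply(first$author, function(a) a$family)
print(families)
```

The id field is the key used in citations such as [@WatsonCrick1953].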
[@WatsonCrick1953] produces (Watson and Crick 1953)
[@WatsonCrick1953, pp. 33-35, 38-39] becomes (Watson and Crick 1953, 33–35, 38–39).
[@WatsonCrick1953; @Collado-Vides2009a] becomes (Watson and Crick 1953; Collado-Vides et al. 2009).
@WatsonCrick1953 [p. 33] says blah becomes Watson and Crick (1953, 33) says blah
If you have a long list of all papers, and you use it on several documents, then you should put the references in a separate file
Then you write
bibliography: references.yml
in the document metadata
Bibliographies will be placed at the end of the document. Normally, you will want to end your document like this:
last paragraph...

# References
The bibliography will be inserted after this header. More info at
http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html
Collado-Vides, J, H Salgado, E Morett, S Gama-Castro, V Jiménez-Jacinto, I Martínez-Flores, A Medina-Rivera, L Muñiz-Rascado, M Peralta-Gil, and A Santos-Zavaleta. 2009. “Bioinformatics Resources for the Study of Gene Regulation in Bacteria.” Journal of Bacteriology 191 (1): 23–31.
Watson, J. D., and F. H. C. Crick. 1953. “Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid.” Nature 171 (4356): 737–38. https://doi.org/10.1038/171737a0.