October 10th, 2016

Using R and RStudio

Analyzing Data

for fun and profit

Many disciplines, including Molecular Biology and Genetics, have become more and more data-driven.

Starting now, we will use R and RStudio, free software for data analysis

R is widely used by molecular biologists, and also by economists, psychologists and marketing specialists

How to use RStudio

You have to install R and RStudio on your computer

You have to start RStudio. Then:

  • We read data from one or more files
  • We transform this data according to a program we design
  • We write the results to new files
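
The three steps above can be sketched in a few lines. This is only an illustration: the file names, the column names and the conversion are made up for the example, and the input file is created on the spot so that the code runs by itself.

```r
# Hypothetical file names, placed in a temporary folder for this example
input_file  <- file.path(tempdir(), "river_miles.csv")
output_file <- file.path(tempdir(), "river_km.csv")

# Create a small input file so the example is self-contained
write.csv(data.frame(river = c("A", "B", "C"), miles = c(735, 320, 325)),
          input_file, row.names = FALSE)

# 1. Read data from a file
river_data <- read.csv(input_file)

# 2. Transform the data (here: convert miles to kilometers)
river_data$km <- river_data$miles * 1.609344

# 3. Write the results to a new file
write.csv(river_data, output_file, row.names = FALSE)
```

In real work the input file already exists, so only the three numbered steps are needed.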

Command line

RStudio, like almost all serious programs, is controlled from the keyboard

The mouse can be used for some shortcuts, but the real deal is the keyboard

A goal of this course is to become comfortable with the keyboard

These tools are for people who read books and don’t watch TV

The keyboard

your real friend

Talking with the computer

R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

This > symbol is called the prompt

prompt [präm(p)t]

verb

  • Assist or encourage (a hesitating speaker) to say something: “And the picture?” he prompted.
  • Computing (of a computer) request input from (a user).

From “New Oxford American Dictionary”

An interactive session

  • The computer shows the prompt
  • You write some commands using the keyboard
  • You finish by pressing Enter or Return
  • The computer executes your commands
  • When the execution finishes you get a new prompt

and repeat

Tab is your friend

In RStudio you can press TAB and get superpowers!

  • The computer will propose alternatives depending on the context
  • You can select the right one using the arrow keys
  • If there is only one option then it is completed automatically
  • You write faster and make fewer mistakes

You can also repeat and edit previous commands using the arrow keys

You can delete the whole line using Escape

Learning a new Language

beyond English

Basic Rules of a Language

Each phrase in a program is imperative.

It involves nouns, verbs and adverbs

Today we will focus on nouns

The first verb we need today is assign, written <-

Data represent objects

  • We know that computers store numbers
  • The numbers represent other things
  • What they represent depends on the type of the object
  • How they are used depends on the structure of the object

Objects

Every object in R has 2 important properties:

Type
What it represents
Structure
How we can read and modify parts of it
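
Both properties can be inspected from the prompt. This is a small sketch using rivers and state.name, two datasets that come with R; the choice of functions is ours and there are other ways to ask the same questions.

```r
# rivers and state.name are datasets included with R
class(rivers)       # type: what the values represent
class(state.name)
length(rivers)      # structure: how many elements there are
rivers[1:3]         # structure: we can read a part of the object
str(state.name)     # a compact summary of type and structure
```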

Basic Objects

Nouns are names of objects

To handle objects we give them names

We “store” the objects in variables

If we don’t give a name to an object, it is lost forever
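
A short illustration of the difference. The name primes is invented for this example.

```r
c(2, 3, 5)             # the object is printed once and then it is lost
primes <- c(2, 3, 5)   # the object is stored under the name "primes"
primes                 # we can get it back whenever we want
```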

Vectors

The simplest objects in R

> rivers
  [1]  735  320  325  392  524  450 1459  135  465  600  330  336  280  315
 [15]  870  906  202  329  290 1000  600  505 1450  840 1243  890  350  407
 [29]  286  280  525  720  390  250  327  230  265  850  210  630  260  230
 [43]  360  730  600  306  390  420  291  710  340  217  281  352  259  250
 [57]  470  680  570  350  300  560  900  625  332 2348 1171 3710 2315 2533
 [71]  780  280  410  460  260  255  431  350  760  618  338  981 1306  500
 [85]  696  605  250  411 1054  735  233  435  490  310  460  383  375 1270
 [99]  545  445 1885  380  300  380  377  425  276  210  800  420  350  360
[113]  538 1100 1205  314  237  610  360  540 1038  424  310  300  444  301
[127]  268  620  215  652  900  525  246  360  529  500  720  270  430  671
[141] 1770

Vectors

  • Group of values, all with the same type
  • Basic types are
    • Character
    • Numeric
    • Logical
    • Factor

Factors

Also known as categorical variables.

They are used for discrete values, for example when there is no natural order

  • Color
  • Gender/Sex
  • Country of Origin

These are variables that you would never average

Example: character vector

US States

> state.name
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
[49] "Wisconsin"      "Wyoming"       

Example: character vector

US States

> state.abb
 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN"
[15] "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV"
[29] "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN"
[43] "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"

Example: numeric vector

US States

> state.area
 [1]  51609 589757 113909  53104 158693 104247   5009   2057  58560  58876
[11]   6450  83557  56400  36291  56290  82264  40395  48523  33215  10577
[21]   8257  58216  84068  47716  69686 147138  77227 110540   9304   7836
[31] 121666  49576  52586  70665  41222  69919  96981  45333   1214  31055
[41]  77047  42244 267339  84916   9609  40815  68192  24181  56154  97914

Example: logical vector

US States

> state.area > 80000
 [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[12]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[23]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[34] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
[45] FALSE FALSE FALSE FALSE FALSE  TRUE

Example: factor vector

US States

> state.region
 [1] South         West          West          South         West         
 [6] West          Northeast     South         South         South        
[11] West          West          North Central North Central North Central
[16] North Central South         South         Northeast     South        
[21] Northeast     North Central North Central South         North Central
[26] West          North Central West          Northeast     Northeast    
[31] West          Northeast     South         North Central North Central
[36] South         West          Northeast     Northeast     South        
[41] North Central South         South         West          Northeast    
[46] South         West          South         North Central West         
Levels: Northeast South North Central West
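
With a factor we usually count how many elements fall in each category instead of averaging. A possible way to do it, using functions from base R:

```r
# state.region is a factor included with R
levels(state.region)    # the possible categories
nlevels(state.region)   # how many categories there are
table(state.region)     # how many states fall in each category
```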

Creating vectors

Simple concatenation

> c(1,2,3)
[1] 1 2 3
> c(10,20)
[1] 10 20

The function c() takes many values and makes a single vector

All values should be of the same type

Question for home: what happens if they have different types?

Storing vectors in variables

> x <- c(1,2,3)
> y <- c(10,20)

We use the <- operator for assignment.

> x
[1] 1 2 3
> y
[1] 10 20

Vectors can also be concatenated

x and y are two numeric vectors. We can concatenate them

> c(x, y, 5)
[1]  1  2  3 10 20  5

Creating Logical Vectors

> c(TRUE, TRUE, FALSE, TRUE)
[1]  TRUE  TRUE FALSE  TRUE

We can also write c(T,T,F,T), but TRUE and FALSE are safer because T and F are ordinary variables that can be overwritten

Creating Logical Vectors

A comparison creates a logical vector

> weight <- c(60, 72, 57, 90, 95, 72)
> weight > 25
[1] TRUE TRUE TRUE TRUE TRUE TRUE
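
The threshold 25 gives TRUE for everybody. A sketch with a threshold that separates the values, plus two things we can do with the result. The name heavy is invented for this example.

```r
weight <- c(60, 72, 57, 90, 95, 72)
heavy <- weight > 72   # one TRUE or FALSE for each element
heavy                  # FALSE FALSE FALSE TRUE TRUE FALSE
sum(heavy)             # TRUE counts as 1, so this counts the TRUE values: 2
weight == 72           # equality is tested with two equal signs
```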

Character vectors

Same idea. Concatenation

Each element must be enclosed in single or double quotes

> c("alpha", 'beta', "gamma")
[1] "alpha" "beta"  "gamma"
> c('he said "yes"', "I don't know")
[1] "he said \"yes\"" "I don't know"   

Special characters are coded with two symbols: \", \\, \n, \t
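
The two symbols are only the way we write the character. Inside the computer it is a single character. The following sketch shows the difference between the written form and the real text; the variable name phrase is invented for the example.

```r
phrase <- 'he said "yes"'
print(phrase)    # shows the written form, with \" inside
cat(phrase, "\n")  # shows the real text, without backslashes
nchar("\n")      # 1: a special character counts as one character
cat("name\tweight\nAli\t60\n")  # \t is a tab, \n starts a new line
```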

Factor vectors

Easy. Any character vector can be transformed into a factor

> chr.vector <- c("female", "male", "male", "female", "male", "male", "female", "female")
> chr.vector
[1] "female" "male"   "male"   "female" "male"   "male"   "female" "female"
> fact.vector <- factor(chr.vector)
> fact.vector
[1] female male   male   female male   male   female female
Levels: female male
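
By default the levels are sorted alphabetically. If we want another order we can give it explicitly. This is a sketch with the same data; the name fact.vector2 is invented for the example.

```r
chr.vector <- c("female", "male", "male", "female",
                "male", "male", "female", "female")
# Choose the order of the levels ourselves
fact.vector2 <- factor(chr.vector, levels = c("male", "female"))
levels(fact.vector2)      # "male" "female"
as.integer(fact.vector2)  # each value is stored as the number of its level
table(fact.vector2)       # how many times each level appears
```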

Sequences

> 4:9
[1] 4 5 6 7 8 9
> seq(4,9)
[1] 4 5 6 7 8 9
> seq(4,10,2)
[1]  4  6  8 10
> seq(from=4, by=2, length.out=4)
[1]  4  6  8 10
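
A few more variations, as a sketch. All of them use only the : operator and the seq() function shown above.

```r
9:4                         # sequences can also go down: 9 8 7 6 5 4
seq(0, 1, by = 0.25)        # the step does not have to be an integer
seq(0, 1, length.out = 5)   # or we ask for a number of elements instead
```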

Repetitions

> rep(1,3)
[1] 1 1 1
> rep(c(7,9,13), 3)
[1]  7  9 13  7  9 13  7  9 13
> rep(c(7,9,13), 1:3)
[1]  7  9  9 13 13 13

Repetitions

> rep(1:2,c(10,5))
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
> rep(c(TRUE,FALSE),3)
[1]  TRUE FALSE  TRUE FALSE  TRUE FALSE
> rep(c(TRUE,FALSE),c(3,3))
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
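
The second argument of rep() is called times. Writing the name makes the command easier to read, and there is also an argument called each. A small sketch:

```r
rep(c(7, 9, 13), times = 2)         # the whole vector twice: 7 9 13 7 9 13
rep(c(7, 9, 13), each = 2)          # each element twice: 7 7 9 9 13 13
rep(c("a", "b"), times = c(2, 3))   # works for any type: a a b b b
```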

Missing data

  • In practice there are cases when a datum is not present
  • It is not a good idea to use a fictitious value
  • The symbol NA is used in that case
  • You can use it on any vector, regardless of type

> c(NA,TRUE, FALSE)
[1]    NA  TRUE FALSE
> c(NA,1,2)
[1] NA  1  2
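
Missing values need some care later on. The following sketch shows how to find them and how one function, mean(), can be told to skip them. The vector measured is invented for the example.

```r
measured <- c(60, NA, 57)
is.na(measured)               # FALSE TRUE FALSE: where the missing values are
mean(measured)                # NA: the average of something unknown is unknown
mean(measured, na.rm = TRUE)  # 58.5: the average of the values we do have
measured == NA                # NA NA NA: use is.na() instead of ==
```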

Names of elements

Every element can have a name

> weight <- c(Ali=60, Deniz=72, Fatma=57, Emre=90, Volkan=95, Onur=72)
> names(weight)
[1] "Ali"    "Deniz"  "Fatma"  "Emre"   "Volkan" "Onur"  
> height <- c(1.75,1.80,1.65,1.90,1.74, 1.91)
> names(height) <- names(weight)
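
After the last command height has the same names as weight. A sketch to check it, which also shows unname(), a base R function that returns a copy without names:

```r
weight <- c(Ali=60, Deniz=72, Fatma=57, Emre=90, Volkan=95, Onur=72)
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
names(height) <- names(weight)
height           # now each value is printed under its name
unname(height)   # the same values without names
```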

Accessing elements

To get the i-th element of a vector v we use v[i]

> weight[3]
Fatma 
   57 
> weight
   Ali  Deniz  Fatma   Emre Volkan   Onur 
    60     72     57     90     95     72 

The index can be a numeric vector

> weight[c(1,3,5)]
   Ali  Fatma Volkan 
    60     57     95 
> weight[2:4]
Deniz Fatma  Emre 
   72    57    90 
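
The same notation is used to modify parts of a vector: we put the indexed vector on the left side of the assignment. This sketch works on a copy, called w here, so that weight keeps its original values.

```r
weight <- c(Ali=60, Deniz=72, Fatma=57, Emre=90, Volkan=95, Onur=72)
w <- weight                 # a copy to experiment with
w[3] <- 58                  # change one element
w[c(1, 2)] <- c(61, 73)     # change several elements at once
w
w[length(w)]                # the last element, whatever the length is
```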

Negative Indices

Negative indices indicate the elements to omit

> weight
   Ali  Deniz  Fatma   Emre Volkan   Onur 
    60     72     57     90     95     72 
> weight[c(-1,-3,-5)]
Deniz  Emre  Onur 
   72    90    72 

Useful when you need almost all elements

Logical Indices

A vector can be indexed by a logical vector

The logical vector must have the same length as the vector being indexed

> weight>72
   Ali  Deniz  Fatma   Emre Volkan   Onur 
 FALSE  FALSE  FALSE   TRUE   TRUE  FALSE 
> weight[weight>72]
  Emre Volkan 
    90     95 
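
Conditions can be combined with & (and) and | (or) before using them as an index. A sketch with the same data, which also uses which(), a base R function that gives the positions of the TRUE values:

```r
weight <- c(Ali=60, Deniz=72, Fatma=57, Emre=90, Volkan=95, Onur=72)
weight[weight > 60 & weight < 95]   # Deniz, Emre and Onur
which(weight > 72)                  # positions 4 and 5
names(weight)[weight > 72]          # "Emre" "Volkan"
```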

Names as Indices

If a vector has names, we can use them:

> weight[c("Deniz", "Volkan", "Fatma")]
 Deniz Volkan  Fatma 
    72     95     57 

How do we know if a vector has names? If there are none, names() returns NULL

> names(c(1,2,3))
NULL

About the homework

More information

read and learn it