Class 20: Factors

Computing in Molecular Biology and Genetics 1

Andrés Aravena, PhD

21 December 2020

Students data

Today we will use the clean version of our data

library(dplyr)
library(readr)
students <- read_tsv("students2018-2020-tidy.tsv")
colnames(students)
 [1] "answer_date"   "id"            "english_level" "sex"          
 [5] "birthdate"     "birthplace"    "height_cm"     "weight_kg"    
 [9] "handedness"    "hand_span"    

There are 10 columns

Column birthplace was a problem

We got different names for the same city

It took a lot of work to correct it

It would be better to enforce using only one standard name for each city

That is, to use a controlled vocabulary

Clean birthplace

students$birthplace
 [1] "-/Turkey"             "Kahramanmaraş/Turkey" "Batman/Turkey"       
 [4] "Antalya/Turkey"       "Izmir/Turkey"         "Yalova/Turkey"       
 [7] "Adıyaman/Turkey"      "Bursa/Turkey"         "Istanbul/Turkey"     
[10] "Istanbul/Turkey"      "Van/Turkey"           NA                    
[13] NA                     "Istanbul/Turkey"      "Istanbul/Turkey"     
[16] "Samsun/Turkey"        "Mardin/Turkey"        "Gaziantep/Turkey"    
[19] "Istanbul/Turkey"      "Bursa/Turkey"         "Istanbul/Turkey"     
[22] "Bursa/Turkey"         "Yalova/Turkey"        "Ordu/Turkey"         
[25] "Istanbul/Turkey"      "Istanbul/Turkey"      "Edirne/Turkey"       
[28] "Malatya/Turkey"       NA                     "Hatay/Turkey"        

(For this class we show only the first 30 values)

Factors enforce a controlled vocabulary

There is another data type in R, called factors

They are also known as categorical variables

They are used for discrete values, for example when there is no natural order

  • Color
  • Sex
  • Country of Origin

These are variables that you would never average

Factors can take values only from a list of valid levels
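
As a minimal sketch of what "only valid levels" means, the example below uses a made-up vector cities and a made-up vocabulary valid_cities (neither is part of the students data). Any value that is not in the list of levels is stored as NA.

```r
# Hypothetical answers typed by students; two of them are not standard names
cities <- c("Istanbul", "istanbul", "Ankara", "Ist.")

# Hypothetical controlled vocabulary
valid_cities <- c("Ankara", "Istanbul")

city_factor <- factor(cities, levels = valid_cities)

# Values outside the vocabulary become NA
is.na(city_factor)   # FALSE TRUE FALSE TRUE
```

Counting the NA values is one simple way to find the answers that still need cleaning.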

Factor vectors

To make a factor we start with a character vector

factor(students$birthplace)
 [1] -/Turkey             Kahramanmaraş/Turkey Batman/Turkey       
 [4] Antalya/Turkey       Izmir/Turkey         Yalova/Turkey       
 [7] Adıyaman/Turkey      Bursa/Turkey         Istanbul/Turkey     
[10] Istanbul/Turkey      Van/Turkey           <NA>                
[13] <NA>                 Istanbul/Turkey      Istanbul/Turkey     
[16] Samsun/Turkey        Mardin/Turkey        Gaziantep/Turkey    
[19] Istanbul/Turkey      Bursa/Turkey         Istanbul/Turkey     
[22] Bursa/Turkey         Yalova/Turkey        Ordu/Turkey         
[25] Istanbul/Turkey      Istanbul/Turkey      Edirne/Turkey       
[28] Malatya/Turkey       <NA>                 Hatay/Turkey        
39 Levels: -/Azerbaijan -/Syria -/Turkey -/Turkmenistan ... Yalova/Turkey

Notice that there are no " marks,
and there is a line describing the levels
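
Besides looking at the printed output, we can ask R directly whether a vector is text or a factor. This is a small sketch with a made-up vector places, not the students data.

```r
# Made-up example vector
places <- c("Bursa/Turkey", "Istanbul/Turkey", "Bursa/Turkey")
place_f <- factor(places)

class(places)        # "character"
class(place_f)       # "factor"
is.factor(place_f)   # TRUE
```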

Let’s add a new column

To see the difference between text and factor, we will add a new column called place_factor

students$place_factor <- factor(students$birthplace)
colnames(students)
 [1] "answer_date"   "id"            "english_level" "sex"          
 [5] "birthdate"     "birthplace"    "height_cm"     "weight_kg"    
 [9] "handedness"    "hand_span"     "place_factor" 

They have different summary()

Let’s compare text and factor vectors with the same data

students %>% select(birthplace, place_factor) %>% summary()
  birthplace                 place_factor
 Length:117         Istanbul/Turkey:35   
 Class :character   Bursa/Turkey   : 7   
 Mode  :character   Tekirdağ/Turkey: 4   
                    Yalova/Turkey  : 4   
                    -/Turkey       : 3   
                    (Other)        :57   
                    NA's           : 7   

In this case factors are more useful
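
The same difference can be seen in a small sketch that does not need the students file. The vector hand is made up for illustration.

```r
# Made-up answers, not taken from the students data
hand <- c("Right", "Left", "Right", "Right")

summary(hand)           # only reports length, class and mode
summary(factor(hand))   # counts how many times each level appears
```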

Factors will be important in the following classes

Literal meaning of “factor”

Factor is a Latin word, from facere (to do)

A factor is someone doing an action

More generally, a factor is something that has an effect on another thing

Factors are essential in biology

The name “factor” was first used by plant researchers to describe the things that affect the growth of plants

  • Water
  • Sun
  • Soil

Levels

Let’s look again at the last line printed for place_factor

 [1] -/Turkey             Kahramanmaraş/Turkey Batman/Turkey       
 [4] Antalya/Turkey       Izmir/Turkey         Yalova/Turkey       
 [7] Adıyaman/Turkey      Bursa/Turkey         Istanbul/Turkey     
[10] Istanbul/Turkey      Van/Turkey           <NA>                
[13] <NA>                 Istanbul/Turkey      Istanbul/Turkey     
[16] Samsun/Turkey        Mardin/Turkey        Gaziantep/Turkey    
[19] Istanbul/Turkey      Bursa/Turkey         Istanbul/Turkey     
39 Levels: -/Azerbaijan -/Syria -/Turkey -/Turkmenistan ... Yalova/Turkey

Printing a factor shows which values are valid

The valid values are called levels

Asking for the levels

We can ask for the levels of a factor

levels(students$place_factor)
 [1] "-/Azerbaijan"          "-/Syria"               "-/Turkey"             
 [4] "-/Turkmenistan"        "Adana/Turkey"          "Adıyaman/Turkey"      
 [7] "Afyonkarahisar/Turkey" "Aleppo/Syria"          "Almaty/Kazakhstan"    
[10] "Ankara/Turkey"         "Antalya/Turkey"        "Aydın/Turkey"         
[13] "Balıkesir/Turkey"      "Batman/Turkey"         "Bursa/Turkey"         
[16] "Çorum/Turkey"          "Edirne/Turkey"         "Gaziantep/Turkey"     
[19] "Hannover/Germany"      "Hatay/Turkey"          "Istanbul/Turkey"      
[22] "Izmir/Turkey"          "Kahramanmaraş/Turkey"  "Karabük/Turkey"       
[25] "Kırklareli/Turkey"     "Konya/Turkey"          "Malatya/Turkey"       
[28] "Manisa/Turkey"         "Mardin/Turkey"         "Mersin/Turkey"        
[31] "Muğla/Turkey"          "Nakhchivan/Azerbaijan" "Ordu/Turkey"          
[34] "Samsun/Turkey"         "Sivas/Turkey"          "Tekirdağ/Turkey"      
[37] "Tunceli/Turkey"        "Van/Turkey"            "Yalova/Turkey"        

This is a character vector

By default the levels are all the distinct values of the vector, sorted alphabetically
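
A minimal sketch of the default behavior, using a made-up vector fruit. The function nlevels() tells us how many levels there are.

```r
# Made-up vector with repeated values in no particular order
fruit <- c("pear", "apple", "pear", "fig", "apple")
fruit_f <- factor(fruit)

levels(fruit_f)    # the distinct values, sorted: "apple" "fig" "pear"
nlevels(fruit_f)   # 3
```

The position of non-English letters in the sorted list depends on the language settings (locale) of the computer.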

Choosing the valid levels

We can decide the levels when we create the factor

colors <- factor(c("black", "black","black"),
                 levels=c("black", "blue", "white"))
colors
[1] black black black
Levels: black blue white

In this case we know all possible levels, even if not all are present in the character vector
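
One place where the unused levels matter is when we count. As a sketch, table() reports every level, including the ones with zero cases.

```r
colors <- factor(c("black", "black", "black"),
                 levels = c("black", "blue", "white"))

# table() counts every level, even the ones that never appear
table(colors)
```

With a character vector, table() would show only "black", because it cannot know about the other options.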

Assigning new names to the levels

We can give new names to the levels

levels(colors) <- c("siyah", "mavi", "beyaz")
colors
[1] siyah siyah siyah
Levels: siyah mavi beyaz

The factor is the same; we only change the names of the levels
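
The sketch below checks this on a slightly different made-up vector. The new names are matched to the old ones by position, and the numbers stored inside the factor do not change.

```r
colors <- factor(c("black", "white", "black"),
                 levels = c("black", "blue", "white"))
codes_before <- as.integer(colors)

# New names replace the old ones by position, so the order matters
levels(colors) <- c("siyah", "mavi", "beyaz")

as.character(colors)                          # "siyah" "beyaz" "siyah"
identical(codes_before, as.integer(colors))   # TRUE
```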

Levels have an order

students$english_factor <- factor(students$english_level)
summary(students$english_factor)
             English is my native language 
                                         4 
I can read and understand technical papers 
                                        56 
                      I can speak fluently 
                                        18 
 I can understand movies without subtitles 
                                        26 
I can write poetry better than Shakespeare 
                                         1 
                      İngilizce bilmiyorum 
                                        12 

The result is not ordered by “level of knowledge”.
We do not want alphabetic order in this case.

Changing the order of levels

We can re-code the factor levels in the order we want

students$english_factor <- factor(students$english_factor,
    levels=c("İngilizce bilmiyorum",
        "I can read and understand technical papers",
        "I can understand movies without subtitles", 
        "I can speak fluently",  
        "English is my native language", 
        "I can write poetry better than Shakespeare"))
summary(students$english_factor)
                      İngilizce bilmiyorum 
                                        12 
I can read and understand technical papers 
                                        56 
 I can understand movies without subtitles 
                                        26 
                      I can speak fluently 
                                        18 
             English is my native language 
                                         4 
I can write poetry better than Shakespeare 
                                         1 
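
It is easy to confuse changing the order of the levels with changing their names. The sketch below uses a made-up factor size to show the difference: factor(x, levels = ...) reorders and keeps every value, while levels(x) <- ... renames by position and can silently swap the labels.

```r
# Made-up answers, not the students data
size <- factor(c("small", "large", "medium", "small"))
levels(size)                  # default order: "large" "medium" "small"

# Reordering: every value stays as it was
size_ordered <- factor(size, levels = c("small", "medium", "large"))
levels(size_ordered)          # "small" "medium" "large"
as.character(size_ordered)    # "small" "large" "medium" "small"

# Renaming by position: the labels get swapped, which is not what we want here
size_renamed <- size
levels(size_renamed) <- c("small", "medium", "large")
as.character(size_renamed)    # "large" "small" "medium" "large"
```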

When should we use text

Text is not very efficient

"123" stored as text takes more memory than the number 123

We use text when there is no better option

It is used for data that does not repeat a lot

  • Names of people
  • Sample ID
  • Student number
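
To see the memory cost of text, we can compare the same values stored as numbers and as text. This is only a sketch with made-up values; object.size() gives an estimate in bytes, and the exact numbers depend on the version of R and on the computer.

```r
numbers <- c(123, 456, 789)
as_text <- as.character(numbers)   # "123" "456" "789"

object.size(numbers)   # memory used by the numeric vector
object.size(as_text)   # memory used by the same values stored as text
```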

Factors use less memory

Inside the computer, factors are encoded as numbers

students$place_factor
 [1] -/Turkey             Kahramanmaraş/Turkey Batman/Turkey       
 [4] Antalya/Turkey       Izmir/Turkey         Yalova/Turkey       
 [7] Adıyaman/Turkey      Bursa/Turkey         Istanbul/Turkey     
[10] Istanbul/Turkey      Van/Turkey           <NA>                
[13] <NA>                 Istanbul/Turkey      Istanbul/Turkey     
[16] Samsun/Turkey        Mardin/Turkey        Gaziantep/Turkey    
[19] Istanbul/Turkey      Bursa/Turkey         Istanbul/Turkey     
39 Levels: -/Azerbaijan -/Syria -/Turkey -/Turkmenistan ... Yalova/Turkey
as.numeric(students$place_factor)
 [1]  3 23 14 11 22 39  6 15 21 21 38 NA NA 21 21 34 29 18 21 15 21
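
Each number is the position of the value in the list of levels. The sketch below checks this with a made-up vector places that repeats three cities many times, and then compares the memory used by the text and by the factor. The sizes reported by object.size() are estimates and depend on the version of R.

```r
# Made-up vector with many repeated values, not the students data
places <- rep(c("Istanbul/Turkey", "Bursa/Turkey", "Yalova/Turkey"), times = 1000)
place_f <- factor(places)

# Each code is the position of the value in levels(place_f)
codes <- as.integer(place_f)
head(codes)                                 # 2 1 3 2 1 3
identical(levels(place_f)[codes], places)   # TRUE: codes plus levels recover the text

# Memory used by the text version and by the factor version
object.size(places)
object.size(place_f)
```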

Classic R versus tidyverse

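Factors are so useful that, before R 4.0.0, classic R functions like read.table() produced factors instead of text by default

Since R 4.0.0 they produce text, unless we ask for factors with stringsAsFactors = TRUE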
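
A small sketch of the classic functions, assuming R 4.0.0 or later, where text columns are read as text unless we say otherwise. The data is made up and written inline with the text argument instead of being read from a file.

```r
# Made-up tab-separated data
tsv_text <- "id\tbirthplace\n1\tBursa/Turkey\n2\tIstanbul/Turkey\n3\tBursa/Turkey"

# Default in R 4.0.0 and later: text columns stay as text
as_text <- read.delim(text = tsv_text)
class(as_text$birthplace)      # "character" (assuming R >= 4.0.0)

# Asking explicitly for factors
as_factor <- read.delim(text = tsv_text, stringsAsFactors = TRUE)
class(as_factor$birthplace)    # "factor"
```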

In the tidyverse we can choose which columns are text and which ones are factors
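
One way to make this choice is the col_types argument of read_tsv(). The sketch below writes a made-up file to a temporary location; the column names and values are only for illustration.

```r
library(readr)

# Made-up tab-separated file, written to a temporary location
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("id\tbirthplace\tsex",
             "s1\tBursa/Turkey\tFemale",
             "s2\tIstanbul/Turkey\tMale",
             "s3\tBursa/Turkey\tFemale"),
           tsv_file)

# Choose the type of each column: id stays text, the other two become factors
small <- read_tsv(tsv_file,
                  col_types = cols(id = col_character(),
                                   birthplace = col_factor(),
                                   sex = col_factor()))

class(small$id)            # "character"
levels(small$birthplace)   # levels in order of first appearance in the file
```

Unlike factor(), col_factor() without a levels argument keeps the levels in the order they appear in the file, not in alphabetic order.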

Summary

  • We have seen three data types in R
    • Numeric
    • Logical
    • Text
  • Factors are a fourth data type
    • Do not show " when printed
    • Can take only some values
  • The levels are the only valid values