Class 3: Lists with names, and GC content

Computing for Molecular Biology 2

Andrés Aravena, PhD

19 March 2021

Answers to the quiz

What is the data type of this R value?

 [1] "M" "C" "B" "Z" "I" "D" "G" "W" "H" "O"
  • 9 character
  • 1 character vector
  • 1 c
  • 1 integer
  • 1 numeric
  • 1 text
  • 1 word

What is the data structure of this R value?

 [1] "M" "C" "B" "Z" "I" "D" "G" "W" "H" "O"
  • 6 vector
  • 2 list
  • 2 character
  • 1 string
  • 1 m
  • 1 hyerarchial
  • 1 I don’t know
  • 1 sorry, I don’t know

It is a vector of characters

c("M", "C", "B", "Z", "I", "D", "G", "W", "H", "O")
 [1] "M" "C" "B" "Z" "I" "D" "G" "W" "H" "O"

R has four basic data types: numeric, logical, character, factor.

Only characters use quotes "

R data structures include vectors and data frames, among others

Vectors show [1] to the left
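If in doubt, we can ask R itself. The sketch below uses the base functions class(), is.vector() and length() on the quiz value; the variable names are only for illustration.

```r
# Hypothetical variable name; the values are the ones from the quiz
letters_vec <- c("M", "C", "B", "Z", "I", "D", "G", "W", "H", "O")

type_of_value <- class(letters_vec)      # data type of the elements
is_a_vector   <- is.vector(letters_vec)  # TRUE when the structure is a vector
n_elements    <- length(letters_vec)     # how many elements it holds

print(type_of_value)
print(is_a_vector)
print(n_elements)
```

Here class() answers the "data type" question and is.vector() answers the "data structure" question.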

What is the data type of this R value?

 G  Z  P  I  A  V  D  Y  C  H 
 9  2  3 10  8  6  4  7  1  5 
  • 3 character and numeric
  • 3 logical
  • 1 numeric
  • 1 list
  • 1 integer
  • 1 factor (i am not sure)
  • 1 data frame
  • 1 column
  • 1 c1
  • 2 “sorry, i don’t know”

What is the data structure of this R value?

 G  Z  P  I  A  V  D  Y  C  H 
 9  2  3 10  8  6  4  7  1  5 
  • 3 data frame
  • 1 table
  • 3 matrix
  • 1 list
  • 1 factor
  • 1 data frame
  • 1 d4
  • 1 column
  • 1 character and numeric
  • 2 “sorry, i don’t know”

It is a named vector

b <- c(9, 2, 3, 10, 8, 6, 4, 7, 1, 5)
names(b) <- c("G", "Z", "P", "I", "A", "V", "D", "Y", "C", "H")
b
 G  Z  P  I  A  V  D  Y  C  H 
 9  2  3 10  8  6  4  7  1  5 

When vectors are named, they do not show [1]

Instead, they show each element’s name above its value
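Names are not only for display. As a small sketch, the same named vector can be indexed either by position or by name (the variable names by_position, by_name and several are made up for this example):

```r
b <- c(9, 2, 3, 10, 8, 6, 4, 7, 1, 5)
names(b) <- c("G", "Z", "P", "I", "A", "V", "D", "Y", "C", "H")

by_position <- b[4]            # fourth element; it keeps its name
by_name     <- b["I"]          # the same element, selected by its name
several     <- b[c("G", "H")]  # several names at once

print(by_position)
print(by_name)
print(several)
print(names(b))                # names() also reads the names back
```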

What is the value of x[2:4]?

x <- c("i", "g", "e", "j", "k", "y")

  • 2 "g" "e" "j"
  • 2 g, j
  • 1 [1] g e j k
  • 1 "i, g, e, j"
  • 1 ["g","e","j","k"]
  • 1 "g", "e"
  • 1 g, e, j
  • 1 g j
  • 1 "i", "g","e","j"
  • 1 "g", "j", "k"
  • 1 "e", "j"
  • 1 "g", "j"
  • 1 factors

Just in case

x <- c("i", "g", "e", "j", "k", "y")
x[2:4]
[1] "g" "e" "j"

x[2:4] takes every element from position 2 to position 4, both included

c(x[2], x[3], x[4])
[1] "g" "e" "j"

What is the value of a after all these steps?

a <- 2
b <- a
a <- 5
a <- a + b
  • 7 7
  • 5 10
  • 1 I dont know
  • 1 a <- 2+5
  • 1 -

How does it work

a <- 2       # a is 2
b <- a       # a is 2, b is 2
a <- 5       # a is 5, b is 2
a <- a + b   # a is 7, b is 2

Assignment copies the value and keeps the variables independent

The variables are not linked; after b <- a they just hold the same value
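The same rule holds for vectors. A minimal sketch, with made-up variable names original and copy:

```r
original <- c(1, 2, 3)
copy <- original   # copy now holds the same values as original
copy[1] <- 100     # changing copy...

print(original)    # ...leaves original untouched
print(copy)
```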

What is the result here?

v <- c(9, 4, 8, 1, 3, 2, 10)
sum(v>3)
  • 4 31
  • 2 33
  • 1 [1] 9 4 8 10
  • 4 9, 4, 8, 10
  • 1 (9, 4, 8, 10)
  • 1 "9", "4", "8", "10"
  • 1 4
  • 1 v <- c(4,8,9,10)- 7.75
  • 1 I dont know

This is important. We will use it

How many elements are greater than 3

v <- c(9, 4, 8, 1, 3, 2, 10)
v
[1]  9  4  8  1  3  2 10
v>3
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
sum(v>3)
[1] 4
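Why does this work? When we add logical values, each TRUE counts as 1 and each FALSE as 0. The sketch below separates the steps and also shows the expression that gives 31, a common answer in the quiz. The names is_big, how_many, fraction and total_of_big are only illustrative.

```r
v <- c(9, 4, 8, 1, 3, 2, 10)

is_big <- v > 3                 # logical vector, one value per element
how_many <- sum(is_big)         # counts the TRUE values
fraction <- mean(is_big)        # proportion of TRUE values
total_of_big <- sum(v[is_big])  # adds the selected values themselves

print(as.integer(is_big))
print(how_many)
print(fraction)
print(total_of_big)
```

So sum(v>3) counts elements, while sum(v[v>3]) adds them up.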

What is this?

[[1]]
[1] "e" "y" "l" "s" "t" "o"

[[2]]
[1] "O" "R" "K" "C" "U" "W" "P" "T"
  • 5 -
  • 3 I don’t know
  • 3 vectors
  • 1 sublist of list
  • 1 this is resuls of the some list vector.
  • 1 this is a data. data type is characters and data structure is vactor.
  • 1 I dont remember at all

This is a list

list(c("e", "y", "l", "s", "t", "o"),
     c("O", "R", "K", "C", "U", "W", "P", "T"))
[[1]]
[1] "e" "y" "l" "s" "t" "o"

[[2]]
[1] "O" "R" "K" "C" "U" "W" "P" "T"

If you see double brackets [[ ]], then you are looking at a list

Lists

Lists are like vectors, but they can mix different kinds of elements

people <- list(c(60, 72, 57, 90, 95, 72),
               c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91),
               c("Ali", "Deniz", "Fatma", "Emre",
                 "Volkan", "Onur"),
               TRUE, 
               factor(c("M","F","F","M","M","M")))

Notice that elements can have different lengths
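We can check this with a short sketch. It uses a shortened version of the list (three elements instead of five) and the base function sapply() to ask each element for its length and its class; the names small_people, element_sizes and element_types are made up.

```r
# Shortened, illustrative version of the list
small_people <- list(c(60, 72, 57, 90, 95, 72),
                     c("Ali", "Deniz", "Fatma", "Emre", "Volkan", "Onur"),
                     TRUE)

n_elements    <- length(small_people)          # number of list elements
element_sizes <- sapply(small_people, length)  # length of each element
element_types <- sapply(small_people, class)   # data type of each element

print(n_elements)
print(element_sizes)
print(element_types)
```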

Result

people
[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

[[3]]
[1] "Ali"    "Deniz"  "Fatma"  "Emre"   "Volkan" "Onur"  

[[4]]
[1] TRUE

[[5]]
[1] M F F M M M
Levels: F M

Visualization

Each list element starts with a number in double brackets

Inside each element, we can see vectors, lists or other things

When the element is a vector, we see a second number, in single brackets

[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

Indexing Lists

  • Lists can be indexed in the same way as vectors
  • The result is a sublist
people[1:2]
[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

Elements versus sublists

This is a sublist (with one element):

people[1]
[[1]]
[1] 60 72 57 90 95 72

This is an element:

people[[1]]
[1] 60 72 57 90 95 72
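The difference matters when we pass the result to another function. In this sketch (shortened list, made-up variable names) the sublist is still a list, while the element is a numeric vector that mean() can use directly.

```r
# Shortened, illustrative version of the list
small_people <- list(c(60, 72, 57, 90, 95, 72),
                     c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91))

sublist <- small_people[1]     # single brackets: a list with one element
element <- small_people[[1]]   # double brackets: the vector itself

print(class(sublist))
print(class(element))

mean_weight <- mean(element)   # numeric functions need the element
print(mean_weight)
```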

List elements can have names

people <- list(weight=c(60, 72, 57, 90, 95, 72),
               height=c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91),
               names=c("Ali", "Deniz", "Fatma", "Emre",
                       "Volkan", "Onur"),
               valid=TRUE,
               gender=factor(c("M","F","F","M","M","M")))

How else can we assign names?
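One possible answer, sketched on a shortened list: create the list without names and then assign them with names(), just as we did for named vectors. The list small_people and its values are only illustrative.

```r
# Shortened, illustrative list created without names
small_people <- list(c(60, 72, 57),
                     c(1.75, 1.80, 1.65),
                     c("Ali", "Deniz", "Fatma"))

# Assign all the names in one step
names(small_people) <- c("weight", "height", "names")

print(names(small_people))
print(small_people$weight)
```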

Lists with Names

people
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "Ali"    "Deniz"  "Fatma"  "Emre"   "Volkan" "Onur"  

$valid
[1] TRUE

$gender
[1] M F F M M M
Levels: F M

Indexing Lists with Names

  • Lists can be indexed in the same way as vectors
  • The result is a sublist
people[1:2]
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

Elements of Lists with Names

This is a sublist:

people[1]
$weight
[1] 60 72 57 90 95 72

This is an element:

people[[1]]
[1] 60 72 57 90 95 72

Accessing single elements

people[[1]]
[1] 60 72 57 90 95 72
people[["weight"]]
[1] 60 72 57 90 95 72

Shortcut to index a single element

people$weight
[1] 60 72 57 90 95 72

Changing parts of a List

Indices can also be used to change specific parts of a list.

For example, we can change the names to upper case

people$names <- toupper(people$names)
people$names
[1] "ALI"    "DENIZ"  "FATMA"  "EMRE"   "VOLKAN" "ONUR"  

Deleting list elements

people$valid <- NULL
people$YMD <- NULL
people
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "ALI"    "DENIZ"  "FATMA"  "EMRE"   "VOLKAN" "ONUR"  

$gender
[1] M F F M M M
Levels: F M

Adding new list elements

people$BMI <- people$weight/people$height^2
people
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "ALI"    "DENIZ"  "FATMA"  "EMRE"   "VOLKAN" "ONUR"  

$gender
[1] M F F M M M
Levels: F M

$BMI
[1] 19.59184 22.22222 20.93664 24.93075 31.37799 19.73630

Indexing Lists

  • List elements are indexed by [[]]
  • Sublists are indexed by []

Try these

people[[2]]
people[2]
people[[2]][3]
people[2][3]
people[[1:3]]
people[1:3]
people[["weight"]]
people$weight
people["weight"]

Result

people[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91
people[2]
$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

Result

people[[2]][3]
[1] 1.65
people[2][3]
$<NA>
NULL

Result

people[[1:3]]
Error in people[[1:3]]: recursive indexing failed at level 2
people[1:3]
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "ALI"    "DENIZ"  "FATMA"  "EMRE"   "VOLKAN" "ONUR"  

Result

people[["weight"]]
[1] 60 72 57 90 95 72
people$weight
[1] 60 72 57 90 95 72
people["weight"]
$weight
[1] 60 72 57 90 95 72

Reading FASTA files into lists

Read FASTA formatted files

library(seqinr)
proteins <- read.fasta("AP009180.faa", seqtype="AA", set.attributes = FALSE)
proteins[1:10]
$`lcl|AP009180.1_prot_BAF35032.1_1`
  [1] "M" "N" "T" "I" "F" "S" "R" "I" "T" "P" "L" "G" "N" "G" "T" "L"
 [17] "C" "V" "I" "R" "I" "S" "G" "K" "N" "V" "K" "F" "L" "I" "Q" "K"
 [33] "I" "V" "K" "K" "N" "I" "K" "E" "K" "I" "A" "T" "F" "S" "K" "L"
 [49] "F" "L" "D" "K" "E" "C" "V" "D" "Y" "A" "M" "I" "I" "F" "F" "K"
 [65] "K" "P" "N" "T" "F" "T" "G" "E" "D" "I" "I" "E" "F" "H" "I" "H"
 [81] "N" "N" "E" "T" "I" "V" "K" "K" "I" "I" "N" "Y" "L" "L" "L" "N"
 [97] "K" "A" "R" "F" "A" "K" "A" "G" "E" "F" "L" "E" "R" "R" "Y" "L"
[113] "N" "G" "K" "I" "S" "L" "I" "E" "C" "E" "L" "I" "N" "N" "K" "I"
[129] "L" "Y" "D" "N" "E" "N" "M" "F" "Q" "L" "T" "K" "N" "S" "E" "K"
[145] "K" "I" "F" "L" "C" "I" "I" "K" "N" "L" "K" "F" "K" "I" "N" "S"
[161] "L" "I" "I" "C" "I" "E" "I" "A" "N" "F" "N" "F" "S" "F" "F" "F"
[177] "F" "N" "D" "F" "L" "F" "I" "K" "Y" "T" "F" "K" "K" "L" "L" "K"
[193] "L" "L" "K" "I" "L" "I" "D" "K" "I" "T" "V" "I" "N" "Y" "L" "K"
[209] "K" "N" "F" "T" "I" "M" "I" "L" "G" "R" "R" "N" "V" "G" "K" "S"
[225] "T" "L" "F" "N" "K" "I" "C" "A" "Q" "Y" "D" "S" "I" "V" "T" "N"
[241] "I" "P" "G" "T" "T" "K" "N" "I" "I" "S" "K" "K" "I" "K" "I" "L"
[257] "S" "K" "K" "I" "K" "M" "M" "D" "T" "A" "G" "L" "K" "I" "R" "T"
[273] "K" "N" "L" "I" "E" "K" "I" "G" "I" "I" "K" "N" "I" "N" "K" "I"
[289] "Y" "Q" "G" "N" "L" "I" "L" "Y" "M" "I" "D" "K" "F" "N" "I" "K"
[305] "N" "I" "F" "F" "N" "I" "P" "I" "D" "F" "I" "D" "K" "I" "K" "L"
[321] "N" "E" "L" "I" "I" "L" "V" "N" "K" "S" "D" "I" "L" "G" "K" "E"
[337] "E" "G" "V" "F" "K" "I" "K" "N" "I" "L" "I" "I" "L" "I" "S" "S"
[353] "K" "N" "G" "T" "F" "I" "K" "N" "L" "K" "C" "F" "I" "N" "K" "I"
[369] "V" "D" "N" "K" "D" "F" "S" "K" "N" "N" "Y" "S" "D" "V" "K" "I"
[385] "L" "F" "N" "K" "F" "S" "F" "F" "Y" "K" "E" "F" "S" "C" "N" "Y"
[401] "D" "L" "V" "L" "S" "K" "L" "I" "D" "F" "Q" "K" "N" "I" "F" "K"
[417] "L" "T" "G" "N" "F" "T" "N" "K" "K" "I" "I" "N" "S" "C" "F" "R"
[433] "N" "F" "C" "I" "G" "K"

$`lcl|AP009180.1_prot_BAF35033.1_2`
  [1] "M" "N" "I" "F" "N" "I" "I" "I" "I" "G" "A" "G" "H" "S" "G" "I"
 [17] "E" "A" "A" "I" "S" "A" "S" "K" "I" "C" "N" "K" "I" "K" "I" "I"
 [33] "T" "S" "N" "L" "E" "N" "L" "G" "I" "M" "S" "C" "N" "P" "S" "I"
 [49] "G" "G" "I" "G" "K" "S" "H" "L" "V" "K" "E" "L" "E" "L" "F" "G"
 [65] "G" "I" "M" "P" "E" "A" "S" "D" "Y" "S" "R" "I" "H" "S" "K" "L"
 [81] "L" "N" "Y" "K" "K" "G" "E" "S" "V" "H" "S" "L" "R" "Y" "Q" "I"
 [97] "D" "R" "I" "L" "Y" "K" "N" "Y" "I" "L" "K" "I" "L" "F" "L" "K"
[113] "K" "N" "I" "L" "I" "E" "Q" "N" "E" "I" "N" "K" "I" "I" "R" "F"
[129] "K" "K" "K" "I" "L" "I" "F" "N" "K" "L" "K" "F" "F" "N" "I" "A"
[145] "K" "I" "I" "I" "V" "C" "A" "G" "T" "F" "I" "N" "S" "K" "I" "Y"
[161] "I" "G" "K" "N" "I" "K" "A" "L" "N" "K" "A" "E" "K" "K" "S" "I"
[177] "S" "Y" "S" "F" "K" "K" "I" "N" "L" "F" "I" "S" "K" "L" "K" "T"
[193] "G" "T" "P" "P" "R" "L" "D" "L" "N" "Y" "L" "N" "Y" "K" "K" "L"
[209] "S" "V" "Q" "Y" "S" "D" "Y" "T" "I" "S" "Y" "G" "K" "N" "F" "N"
[225] "F" "N" "N" "N" "V" "K" "C" "F" "I" "T" "N" "T" "D" "N" "K" "I"
[241] "N" "N" "F" "I" "K" "K" "N" "I" "K" "N" "S" "S" "L" "F" "N" "L"
[257] "K" "F" "K" "S" "I" "G" "P" "R" "Y" "C" "P" "S" "I" "E" "D" "K"
[273] "I" "F" "K" "F" "P" "N" "N" "K" "N" "H" "Q" "I" "F" "L" "E" "P"
[289] "E" "S" "Y" "F" "S" "K" "E" "I" "Y" "V" "N" "G" "L" "S" "N" "S"
[305] "L" "S" "Y" "N" "I" "Q" "K" "K" "L" "I" "K" "K" "I" "L" "G" "I"
[321] "K" "K" "S" "Y" "I" "I" "R" "Y" "A" "Y" "N" "I" "Q" "Y" "D" "Y"
[337] "F" "D" "P" "R" "C" "L" "K" "I" "S" "L" "N" "I" "K" "F" "A" "N"
[353] "N" "I" "F" "L" "A" "G" "Q" "I" "N" "G" "T" "T" "G" "Y" "E" "E"
[369] "A" "S" "S" "Q" "G" "F" "V" "A" "G" "I" "N" "S" "A" "R" "K" "I"
[385] "L" "K" "L" "P" "L" "W" "K" "P" "K" "K" "W" "N" "S" "Y" "I" "G"
[401] "V" "L" "L" "Y" "D" "L" "T" "N" "F" "G" "I" "Q" "E" "P" "Y" "R"
[417] "I" "F" "T" "S" "K" "S" "D" "N" "R" "L" "F" "L" "R" "F" "D" "N"
[433] "A" "I" "F" "R" "L" "I" "N" "I" "S" "Y" "Y" "L" "G" "C" "L" "P"
[449] "I" "V" "K" "F" "K" "Y" "Y" "N" "S" "L" "I" "Y" "K" "F" "Y" "K"
[465] "N" "L" "I" "N" "I" "R" "K" "I" "K" "L" "F" "D" "N" "F" "Y" "L"
[481] "F" "K" "L" "I" "I" "I" "M" "S" "K" "Y" "Y" "G" "Y" "I" "K" "K"
[497] "K" "Y" "F" "K"

$`lcl|AP009180.1_prot_BAF35034.1_3`
  [1] "M" "V" "I" "L" "K" "K" "N" "I" "L" "N" "N" "F" "L" "N" "F" "K"
 [17] "I" "I" "D" "L" "N" "L" "I" "I" "L" "L" "L" "F" "I" "H" "L" "I"
 [33] "V" "F" "Y" "L" "L" "K" "N" "N" "N" "L" "M" "I" "L" "L" "S" "I"
 [49] "Y" "L" "N" "N" "F" "I" "K" "N" "S" "I" "N" "L" "N" "S" "R" "N"
 [65] "I" "I" "F" "F" "F" "S" "L" "V" "L" "F" "N" "I" "I" "L" "F" "S"
 [81] "N" "F" "I" "D" "L" "F" "P" "N" "N" "L" "I" "K" "N" "F" "L" "N"
 [97] "L" "K" "Q" "I" "E" "I" "V" "P" "T" "S" "N" "I" "N" "I" "T" "F"
[113] "C" "F" "S" "I" "I" "S" "F" "L" "I" "I" "I" "M" "L" "T" "H" "K"
[129] "K" "I" "G" "F" "K" "K" "Y" "I" "Y" "S" "F" "F" "I" "Y" "P" "I"
[145] "N" "T" "E" "Y" "L" "Y" "L" "F" "N" "F" "I" "I" "E" "S" "I" "S"
[161] "Y" "I" "M" "K" "P" "I" "S" "L" "S" "L" "R" "L" "F" "G" "N" "I"
[177] "F" "S" "S" "E" "I" "I" "F" "N" "I" "I" "N" "N" "M" "N" "V" "F"
[193] "I" "N" "S" "F" "L" "N" "L" "I" "W" "G" "I" "F" "H" "F" "I" "I"
[209] "L" "P" "L" "Q" "S" "F" "I" "F" "I" "T" "L" "V" "I" "I" "Y" "V"
[225] "S" "Q" "T" "L" "N" "H"

$`lcl|AP009180.1_prot_BAF35035.1_4`
 [1] "M" "N" "N" "L" "L" "I" "L" "S" "S" "S" "I" "M" "I" "G" "L" "S"
[17] "S" "I" "G" "T" "G" "I" "G" "F" "G" "I" "L" "G" "G" "K" "L" "L"
[33] "D" "S" "I" "S" "R" "Q" "P" "E" "L" "D" "N" "L" "L" "L" "T" "R"
[49] "T" "F" "L" "M" "T" "G" "L" "L" "D" "A" "I" "P" "M" "I" "S" "V"
[65] "G" "I" "G" "L" "Y" "L" "I" "F" "V" "L" "S" "N" "K"

$`lcl|AP009180.1_prot_BAF35036.1_5`
  [1] "M" "N" "F" "N" "Y" "T" "I" "I" "N" "E" "F" "V" "S" "F" "L" "I"
 [17] "F" "F" "Y" "V" "S" "F" "K" "I" "I" "F" "P" "V" "I" "L" "K" "K"
 [33] "I" "N" "N" "F" "L" "I" "I" "D" "Y" "K" "N" "F" "V" "F" "N" "N"
 [49] "Q" "E" "K" "I" "I" "K" "K" "K" "L" "L" "D" "E" "I" "V" "K" "N"
 [65] "E" "N" "L" "T" "N" "K" "K" "F" "I" "S" "L" "I" "E" "K" "I" "K"
 [81] "K" "S" "I" "L" "L" "E" "K" "Q" "N" "F" "I" "N" "F" "I" "K" "L"
 [97] "E" "K" "I" "N" "V" "L" "K" "I" "F" "K" "K" "K" "I" "L" "N" "N"
[113] "N" "M" "L" "I" "I" "K" "N" "F" "L" "I" "E" "I" "K" "K" "L" "F"
[129] "I" "N" "S" "F" "K" "N" "I" "F" "N" "E" "I" "I" "C" "Y" "N" "N"
[145] "E" "F" "I" "I" "N" "Y" "V"

$`lcl|AP009180.1_prot_BAF35037.1_6`
 [1] "M" "F" "K" "F" "I" "N" "R" "F" "L" "N" "L" "K" "K" "R" "Y" "F"
[17] "Y" "I" "F" "L" "I" "N" "F" "F" "Y" "F" "F" "N" "K" "C" "N" "F"
[33] "I" "K" "K" "K" "K" "I" "Y" "K" "K" "I" "I" "T" "K" "K" "F" "E"
[49] "N" "Y" "L" "L" "K" "L" "I" "I" "Q" "K" "Y" "A" "K"

$`lcl|AP009180.1_prot_BAF35038.1_7`
  [1] "M" "L" "N" "E" "G" "I" "I" "N" "K" "I" "Y" "D" "S" "V" "V" "E"
 [17] "V" "L" "G" "L" "K" "N" "A" "K" "Y" "G" "E" "M" "I" "L" "F" "S"
 [33] "K" "N" "I" "K" "G" "I" "V" "F" "S" "L" "N" "K" "K" "N" "V" "N"
 [49] "I" "I" "I" "L" "N" "N" "Y" "N" "E" "L" "T" "Q" "G" "E" "K" "C"
 [65] "Y" "C" "T" "N" "K" "I" "F" "E" "V" "P" "V" "G" "K" "Q" "L" "I"
 [81] "G" "R" "I" "I" "N" "S" "R" "G" "E" "T" "L" "D" "L" "L" "P" "E"
 [97] "I" "K" "I" "N" "E" "F" "S" "P" "I" "E" "K" "I" "A" "P" "G" "V"
[113] "M" "D" "R" "E" "T" "V" "N" "E" "P" "L" "L" "T" "G" "I" "K" "S"
[129] "I" "D" "S" "M" "I" "P" "I" "G" "K" "G" "Q" "R" "E" "L" "I" "I"
[145] "G" "D" "R" "Q" "T" "G" "K" "T" "T" "I" "C" "I" "D" "T" "I" "I"
[161] "N" "Q" "K" "N" "K" "N" "I" "I" "C" "V" "Y" "V" "C" "I" "G" "Q"
[177] "K" "I" "S" "S" "L" "I" "N" "I" "I" "N" "K" "L" "K" "K" "F" "N"
[193] "C" "L" "E" "Y" "T" "I" "I" "V" "A" "S" "T" "A" "S" "D" "S" "A"
[209] "A" "E" "Q" "Y" "I" "A" "P" "Y" "T" "G" "S" "T" "I" "S" "E" "Y"
[225] "F" "R" "D" "K" "G" "Q" "D" "C" "L" "I" "V" "Y" "D" "D" "L" "T"
[241] "K" "H" "A" "W" "A" "Y" "R" "Q" "I" "S" "L" "L" "L" "R" "R" "P"
[257] "P" "G" "R" "E" "A" "Y" "P" "G" "D" "V" "F" "Y" "L" "H" "S" "R"
[273] "L" "L" "E" "R" "S" "S" "K" "V" "N" "K" "F" "F" "V" "N" "K" "K"
[289] "S" "N" "I" "L" "K" "A" "G" "S" "L" "T" "A" "F" "P" "I" "I" "E"
[305] "T" "L" "E" "G" "D" "V" "T" "S" "F" "I" "P" "T" "N" "V" "I" "S"
[321] "I" "T" "D" "G" "Q" "I" "F" "L" "D" "T" "N" "L" "F" "N" "S" "G"
[337] "I" "R" "P" "S" "I" "N" "V" "G" "L" "S" "V" "S" "R" "V" "G" "G"
[353] "A" "A" "Q" "Y" "K" "I" "I" "K" "K" "L" "S" "G" "D" "I" "R" "I"
[369] "M" "L" "A" "Q" "Y" "R" "E" "L" "E" "A" "F" "S" "K" "F" "S" "S"
[385] "D" "L" "D" "S" "E" "T" "K" "N" "Q" "L" "I" "I" "G" "E" "K" "I"
[401] "T" "I" "L" "M" "K" "Q" "N" "I" "H" "D" "V" "Y" "D" "I" "F" "E"
[417] "L" "I" "L" "I" "L" "L" "I" "I" "K" "H" "D" "F" "F" "R" "L" "I"
[433] "P" "I" "N" "Q" "V" "E" "Y" "F" "E" "N" "K" "I" "I" "N" "Y" "L"
[449] "R" "K" "I" "K" "F" "K" "N" "Q" "I" "E" "I" "D" "N" "K" "N" "L"
[465] "E" "N" "C" "L" "N" "E" "L" "I" "S" "F" "F" "I" "S" "N" "S" "I"
[481] "L"

$`lcl|AP009180.1_prot_BAF35039.1_8`
  [1] "M" "I" "I" "K" "E" "I" "N" "S" "K" "I" "K" "I" "T" "T" "N" "I"
 [17] "N" "K" "L" "T" "N" "T" "L" "S" "M" "I" "S" "L" "S" "K" "M" "N"
 [33] "K" "Y" "I" "N" "L" "I" "N" "N" "L" "D" "Y" "I" "N" "I" "E" "L"
 [49] "K" "K" "I" "L" "E" "Y" "I" "I" "I" "N" "I" "K" "S" "N" "V" "F"
 [65] "C" "L" "I" "I" "I" "T" "S" "N" "K" "G" "L" "C" "G" "N" "L" "N"
 [81] "N" "E" "I" "I" "K" "Y" "S" "L" "N" "Y" "I" "K" "N" "N" "K" "N"
 [97] "L" "D" "L" "I" "L" "I" "G" "K" "K" "G" "I" "D" "F" "F" "N" "K"
[113] "K" "N" "F" "Y" "I" "K" "E" "K" "I" "I" "F" "K" "D" "N" "E" "L"
[129] "K" "N" "L" "V" "F" "N" "N" "K" "I" "L" "N" "D" "L" "K" "K" "Y"
[145] "E" "N" "I" "F" "F" "I" "S" "S" "K" "I" "I" "K" "N" "N" "V" "K"
[161] "I" "I" "K" "T" "D" "L" "Y" "L" "K" "K" "K" "Y" "N" "Y" "L" "I"
[177] "K" "H" "N" "F" "N" "Y" "D" "C" "F" "L" "K" "N" "F" "Y" "N" "Y"
[193] "N" "L" "K" "C" "L" "Y" "L" "N" "N" "L" "F" "C" "E" "L" "K" "S"
[209] "R" "M" "I" "T" "M" "K" "S" "A" "A" "D" "N" "S" "K" "K" "I" "I"
[225] "K" "D" "M" "K" "L" "I" "K" "N" "K" "I" "R" "Q" "F" "K" "V" "T"
[241] "Q" "D" "M" "L" "E" "I" "I" "N" "G" "S" "N" "L"

$`lcl|AP009180.1_prot_BAF35040.1_9`
  [1] "M" "I" "G" "R" "I" "V" "Q" "I" "L" "G" "S" "I" "V" "D" "V" "E"
 [17] "F" "K" "K" "N" "N" "I" "P" "Y" "I" "Y" "N" "A" "L" "F" "I" "K"
 [33] "E" "F" "N" "L" "Y" "L" "E" "V" "Q" "Q" "Q" "I" "G" "N" "N" "I"
 [49] "V" "R" "T" "I" "A" "L" "G" "S" "T" "Y" "G" "L" "K" "R" "Y" "L"
 [65] "L" "V" "I" "D" "T" "K" "K" "P" "I" "L" "T" "P" "V" "G" "N" "C"
 [81] "T" "L" "G" "R" "I" "L" "N" "V" "L" "G" "N" "P" "I" "D" "N" "N"
 [97] "G" "E" "I" "I" "S" "N" "K" "K" "K" "P" "I" "H" "C" "S" "P" "P"
[113] "K" "F" "S" "D" "Q" "V" "F" "S" "N" "N" "I" "L" "E" "T" "G" "I"
[129] "K" "V" "I" "D" "L" "L" "C" "P" "F" "L" "R" "G" "G" "K" "I" "G"
[145] "L" "F" "G" "G" "A" "G" "V" "G" "K" "T" "I" "N" "M" "M" "E" "L"
[161] "I" "R" "N" "I" "A" "I" "E" "H" "K" "G" "C" "S" "V" "F" "I" "G"
[177] "V" "G" "E" "R" "T" "R" "E" "G" "N" "D" "F" "Y" "Y" "E" "M" "K"
[193] "E" "S" "N" "V" "L" "D" "K" "V" "S" "L" "I" "Y" "G" "Q" "M" "N"
[209] "E" "P" "S" "G" "N" "R" "L" "R" "V" "A" "L" "T" "G" "L" "S" "I"
[225] "A" "E" "E" "F" "R" "E" "M" "G" "K" "D" "V" "L" "L" "F" "I" "D"
[241] "N" "I" "Y" "R" "F" "T" "L" "A" "G" "T" "E" "I" "S" "A" "L" "L"
[257] "G" "R" "M" "P" "S" "A" "V" "G" "Y" "Q" "P" "T" "L" "A" "E" "E"
[273] "M" "G" "K" "L" "Q" "E" "R" "I" "S" "S" "T" "K" "N" "G" "S" "I"
[289] "T" "S" "V" "Q" "A" "I" "Y" "V" "P" "A" "D" "D" "L" "T" "D" "P"
[305] "S" "P" "S" "T" "T" "F" "T" "H" "L" "D" "S" "T" "I" "V" "L" "S"
[321] "R" "Q" "I" "A" "E" "L" "G" "I" "Y" "P" "A" "I" "D" "P" "L" "E"
[337] "S" "Y" "S" "K" "Q" "L" "D" "P" "Y" "I" "V" "G" "I" "E" "H" "Y"
[353] "E" "I" "A" "N" "S" "V" "K" "F" "Y" "L" "Q" "K" "Y" "K" "E" "L"
[369] "K" "D" "T" "I" "A" "I" "L" "G" "M" "D" "E" "L" "S" "E" "N" "D"
[385] "Q" "I" "I" "V" "K" "R" "A" "R" "K" "L" "Q" "R" "F" "F" "S" "Q"
[401] "P" "F" "F" "V" "G" "E" "I" "F" "T" "G" "I" "K" "G" "E" "Y" "V"
[417] "N" "I" "K" "D" "T" "I" "Q" "C" "F" "K" "N" "I" "L" "N" "G" "E"
[433] "F" "D" "N" "I" "N" "E" "K" "N" "F" "Y" "M" "I" "G" "K" "I"

$`lcl|AP009180.1_prot_BAF35041.1_10`
 [1] "M" "N" "L" "L" "I" "L" "S" "I" "K" "N" "I" "I" "E" "Y" "K" "N"
[17] "A" "S" "I" "L" "N" "V" "K" "T" "Y" "L" "K" "L" "F" "S" "I" "M"
[33] "N" "N" "H" "I" "N" "N" "I" "C" "D" "V" "N" "Q" "I" "K" "L" "I"
[49] "F" "K" "N" "K" "I" "I" "N" "I" "R" "I" "N" "N" "G" "F" "L" "F"
[65] "Q" "K" "K" "N" "N" "T" "K" "I" "I" "C" "N" "F" "Y" "E" "F" "L"

Output of read.fasta()

read.fasta() returns a named list of character vectors. Each element holds one sequence, with one letter per position. The first sequence is

proteins[[1]]
  [1] "M" "N" "T" "I" "F" "S" "R" "I" "T" "P" "L" "G" "N" "G" "T" "L"
 [17] "C" "V" "I" "R" "I" "S" "G" "K" "N" "V" "K" "F" "L" "I" "Q" "K"
 [33] "I" "V" "K" "K" "N" "I" "K" "E" "K" "I" "A" "T" "F" "S" "K" "L"
 [49] "F" "L" "D" "K" "E" "C" "V" "D" "Y" "A" "M" "I" "I" "F" "F" "K"
 [65] "K" "P" "N" "T" "F" "T" "G" "E" "D" "I" "I" "E" "F" "H" "I" "H"
 [81] "N" "N" "E" "T" "I" "V" "K" "K" "I" "I" "N" "Y" "L" "L" "L" "N"
 [97] "K" "A" "R" "F" "A" "K" "A" "G" "E" "F" "L" "E" "R" "R" "Y" "L"
[113] "N" "G" "K" "I" "S" "L" "I" "E" "C" "E" "L" "I" "N" "N" "K" "I"
[129] "L" "Y" "D" "N" "E" "N" "M" "F" "Q" "L" "T" "K" "N" "S" "E" "K"
[145] "K" "I" "F" "L" "C" "I" "I" "K" "N" "L" "K" "F" "K" "I" "N" "S"
[161] "L" "I" "I" "C" "I" "E" "I" "A" "N" "F" "N" "F" "S" "F" "F" "F"
[177] "F" "N" "D" "F" "L" "F" "I" "K" "Y" "T" "F" "K" "K" "L" "L" "K"
[193] "L" "L" "K" "I" "L" "I" "D" "K" "I" "T" "V" "I" "N" "Y" "L" "K"
[209] "K" "N" "F" "T" "I" "M" "I" "L" "G" "R" "R" "N" "V" "G" "K" "S"
[225] "T" "L" "F" "N" "K" "I" "C" "A" "Q" "Y" "D" "S" "I" "V" "T" "N"
[241] "I" "P" "G" "T" "T" "K" "N" "I" "I" "S" "K" "K" "I" "K" "I" "L"
[257] "S" "K" "K" "I" "K" "M" "M" "D" "T" "A" "G" "L" "K" "I" "R" "T"
[273] "K" "N" "L" "I" "E" "K" "I" "G" "I" "I" "K" "N" "I" "N" "K" "I"
[289] "Y" "Q" "G" "N" "L" "I" "L" "Y" "M" "I" "D" "K" "F" "N" "I" "K"
[305] "N" "I" "F" "F" "N" "I" "P" "I" "D" "F" "I" "D" "K" "I" "K" "L"
[321] "N" "E" "L" "I" "I" "L" "V" "N" "K" "S" "D" "I" "L" "G" "K" "E"
[337] "E" "G" "V" "F" "K" "I" "K" "N" "I" "L" "I" "I" "L" "I" "S" "S"
[353] "K" "N" "G" "T" "F" "I" "K" "N" "L" "K" "C" "F" "I" "N" "K" "I"
[369] "V" "D" "N" "K" "D" "F" "S" "K" "N" "N" "Y" "S" "D" "V" "K" "I"
[385] "L" "F" "N" "K" "F" "S" "F" "F" "Y" "K" "E" "F" "S" "C" "N" "Y"
[401] "D" "L" "V" "L" "S" "K" "L" "I" "D" "F" "Q" "K" "N" "I" "F" "K"
[417] "L" "T" "G" "N" "F" "T" "N" "K" "K" "I" "I" "N" "S" "C" "F" "R"
[433] "N" "F" "C" "I" "G" "K"
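Since the result is an ordinary named list, everything we saw about lists applies to it. The sketch below uses a tiny made-up list with the same shape (it is not real data and does not need the seqinr package) to show a few useful questions we can ask.

```r
# Made-up stand-in with the same structure as the output of read.fasta()
seqs <- list(prot_1 = c("M", "N", "T", "I"),
             prot_2 = c("M", "K"))

seq_names   <- names(seqs)           # one name per sequence
n_sequences <- length(seqs)          # how many sequences
seq_lengths <- sapply(seqs, length)  # length of each sequence
first_three <- seqs[["prot_1"]][1:3] # first three letters of one sequence
as_text     <- paste(seqs[[1]], collapse = "")  # letters joined into one string

print(seq_names)
print(n_sequences)
print(seq_lengths)
print(first_three)
print(as_text)
```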

Calculating the GC content

Calculation of GC content

If the DNA has been sequenced then the GC-content can be accurately calculated by simple arithmetic.

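The GC content is the fraction \[\frac{G+C}{A+T+G+C}\] where each letter stands for the number of times that nucleotide appears in the sequence. Multiply by 100 to express it as a percentage.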
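As a worked example of the formula, here is the arithmetic with made-up counts (they do not come from any real gene):

```r
# Made-up nucleotide counts, for illustration only
n_A <- 4
n_C <- 2
n_G <- 3
n_T <- 1

gc_fraction <- (n_G + n_C) / (n_A + n_T + n_G + n_C)  # 5 / 10
gc_percent  <- 100 * gc_fraction

print(gc_fraction)
print(gc_percent)
```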

We want to find the GC content of the first gene of E. coli

Proposed solution

  1. Find the E. coli genes.
    • How do we find them?
    • Which one?
  2. Find the first gene of E. coli.
    • What is the “first gene”? The one nearest the replication origin?
    • Which strand?
    • How do we find a gene?
  3. Count each nucleotide.
    • Let’s assume that the gene is in a vector V
    • Use the table() function
    • Then calculate using the GC formula

1. Find the E. coli genes.

  • How do we find them?

    We download the coding sequences in FASTA format from NCBI

  • Which one?

    The one with accession number NC_000913

genes <- read.fasta("NC_000913.ffn", seqtype="DNA", set.attributes = FALSE)

2. Find the first gene of E. coli

  • What is the “first gene”? The one nearest the replication origin?

    We take the first element of the list

    genes[[1]]
  • Which strand?

    The strand given in the FASTA file

  • How do we find a gene?

    We use genes defined in the FASTA file

3. Count each nucleotide

  • let’s assume that the gene is in a vector V

    V <- genes[[1]]
  • use the table() function

    count <- table(V)
  • Then calculate using the GC formula

    (count["g"] + count["c"])/(count["g"]+count["c"]+count["a"]+count["t"])

First solution

V <- genes[[1]]
count <- table(V)
(count["g"]+count["c"])/(count["g"]+count["c"]+count["a"]+count["t"])
        g 
0.5151515 

There is a little problem

This was the same solution I used in previous years

But I found that there is a problem

What if the gene is TGTGTGTG?

count <- table(c("T","G","T","G","T","G","T","G"))
count

G T 
4 4 
count["C"]
<NA> 
  NA 

Using sum() on a logical vector instead of table()

V <- genes[[1]]
length(V)
[1] 66
sum(V=="c")
[1] 22
sum(V=="g")
[1] 12
sum(V=="a")
[1] 21
sum(V=="t")
[1] 11

Notice that A+C+G+T is equal to the sequence length

This solution never gives NA: a letter that does not appear simply has a count of 0

What if the gene is TGTGTGTG?

Now it works correctly

V <- c("T", "G", "T", "G", "T", "G", "T", "G")
sum(V=="C")
[1] 0

There are 0 “C” nucleotides

Sometimes DNA is lowercase

But we have another problem. Sometimes DNA is lowercase

V <- c("a", "c", "t", "g", "a", "c", "t", "g")
sum(V=="C")
[1] 0

This code gives us the wrong answer: there are 2 “c” nucleotides, but we counted 0

Remember that, for the computer, upper- and lower-case letters are different

Making sure that DNA is uppercase

We use the function toupper(). It takes a character vector and converts every letter to upper case

V <- toupper(c("a","c","t","g","a","c","t","g"))
sum(V=="C")
[1] 2

Working solution

This code implements the idea we developed in the last class

V <- toupper(genes[[1]])
count_C <- sum(V=="C")
count_G <- sum(V=="G")
GC_content <- (count_C + count_G)/length(V)
print(GC_content)
[1] 0.5151515

We will use it in the next class
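One possible way to reuse the code, sketched under assumptions: wrap the same steps in a function and apply it to every element of a list with sapply(). The function name gc_content and the list genes_demo are made up for this example, and the two short sequences are not real genes.

```r
# Hypothetical helper: the same steps as the working solution, as a function
gc_content <- function(V) {
  V <- toupper(V)                  # make sure the letters are uppercase
  count_C <- sum(V == "C")
  count_G <- sum(V == "G")
  (count_C + count_G) / length(V)  # fraction of G and C
}

# Made-up stand-in for the list returned by read.fasta()
genes_demo <- list(gene_1 = c("a", "t", "g", "a"),
                   gene_2 = c("T", "G", "T", "G", "T", "G", "T", "G"))

gc_all <- sapply(genes_demo, gc_content)  # one value per gene, with names
print(gc_all)
```

Because the input is a named list, the result is a named vector, so each GC content stays attached to the name of its gene.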

Summary

  • read.fasta() gives named lists
  • Use toupper() to get uppercase letters
  • Use sum(V=="X") to count the “X”s in V
    • do not forget to use == (comparison), not = (assignment)
  • Use length(V) for the sequence length