Blog of Andrés Aravena
CMB2:

Comments on Midterm Exam, part 2

04 August 2020

This question deals with Genome Analysis. We want to “draw the genome” using Turtle Graphics, and then find the replication origin using the GC-skew.

2.1 Load the bacterial genome

Load the FASTA file of the bacterial genome (not the genes) and assign the first element to a vector called genome.

This is very easy and has been discussed several times already. Just remember that you want a vector, and read.fasta() gives you a list.

2.2 Draw the genome using Turtle Graphics

To make it easy, we will draw only the first 1000 nucleotides.

For each position in the genome between 1 and 1000, if the nucleotide is "a", the turtle moves one step to the left. If the nucleotide is "c", the turtle moves one step down. If the nucleotide is "g", the turtle moves one step up. Finally, if the nucleotide is "t", the turtle moves one step to the right.

I think we did this on a class. It is very straightforward:

library(TurtleGraphics)
turtle_init(mode="clip")
turtle_hide()
turtle_do({
  for(i in 1:1000) {
    if(genome[i]=="a") {
      turtle_setangle(-90)
    } else if(genome[i]=="c") {
      turtle_setangle(180)
    } else if(genome[i]=="g") {
      turtle_setangle(0)
    } else if(genome[i]=="t") {
      turtle_setangle(90)
    }
    turtle_forward(1)
  }
})

Sometimes the genome sequence is written using upper case letters. Thus, a better version would be:

library(TurtleGraphics)
turtle_init(mode="clip")
turtle_hide()
turtle_do({
  for(i in 1:1000) {
    if(genome[i]=="a" | genome[i]=="A") {
      turtle_setangle(-90)
    } else if(genome[i]=="c" | genome[i]=="C") {
      turtle_setangle(180)
    } else if(genome[i]=="g" | genome[i]=="G") {
      turtle_setangle(0)
    } else if(genome[i]=="t" | genome[i]=="T") {
      turtle_setangle(90)
    }
    turtle_forward(1)
  }
})

The code genome[i]=="a" | genome[i]=="A" means “the nucleotide in position i of genome has the value "a", or the nucleotide in position i of genome has the value "A"”.

2.3 Non-overlapping windows of size 1000

Create a vector called genome_position with a sequence of positions starting at 1, increasing by 1000 up to the last position in the genome minus 1000.

(We stop at least 1000 nucleotide before the genome end to make sure that each window is inside the genome)

This is just translating English into R.

window_size <- 1000
genome_position <- seq(from=1, to=length(genome)-window_size, by=window_size)

If you do not want to use window_size, you can say

genome_position <- seq(from=1, to=length(genome)-1000, by=1000)

2.4 GC-skew

Calculate the GC-skew of each window starting in the positions given by genome_position, and store them in a vector called skew.

You can use the following function, that we used in a homework.

window_gc_skew <- function(genome, position, window_size) {
  nC <- nG <- 0
  for(i in seq(from=position, length=window_size)) {
    if(genome[i]=="c" | genome[i]=="C") {
      nC <- nC + 1
    } else if(genome[i]=="g" | genome[i]=="G") {
      nG <- nG + 1
    }
  }
  if(nG+nC==0) { return(0) } else { return((nG-nC)/(nG+nC)) }
}

As in many other cases, we apply the same function to each element of a list and assign the result to a vector.

skew <- rep(NA, length(genome_position))
for(j in 1:length(genome_position)) {
  skew[j] <- window_gc_skew(genome, genome_position[j], window_size)
}

This is a pattern that we use often. Understand and use it.

2.5 Accumulated GC-skew

Calculate the accumulated GC-skew and store it in a vector called acc_skew

This is taken directly from the example of balance and incoming money.

acc_skew <- rep(NA, length(skew))
acc_skew[1] <- skew[1]
for(i in 2:lenght(skew)) {
    acc_skew[i] <- acc_skew[i-1] + skew[i]
}

“The acc_skew of today is the acc_skew of yesterday plus the skew of today”.

For this particular case there is a special function to calculate cumulative sums:

acc_skew <- cumsum(skew)

2.6 Extremal acc_skew values

Find the genome position of the window which has the maximum value of acc_skew. Do the same for the minimum value.

As in all question about maximum, it is important to understand if we look for the value or the index of the largest element. In this case we need the index so we use which.max().

But this is an index in a vector, not a position in the genome. Therefore, we must use this index to point inside the correct vector. Since the question asks for the genome position, we use the vector genome_position.

genome_position[which.max(acc_skew)]
genome_position[which.min(acc_skew)]