Blog of Andrés Aravena
CMB2:

Comments on Homework 5

18 March 2020

I received some interesting answers to Homework 5. I want to share them with everybody, so everybody can learn.

Student 1

Student 1 wrote this answer to Question 1:

library(seqinr)
sequences <- read.fasta("NC_000913.fna")

genes_gc_content <- function(genome) {
  positions <- seq(from=1, to=length(genome)-window_size, by= window_size)
  nA<-0
  nG<-0
  nC<-0
  nT<-0
  
  for(i in 1:length(genome)) {
    positions[i]<- genome_gc_content(genome[[i]])
    if(genome[i]=="a") {
      na <- na+1
    }else if(genome[i]=="t") {
      nt <- nt+1
    }else if(genome[i]=="g") {
      ng <- ng+1
    }else if(genome[i]=="c") {
      nc <- nc+1
    }
  }
  answer<- (ng+nc)/(ng+nc+nt+na)
  return(positions)
}

The first two lines are right, but they are not necessary. You can assume that the user of the function will do that work. Remember that there are two scenarios: first when you define the function, and second when someone else uses the function.

The line positions <- seq(from=1, to=length(genome)-window_size, by= window_size) is for Question 2 and is useless here.

Then we have the line positions[i]<- genome_gc_content(genome[[i]]). For one side, if we use genome[[i]] it means that genome is a list. But on the next lines we use genome as a vector. So, what is it?

Now, read again the text of the question. The function must be called window_gc_content, not genes_gc_content as here. Also, it takes three inputs: sequence, position, and size.

There is a detail in the if() commands. Here we test only for lower case letters, but as we saw on classes, sometimes they are upper-case letters.

Student 2

Student 2 wrote something like this:

gc_content<- function(coli_seq){
  A<- sum(coli_seq=="a"|| "A")
  T<- sum(coli_seq=="t"|| "T")
  G<- sum(coli_seq=="g"|| "G")
  C<- sum(coli_seq=="c"|| "C")
  ans<- (G+C)/(G+C+A+T)
  return(ans)
}

The symbol || represents the logical operation “or”.

If x and y are logical values (that is, TRUE or FALSE), then the expression x||y will be TRUE if x is true or if y is true, or both. It will be false only when x and y are false.

This does not work, because we are mixing apples and oranges. The part coli_seq=="a" is a comparison between a vector of letters and a letter; the result is a logic vector. Then the part "A" is a single letter, not a vector, not a logic value.

As a human I can understand that the student wants to say ‘if the sequence is “a” or “A”’, but the computer is dumb and does not understand. We need to say ‘if the sequence is “a” or the sequence is “A”’.

Use help(``||``) to se the manual Moreover, the operand || may not be the correct one, since it returns a single value. Better use |, as indicated in the help page.

Student 3

Now we have this:

window_gc_content <- function (genome, position, size){
  genome <- seq(from=250000, length=100)
  position <- 250000
  size <- 100
  nA <- 0
  nC <- 0
  nG <- 0
  nT <- 0
  for(i in 1:length(seq(from=250000, length=100))) {
    if(genome[i]=="A" | genome[i]=="a"){
      nA <- nA+1
    } else  if(genome[i]=="C" | genome[i]=="c"){
      nC <- nC+1
    } else  if(genome[i]=="G" | genome[i]=="g"){
      nG <- nG+1
    } else  if(genome[i]=="T" | genome[i]=="t"){
      nT <- nT+1
    }
  }
  answer <- (nG+nC)/(nA+nT+nG+nC)
 
  return(answer)
}

This time the name and inputs are correct. But then immediately —in lines 2, 3, and 4— we are destroying the input. That is a “no–no”. Do not do that, never. Input is sacred. Treat it with utmost respect.

We need to answer the following questions:

  • What is genome? a list? a vector? a number?
  • What is position? a list? a vector? a number?
  • What is size? a list? a vector? a number?
  • Are we using these inputs in the code?

The result should change when any of the inputs changes. Otherwise they are not inputs.

Student 4

Here I don’t even know who answered, because the answer file is like this:

#'title: "Homework 5"
#'subtitle: "Computing in Molecular Biology and Genetics II"
#'author: "Put your name here"
#'number: STUDENT_NUMBER
#'date: "March 11, 2020"

Remember that this is the last year that we teach this content. Next year we will do something completely different. Maybe Python, maybe other language.

This person has no name, no student number, and lives in the past. If this was the exam, this person gets 0. Not a good idea if you want to pass the course. There is only one way to pass this course: get good grades.

Be careful with these details. Be professional. If you are not careful in the computer, how can we trust you with expensive reactives in the lab? How can we trust you with the analysis of critical data?

Student 4 did something good: recycling the function gc_content() that we created in the class. This function takes only one input: a vector of letters, representing a DNA sequence.

Then we have this proposed solution:

window_gc_content <- function(sequence, position, size) {
  sequen <- rep("a", 1000)
  sequen <- genome[[1]][seq(from=position, length=size)]
  for(i in 1: length(sequen)) {
    sequen [i] <- gc_content(sequen[i])
    answer <- gc_content(sequne)
  }
  return(sequne)
}

In the second line we assign a value to sequen, and then in the next line we change it immediately. So we do not need the second line. The variable sequen is a vector of character. It takes part of genome[[1]], using seq(from=position, length=size) as index. This is a very good idea.

On the other side, there is no genome among the inputs. Where did genome come from?

Line 5 is weird. We assign a number to sequen[i], which was before a vector of characters. So we changed letters to numbers. That is like putting apples in the egg box.

It is a good idea to use the gc_content() function. It takes a vector of character, but in line 5 we give it sequen[i], which is a single character, not a vector.

What about answer? We create this variable but we never use it.

Student 5

At the last minute I received this:

window_gc_content <- function(sequence, position, size)
  for(i in 1:length(genome)) {
    if(genome[i]=="A"| genome[i]=="a") {
      nA <- nA+1
    } else if (genome[i]=="C" | genome[i]=="c"){
      nC <- nC+1
    } else if (genome[i]=="G" | genome[i]=="g") {
      nG <- nG+1
    } else if (genome[i]=="T" | genome[i]== "t") {
      nT <- nT+1
    }
  }
  answer <- (nG+nC)/(nA+nT+nG+nC)

  answer <- rep("a", 1000)
  answer <- seq(from=sequences[[1]][250000], length=1000)
  return(answer)
}

The first line is nice, but it does not have the { symbol. The second one uses genome, which is not an input nor something we created. It appeared from nowhere… it is magic. In other words, it is wrong, since magic does not exist.

Again, the three inputs are never used in the code. They should be used.

Finally, answer is assigned three times. The first time it gets a number, which is what it should be. The second time we destroy the good answer and we replace it with a vector 1000 letters "a". Finally we destroy it again and we assign a vector of numbers, that do not depend on the GC content.

Moreover, the question says that sequences is a vector of character, not a list. So we cannot use sequences[[1]], which is only valid for lists.

Finally, the function seq() can only take numeric inputs. In other words, when we say seq(from=x, length=y), then x and y must be numbers. But sequences[[1]][250000] is not a number.

By the way, the numbers 1000 and 250000 should not be in your answer. They are examples. Your function should not depend on them. Always think on the general case, not the particular one.

Conclusion

“I am a great believer in luck. The harder I work, the more of it I seem to have.”
Some famous guy