Blog of Andrés Aravena
CMB2:

Comments on Midterm Exam, part 1

04 August 2020

Question 1 of the midterm exam dealt with codon usage. The first three parts are identical to those of the final exam, so we will not repeat the discussion here.

1.4 Absolute to relative frequencies

Write a function called count_to_frequency() that takes a single vector with integer numbers (such as the number of times each codon appears), and returns a new vector of the same size with the relative frequencies. In other words, each value is divided by the total.

The input is the vector number_of_codons. The total is sum(number_of_codons). Therefore the answer is just this:

count_to_frequency <- function(number_of_codons) {
  number_of_codons / sum(number_of_codons)
}
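A quick check with made-up numbers helps to see what the function does. The codon names and counts below are hypothetical, not exam data. The result keeps the names of the input and adds up to 1.

```r
count_to_frequency <- function(number_of_codons) {
  number_of_codons / sum(number_of_codons)
}

# hypothetical counts for three codons (total is 8)
toy_count <- c(AAA = 2, AAC = 4, AAG = 2)

toy_freq <- count_to_frequency(toy_count)
toy_freq        # AAA 0.25, AAC 0.50, AAG 0.25
sum(toy_freq)   # 1
```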

1.5 Apply count_to_frequency()

Calculate the relative frequencies of codon usage for each gene, and in total. Create a vector called cell_codon_frequency using the function count_to_frequency() on total_codon_count. Then create a list called gene_codon_frequency. Each element of the list contains the result of using count_to_frequency() applied to each element of genes_codon_count.

The first part is trivial. We just “Create a vector called cell_codon_frequency using the function count_to_frequency() on total_codon_count.” Translate English to R.

cell_codon_frequency <- count_to_frequency(total_codon_count)

The second part is easy, and follows a pattern we have seen before. We create an empty list using the function list(). Notice that making a list of a predetermined size is less obvious than making a vector of a predetermined size (it can be done with vector("list", n)). But it is not important here, since the list grows automatically when we assign new elements.

gene_codon_frequency <- list()
for(i in 1:length(genes_codon_count)) {
    gene_codon_frequency[[i]] <- count_to_frequency(genes_codon_count[[i]])
}

You can also use the more advanced function lapply(). It lets us solve this kind of question in a single line.

gene_codon_frequency <- lapply(genes_codon_count, count_to_frequency)

These two pieces of code compute the same values, but the second is shorter and usually faster.
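There is one small difference worth knowing: lapply() keeps the names of the input list, while the loop assigns elements by position and leaves the result without names. The sketch below uses a hypothetical two-gene list (toy_genes_count) to show it. It also uses seq_along(x) instead of 1:length(x), which is a safer habit because it does nothing when the list is empty.

```r
count_to_frequency <- function(number_of_codons) {
  number_of_codons / sum(number_of_codons)
}

# hypothetical counts for two genes and two codons
toy_genes_count <- list(geneA = c(AAA = 1, AAC = 3),
                        geneB = c(AAA = 2, AAC = 2))

# loop version: elements are assigned by position
loop_freq <- list()
for (i in seq_along(toy_genes_count)) {
  loop_freq[[i]] <- count_to_frequency(toy_genes_count[[i]])
}

# lapply version
lapply_freq <- lapply(toy_genes_count, count_to_frequency)

names(loop_freq)     # NULL
names(lapply_freq)   # "geneA" "geneB"
```

If we need the names in the loop version, we can copy them afterwards with names(loop_freq) <- names(toy_genes_count).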

1.6 Absolute distance

Write a function called abs_distance(a, b) that takes two vectors and returns the sum of the absolute values of each a[i] minus b[i]. The result is a single non-negative number.

There are several possible solutions. The first one uses an auxiliary variable to accumulate the sum:

abs_distance <- function(a, b) {
  add <- 0
  for(i in 1:length(a)) 
    add <- add + abs(a[i]-b[i])
  return(add)
}

For some people it is easier to think about building a vector with the absolute differences, and then adding all its elements:

abs_distance <- function(a, b) {
  ans <- rep(NA, length(a))
  for(i in 1:length(a)) {
    ans[i] <- abs(a[i]-b[i])
  }
  return(sum(ans))
}

If you remember that vectors can be combined with arithmetic operations, you can rewrite the last solution without using for(), like this:

abs_distance <- function(a, b) {
  ans <- abs(a-b)
  return(sum(ans))
}

In this version the ans vector is built in one step. It is faster and shorter. If you want it even shorter, skip ans and return the sum directly:

abs_distance <- function(a, b) {
  sum(abs(a-b))
}
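To convince ourselves that the loop and the vectorized solutions agree, we can try them on a small example. The vectors a and b below are made up, and the name abs_distance_loop is used here only to keep both versions side by side.

```r
# loop version, accumulating the sum
abs_distance_loop <- function(a, b) {
  add <- 0
  for (i in seq_along(a)) {
    add <- add + abs(a[i] - b[i])
  }
  add
}

# vectorized version
abs_distance <- function(a, b) {
  sum(abs(a - b))
}

a <- c(0.5, 0.25, 0.25)
b <- c(0.25, 0.25, 0.5)

abs_distance_loop(a, b)   # 0.5
abs_distance(a, b)        # 0.5
abs_distance(a, a)        # 0, a vector is at distance zero from itself
```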

An example of a wrong answer:

## WRONG CODE
abs_distance <- function(a, b) {
  # what is `length(a, b)`?
  for(i in 1:length(a, b)) {
    add <- abs(a[i]-b[i])
    # the variable `add` gets only one value
    # it is updated on every loop
    # at the end it gets only the last `abs(a[i]-b[i])`
    # the rest is forgotten
  }
  return(sum(add))
  # this sum is only adding one number
  # the rest is forgotten
}

It is interesting to think about length(a, b). R rejects this call, because length() takes only one argument. Probably the student was thinking that we should consider the length of both vectors. The question is “how to combine them?”. If we do length(c(a, b)) we get a number that is too big: length(a) + length(b). This is bigger than the length of either vector.

It is much safer to consider max(length(a), length(b)), but we get into trouble again if one of the vectors is longer than the other: indexing the shorter one past its end gives NA. In fact the distance only makes sense if both vectors have the same length. This is always the case in this exam.
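Since the distance only makes sense for vectors of the same length, one option is to make the function check this itself. This was not required in the exam; the sketch below, with the hypothetical name abs_distance_checked, only shows how such a check could look.

```r
abs_distance_checked <- function(a, b) {
  # stop with an error if the lengths differ
  stopifnot(length(a) == length(b))
  sum(abs(a - b))
}

abs_distance_checked(c(1, 2, 3), c(3, 2, 1))   # 4

# without the check, the shorter vector is recycled silently
sum(abs(c(1, 2, 3, 4) - c(1, 2)))              # 4, but it means nothing

# with the check we get an error instead of a misleading number
result <- try(abs_distance_checked(c(1, 2, 3, 4), c(1, 2)), silent = TRUE)
inherits(result, "try-error")                  # TRUE
```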

1.7 Calculate all distances

Calculate the distances between every vector in gene_codon_frequency and the vector cell_codon_frequency, and store them in a vector called distance. The result contains one entry for each gene.

This is again a typical pattern, in which the same function is applied to each element of a list, and we assign the result to a vector.

distance <- rep(NA, length(gene_codon_frequency))
for(i in 1:length(gene_codon_frequency))
    distance[i] <- abs_distance(gene_codon_frequency[[i]],
                                cell_codon_frequency)

As with lapply() earlier, this can be done in one line, this time with the function sapply(), which returns a vector instead of a list when every result is a single value. You should really learn it.

distance <- sapply(gene_codon_frequency, abs_distance, cell_codon_frequency)
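In this call, each element of the list is passed as the first argument of abs_distance(), and the extra argument after the function name is passed as the second one. The toy data below are hypothetical. We also show vapply(), a stricter relative of sapply() where we declare that each result is one number.

```r
abs_distance <- function(a, b) {
  sum(abs(a - b))
}

# hypothetical frequencies for the whole cell and for two genes
toy_cell_freq <- c(AAA = 0.5, AAC = 0.5)
toy_gene_freq <- list(geneA = c(AAA = 0.25, AAC = 0.75),
                      geneB = c(AAA = 0.5,  AAC = 0.5))

# each gene goes into `a`, toy_cell_freq goes into `b`
toy_distance <- sapply(toy_gene_freq, abs_distance, toy_cell_freq)
toy_distance     # geneA 0.5, geneB 0.0

# same result, declaring that every answer is a single number
toy_distance_v <- vapply(toy_gene_freq, abs_distance, numeric(1),
                         b = toy_cell_freq)
```

Notice that the result carries the names of the list, so each distance is labeled with its gene.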

1.8 Find the most different gene.

Write the code to find the name of the gene which has the greatest value on distance.

The greatest value in distance is found using

max(distance)

but we do not want that. We want the position of the greatest value

which.max(distance)

and then we use that position as an index for the vector names(genes)

names(genes)[which.max(distance)]
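The difference between the value and its position is easier to see with a small example. Here toy_gene_names stands in for names(genes) and toy_distance for distance; both are made up.

```r
toy_gene_names <- c("geneA", "geneB", "geneC")   # stands in for names(genes)
toy_distance   <- c(0.5, 1.25, 0.75)

max(toy_distance)                        # 1.25, the greatest value
which.max(toy_distance)                  # 2, the position of that value
toy_gene_names[which.max(toy_distance)]  # "geneB"
```

If several genes share the greatest distance, which.max() returns only the first of them.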

1.9 (Bonus) Find the 6 genes most different from cell_codon_frequency.

To get the most different genes, you need to sort the distance vector

sort(distance, decreasing = TRUE)

but this will give you all the genes, and we want only the first 6. We can use an index

sort(distance, decreasing = TRUE)[1:6]

or we can use the function head(), which (who would guess) gives us the first 6 elements by default

head(sort(distance, decreasing = TRUE))

The problem is that we get the values, not the names. One solution is to assign names to the distance vector, and then look at the names of the top genes.

names(distance) <- names(genes)
names(head(sort(distance, decreasing = TRUE)))

Another possibility is to replace sort() by order(), which is a more general way to solve these questions:

names(genes)[head(order(distance, decreasing = TRUE))]
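The function order() returns positions instead of values, and that is why it can be used as an index on any vector of the same length. A small example with eight hypothetical genes (toy_gene_names and toy_distance are made up) shows each step:

```r
toy_gene_names <- c("g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8")
toy_distance   <- c(0.10, 0.80, 0.30, 0.70, 0.20, 0.60, 0.40, 0.50)

# positions of the elements, from greatest to smallest distance
top <- head(order(toy_distance, decreasing = TRUE))
top                    # 2 4 6 8 7 3

# the same index works on the names and on the values
toy_gene_names[top]    # "g2" "g4" "g6" "g8" "g7" "g3"
toy_distance[top]      # 0.8 0.7 0.6 0.5 0.4 0.3
```

To get a different number of genes, give head() a second argument, as in head(order(toy_distance, decreasing = TRUE), n = 3).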