Blog of Andrés Aravena
CMB2:

Comments on Final Exam, part 2

21 July 2020

The second exam question is a nice one. It is not too hard, not too easy, and is very practical. As you can read in the references given in the exam, this is what real molecular biologists use to identify horizontal transference. World-class molecular biology is 50% computing.

2.1 Load all genes

Download the FASTA file containing all the genes of your bacteria and read them into the list genes.

This was asked also in the midterm, and it is very easy. Everybody can have a different filename. The filename is not important.

library(seqinr)
genes <- read.fasta("NC_000913.ffn")

There are two typical errors in this question. Some people forgot to say library(seqinr) so the code does not work. On the other hand, other people included install.packages("seqinr"). That is wrong. Installing packages is never part of the answer, since you must install in only once, and the code is executed several times.

Use just these two lines.

2.2 Count codons in each gene

Create a list called genes_codon_count, containing one vector for each gene in the bacterial genome. Each vector is the result of using codon_count() on each gene.

This is also from the midterm exam. Very easy. With a for() loop we apply the function codon_count() to each element of the list genes, and we store the result in the list genes_codon_count.

genes_codon_count <- list()
for(i in 1:length(genes)) {
    genes_codon_count[[i]] <- codon_count(genes[[i]])
}

It is important to use double brackets to work with list elements. Also, keep in mind that the input of the codon_count() function can only be a vector, not a list.

There is a difference between vectors and lists when we create them. In most cases we can know the vector size before we know its content, so we can use rep(NA, size). But we cannot do that for lists. It is not easy to prepare a list with a fixed size. So we create genes_codon_count using genes_codon_count <- list() and do not worry about its size. We then just add elements, using the double brackets, like in genes_codon_count[[i]] <- codon_count(genes[[i]]).

Some clever students used the advanced function lapply(). This allows us to solve this kind of questions in one single line.

genes_codon_count <- lapply(genes, codon_count)

It is worth learning about lapply() and similar functions. This is even faster if you work in a server, since lapply() can be done in multiple processors at the same time. That allows us to run complex code very fast.

2.3 Count codons in all genes

Create the vector total_codon_count containing the total number of times that each codon appears on all genes.

Many people got confused in this question. That is a good way to measure if the student understands or not. If you only study, you may not understand. Studying is not enough. You need to learn.

The direct way to solve this question is to just add all the values, one by one. That was explained in the blog. Apparently people get confused when one variable gets updated. You should not. Variables are variable, that is, they change. Otherwise they will be called constants. Look at the blog and at this code.

total_codon_count <- rep(0, 64)
names(total_codon_count) <- valid_codons
total_codon_count <- total_codon_count + genes_codon_count[[1]]
total_codon_count <- total_codon_count + genes_codon_count[[2]]
total_codon_count <- total_codon_count + genes_codon_count[[3]]
total_codon_count <- total_codon_count + genes_codon_count[[4]]
total_codon_count <- total_codon_count + genes_codon_count[[5]]
total_codon_count <- total_codon_count + genes_codon_count[[6]]
total_codon_count <- total_codon_count + genes_codon_count[[7]]
total_codon_count <- total_codon_count + genes_codon_count[[8]]
total_codon_count <- total_codon_count + genes_codon_count[[9]]
total_codon_count <- total_codon_count + genes_codon_count[[10]]

total_codon_count <- total_codon_count +
                    genes_codon_count[[length(genes_codon_count)]]

Obviously this is not practical. It is too complicated, it is easy to make mistakes, and you really do not know how many genes are in the organism you will study. Notice that your code should work for all organisms, not only for the one you got assigned.

So, if you understand this question, you realize that you must use a for(){} loop.

total_codon_count <- rep(0, 64)
names(total_codon_count) <- valid_codons
for(i in 1:length(genes_codon_count)) {
  total_codon_count <- total_codon_count + genes_codon_count[[i]]
}

That is the solution. It is very easy, since you can —and should— add vectors to vectors. If you do not feel comfortable adding vectors, you can do like this:

total_codon_count <- rep(0, 64)
names(total_codon_count) <- valid_codons
for(i in 1:length(genes_codon_count)) {
    for(j in valid_codons) {
      total_codon_count[j] <- total_codon_count[j] + genes_codon_count[[i]][j]
    }
}

But really, get over it. Add vectors to vectors.

2.4 Compute the relative adaptiveness of each codon

We need to calculate the relative adaptiveness of each codon, using the values in the total_codon_count vector.

You can calculate the adaptiveness of every codon, and store only these codons it in the vector codon_adaptness, by doing the following steps:

  • For each amino-acid AA:
    • Find the codons that encode AA it and store them in a vector called codons_for_aa.
    • Find the maximum of total_codon_count among the codons in codons_for_aa. Look only at these codons, ignore the rest. Store this maximum in max_count_for_aa.
    • For each codon in codons_for_aa:
      • Find the value of total_codon_count for that particular codon.
      • Divide it by max_count_for_aa.
      • Calculate the logarithm of the result of the last division, and store it in codon_adaptness using the codon as the index.
      • Repeat for each codon in codons_for_aa.
    • Repeat for each amino acid.

Besides this description, this code was provided:

genetic_code <- list(Ala = c("gca", "gcc", "gcg", "gct"),
     Arg = c("aga", "agg", "cga", "cgc", "cgg", "cgt"),
     Asn = c("aac", "aat"), Asp = c("gac", "gat"),
     Cys = c("tgc", "tgt"), Gln = c("caa", "cag"),
     Glu = c("gaa", "gag"), Gly = c("gga", "ggc", "ggg", "ggt"),
     His = c("cac", "cat"), Ile = c("ata", "atc", "att"),
     Leu = c("cta", "ctc", "ctg", "ctt", "tta", "ttg"),
     Lys = c("aaa", "aag"), Phe = c("ttc", "ttt"),
     Pro = c("cca", "ccc", "ccg", "cct"),
     Ser = c("agc", "agt", "tca", "tcc", "tcg", "tct"),
     Thr = c("aca", "acc", "acg", "act"),
     Tyr = c("tac", "tat"), Val = c("gta", "gtc", "gtg", "gtt"))

codon_adaptness <- rep(0, 64)
names(codon_adaptness) <- valid_codons

To get the solution, we translate the instructions from English to R language.

# For each amino-acid _AA_:
for(aa in 1:length(genetic_code)) {
    # Find the codons that encode _AA_ it and store them in a vector called `codons_for_aa`.
    codons_for_aa <- genetic_code[[aa]]
    # Find the maximum of `total_codon_count` _among the codons in_ `codons_for_aa`.
    # Look only at these codons, ignore the rest.
    # Store this maximum in `max_count_for_aa`.
    max_count_for_aa  <- max(total_codon_count[codons_for_aa])
    # For each codon in `codons_for_aa`:
    for(j in 1:length(codons_for_aa)) {
        codon <- codons_for_aa[j]
        # Find the value of `total_codon_count` for that particular codon.
        # Divide it by `max_count_for_aa`.
        # Calculate the logarithm of the result of the last division,
        # and store it in `codon_adaptness` using the codon as the index.
      codon_adaptness[codon] <- log(total_codon_count[codon]/max_count_for_aa)
    # Repeat for each codon in `codons_for_aa`.
    }
# Repeat for each amino acid.
}

Most of the problems that students show in their answers are related to two things: confusing lists and vectors, and remembering the different ways to use indices.

The goal of this course is to learn to use a computer, the same way you learn to use a bike or you learn to walk and talk. If you really learn, you never forget. It is not about memory, it is about understanding, which you show when you use your tools in a new way.

For example, to go through all amino-acids, we can use a for() loop with aa in 1:length(genetic_code) or we can use aa in names(genetic_code). In the first case the variable aa takes numeric values like 1, 2, …, 20. In the second case the variable aa takes text values like "Ala", "Arg", "Asn", …, "Val". Both can be used as indices for the list genetic_code:

This is a second valid solution, using text vectors instead of numeric vectors

for(aa in names(genetic_code)) {
    codons_for_aa <- genetic_code[[aa]]
    max_count_for_aa  <- max(total_codon_count[codons_for_aa])
    for(codon in codons_for_aa) {
      codon_adaptness[codon] <- log(total_codon_count[codon]/max_count_for_aa)
    }
}

You can even read it as if it was written in English language. This answer can be shortened if we remember that we can use a vector as an index. This allows us to get rid of the internal for() loop:

for(aa in names(genetic_code)) {
    codons_for_aa <- genetic_code[[aa]]
    max_count_for_aa  <- max(total_codon_count[codons_for_aa])
    codon_adaptness[codons_for_aa] <- log(total_codon_count[codons_for_aa]/max_count_for_aa)
}

Finally, the code can be even shorter if the main for() loop goes directly on the list genetic_code. Remember that a for() loop can go over each element of a vector or a list.

for(codons_for_aa in genetic_code) {
    counts_for_aa <- total_codon_count[codons_for_aa]
    codon_adaptness[codons_for_aa] <- log(counts_for_aa/max(counts_for_aa))
}

That is three lines of code.

The main problem in this question was misunderstanding what was requested. Several people combined all codons in one big vector, using something like unlist(genetic_code). This does not work, because we need to calculate the most used codon of each amino-acid. If we mix all the codons then you cannot tell which codons correspond to each amino-acid.

It seems like some people believe that the question is “tricky”, as if the professor is intentionally making things harder, and you have to undo it. That is a wrong idea. In reality the professor gives you all the parts that you need to solve the problem easily. The hardest parts are already solved for you. You only need to do the easy part.

Remember that CAI is a real tool used for scientific research. You can read about it in the paper mentioned in the question, or you can search the internet for it. Moreover, there is a cai() function in R. The math in the exam question is a little more easy than the real CAI, just to make a nicer exam.

2.5 Evaluate CAI for one gene

Write a function, called gene_cai(), that takes a vector number_of_codons and returns the weighted average of codon_adaptness for each codon.

In other words, you must multiply each component of number_of_codons with the corresponding component of codon_adaptness, sum all the results, and then divide by the sum of all number_of_codons.

gene_cai <- function(number_of_codons, codon_adaptness) {
  # we need a place to store the sum of all results
  ans <- 0
  # for each component of `number_of_codons`
  for(i in 1:length(number_of_codons)) {
    # multiply each component of `number_of_codons` with the corresponding
    # component of `codon_adaptness` and sum all the results
    ans <- ans + codon_adaptness[i]*number_of_codons[i]
  }
  # then divide by the sum of all `number_of_codons`
  return(ans/sum(number_of_codons))
}

Some clever students remembered that you can multiply directly two vectors, and sum them, so the for() loop can be avoided.

gene_cai <- function(number_of_codons, codon_adaptness) {
  ans <- sum(number_of_codons * codon_adaptness)
  return(ans/sum(number_of_codons))
}

and then we can realize that ans is only used once, so we can get rid of it.

gene_cai <- function(number_of_codons, codon_adaptness) {
  return(sum(codon_adaptness * number_of_codons)/sum(number_of_codons))
}

That is two lines of code.

Some curious answers took the long way to calculate the sum of number_of_codons. For example some students wrote something like

acc <- 0
for (i in length(number_of_codons)) {
    acc_skew <- number_of_codons[i] + acc_skew
}

This is a correct answer, but unnecessarily complicated. Just say sum(number_of_codons). Another unnecessarily complicated code is

m <- rep(0, 64)
add <- 0
for(i in 1:64) {
    m[i] <- number_of_codons[i]*codon_adaptness[i]
    add <- m[i] + add
    ans <- add/acc
}

There is no need for the m vector. And the line ans <- add/acc does not need to be inside the for() loop, since we only care about it after we added all the m[i]. A short code, with the same idea, would be

add <- 0
for(i in 1:64) {
    add <- number_of_codons[i]*codon_adaptness[i] + add
}
ans <- add/acc

This is very similar to the first solution shown above.

2.6 Calculate CAI for all genes.

Calculate the Codon Adaptation Index for every gene and store these values in a vector called CAI.

This is similar to question 2.3. The only difference is that the result is a vector, not a list.

{r echo=FALSE} # prepare a vector to receive the answer CAI <- rep(NA, length(genes_codon_count)) for(i in 1:length(genes_codon_count)) { CAI[i] <- gene_cai(genes_codon_count[[i]], codon_adaptness) }

Like in question 2.3, this can be written in a single line using the function sapply, as in the following code:

CAI <- sapply(genes_codon_count, codon_adaptness)

The function sapply() works exactly like lapply(), except that it returns a vector instead of a list. It is worth learning about sapply(), lapply() and similar functions.

2.7 Find the gene with the smallest CAI

Write the code to find the index of the smallest value in the vector CAI. Assign that index in the variable index_of_smallest_CAI.

Here the only important thing is to use which.min() and not min(). We are not loonking for the smallest value. Instead we are looking for the index that points to the location of the smallest value.

index_of_smallest_CAI <- which.min(CAI)