Comments on Question 2.4 – Final Exam – Blog of Andrés Aravena

To explain question 2.4 I will show an example. First, please take a look at the description from the original question. Here we have numbered the steps, so we can talk about them.

For each amino-acid AA:
1. Find the codons that encode AA it and store them in a vector called codons_for_aa.
2. Find the maximum of total_codon_count among the codons in codons_for_aa. Look only at these codons, ignore the rest. Store this maximum in max_count_for_aa.
3. For each codon in codons_for_aa:
  1. Find the value of total_codon_count for that particular codon.
  2. Divide it by max_count_for_aa.
  3. Calculate the logarithm of the result of the last division, and store it in codon_adaptness using the codon as the index.
  4. Repeat for each codon in codons_for_aa.
4. Repeat for each amino acid.

If you translate this description from English to R, then you will have the answer.

From this description you can see that there are two for() loops. They are nested, one inside the other. Remember that in this case you must use different names of variables on each loop. For example i and j —or any other word. You cannot use i and i, because the computer will be confused. It is a good idea to choose names that remind you of the things you are looking at (amino-acid, codon).

The key of this question is to use genetic_code as a tool, not as an problem. Some people tries to use unlist() —a command that we never used in classes— to transform the list into a vector. But in that case you get all the codons, not only the ones for a specific amino acid. Too complicated, and wrong.

The vector names(genetic_code) has all the amino-acid’s names. Each element of the list genetic_code is a vector, containing only the codons of the corresponding amino-acid. If aa is the name of an amino-acid, then genetic_code[[aa]] will be a vector containing the codons that encode aa.

The first loop must take each amino-acid name, one by one. It will go through "Ala", then "Arg", then "Asn", up to "Val".

The vector total_codon_count can be indexed using the codon names. Thus, if you use codons_for_aa as an index for total_codon_count, then you will be working only among the codons in codons_for_aa, as requested.

Let’s follow the instructions step by step.

Step 1 says “Find the codons that encode one amino-acid”. For example, let’s say that the main for() loop has taken the amino-acid name "Val". Using this name as an index for genetic_code, we can find that the codons for this amino-acid are "gta", "gtc", "gtg", "gtt". We store these values in the vector codons_for_aa. Obviously we do not do this by hand. Instead you tell the computer to do it for us.

Step 2 says “Find the maximum of total_codon_count among the codons in codons_for_aa.” Let’s say that there are 560 valines in total (in all the genes). We use codons_for_aa as index for total_codon_count and get the count of each codon:

gta            gtc           gtg          gtt
410            110            10           30

This small vector is what we call “only the codons of valine”. We ignore for now the rest of the codons, since in this part we are working only with valine.

The maximum value of this small vector is 410. We store that value in max_count_for_aa. We finished step 2.

In step 3 we have a second for() loop, going through the codons in codons_for_aa one by one. Let’s say that the codon is "gtc" (we work on the other codons later). In step 3.1 we use this codon as an index to look inside total_codon_count and we find the value 110. That is the count of "gtc".

In step 3.2 we divide that value by max_count_for_aa. We get 110/410 or in other words 0.2682927. In step 3.3 we calculate log(110/410) and we get -1.315677. We store this calculated value into the corresponding element of vector codon_adaptness. We have to be careful to use the correct index.

At the end, as usual, it is all about using the correct index in the correct order.

We do this for each codon of each amino-acid in genetic_code.

(Notice that this is not the relative frequency. We are calculating something else here, so the formula is something else.)