Question 1 of the midterm exam dealt with *codon usage*. The first
three parts are identical to those of the final
exam, so we will not repeat the discussion here.

# 1.4 Absolute to relative frequencies

Write a function called `count_to_frequency()` that takes a single vector of integers (such as the number of times each codon appears) and returns a new vector of the same size with the relative frequencies. In other words, each value is divided by the total.

The input is the vector `number_of_codons`. The total is `sum(number_of_codons)`. Therefore the answer is just this:

```
count_to_frequency <- function(number_of_codons) {
  number_of_codons / sum(number_of_codons)
}
```
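A quick way to check the function is to try it on a small vector where the answer is easy to compute by hand. The sketch below uses made-up counts for three codons; the name `toy_count` and its values are assumptions for illustration, not data from the exam.

```r
# Same definition as in the answer above
count_to_frequency <- function(number_of_codons) {
  number_of_codons / sum(number_of_codons)
}

# Hypothetical counts for three codons (made-up numbers, total is 20)
toy_count <- c(AAA = 2, AAC = 6, AAG = 12)
toy_frequency <- count_to_frequency(toy_count)

print(toy_frequency)       # each count divided by 20; names are kept
print(sum(toy_frequency))  # relative frequencies add up to 1
```

The division is applied to every element at once, so no loop is needed, and the names of the input vector carry over to the result.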

# 1.5 Apply `count_to_frequency()`

Calculate the relative frequencies of codon usage for each gene, and in total. Create a vector called `cell_codon_frequency` using the function `count_to_frequency()` on `total_codon_count`. Then create a list called `gene_codon_frequency`. Each element of the list contains the result of using `count_to_frequency()` applied to each element of `genes_codon_count`.

The first part is trivial. We just “create a vector called `cell_codon_frequency` using the function `count_to_frequency()` on `total_codon_count`”, translating the English directly to R:

`cell_codon_frequency <- count_to_frequency(total_codon_count)`

The second part is easy, and follows a pattern we have seen before.
We create an empty list using the function `list()`. Notice
that preallocating a list of a predetermined size is less obvious than
preallocating a vector (it can be done with `vector("list", n)`). Here it
is not important, since the list grows automatically as we assign to new positions.

```
gene_codon_frequency <- list()
for(i in 1:length(genes_codon_count)) {
  gene_codon_frequency[[i]] <- count_to_frequency(genes_codon_count[[i]])
}
```

You can also use the more advanced function `lapply()`. It
lets us solve this kind of question in a single line.

`gene_codon_frequency <- lapply(genes_codon_count, count_to_frequency)`

These two pieces of code produce the same values, but the second is shorter and usually faster.
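To see that the loop and `lapply()` agree, the following sketch runs both on a small made-up list. The name `toy_genes_count` and its numbers are assumptions standing in for `genes_codon_count`. The loop here uses `seq_along()` rather than `1:length()`, a small variation that also behaves well when the list is empty.

```r
count_to_frequency <- function(number_of_codons) {
  number_of_codons / sum(number_of_codons)
}

# Hypothetical stand-in for `genes_codon_count` (made-up numbers)
toy_genes_count <- list(geneA = c(AAA = 1, AAC = 3),
                        geneB = c(AAA = 4, AAC = 4))

# Version 1: explicit loop that grows an empty list
by_loop <- list()
for (i in seq_along(toy_genes_count)) {
  by_loop[[i]] <- count_to_frequency(toy_genes_count[[i]])
}

# Version 2: lapply() applies the function to every element
by_lapply <- lapply(toy_genes_count, count_to_frequency)

# Same values; only the names of the list differ
print(all.equal(by_loop, unname(by_lapply)))
print(names(by_loop))    # NULL: the loop does not copy the gene names
print(names(by_lapply))  # lapply() keeps the names of its input
```

One practical difference is visible in the last two lines: `lapply()` keeps the names of the input list, while the loop version leaves the result unnamed unless we assign the names ourselves.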

# 1.6 Absolute distance

Write a function called `abs_distance(a, b)` that takes two vectors and returns the sum of the absolute values of each `a[i]` minus `b[i]`. The result is a single non-negative number.

There are several possible solutions. The first one uses an auxiliary variable to accumulate the sum:

```
abs_distance <- function(a, b) {
  add <- 0
  for(i in 1:length(a))
    add <- add + abs(a[i]-b[i])
  return(add)
}
```

For some people it is easier to think about building a vector with the absolute differences, and then adding all its elements:

```
abs_distance <- function(a, b) {
  ans <- rep(NA, length(a))
  for(i in 1:length(a)) {
    ans[i] <- abs(a[i]-b[i])
  }
  return(sum(ans))
}
```

If you remember that vectors can be combined with arithmetic
operations, you can rewrite the last solution without using
`for()`, like this:

```
abs_distance <- function(a, b) {
  ans <- abs(a-b)
  return(sum(ans))
}
```

In this version the `ans` vector is built in one step.
It is faster and shorter. If you want it even shorter, skip
`ans` and compute the sum directly:

```
abs_distance <- function(a, b) {
  sum(abs(a-b))
}
```
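Whichever version you choose, it is worth testing it on vectors small enough to verify by hand. The sketch below uses made-up vectors `a` and `b` (assumptions for illustration) and checks three properties that any distance of this kind should have.

```r
abs_distance <- function(a, b) {
  sum(abs(a - b))
}

# Made-up vectors, chosen so the result is easy to compute by hand
a <- c(1, 5, 2)
b <- c(4, 5, 0)

# |1-4| + |5-5| + |2-0| = 3 + 0 + 2
print(abs_distance(a, b))  # 5
print(abs_distance(a, a))  # 0: a vector is at distance zero from itself
print(abs_distance(b, a))  # 5: swapping the arguments changes nothing
```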

An example of a wrong answer:

```
## WRONG CODE
abs_distance <- function(a, b) {
  # what is `length(a, b)`?
  for(i in 1:length(a, b)) {
    add <- abs(a[i]-b[i])
    # the variable `add` gets only one value
    # it is updated on every loop
    # at the end it gets only the last `abs(a[i]-b[i])`
    # the rest is forgotten
  }
  return(sum(add))
  # this sum is only adding one number
  # the rest is forgotten
}
```

It is interesting to think about `length(a, b)`. In R this call is
an error, because `length()` takes a single argument. Probably
the student was thinking that we should consider the length of both
vectors. The question is “how to combine them?” If we do
`length(c(a, b))` we get a number that is too big:
`length(a) + length(b)`, which is longer than either vector.

It is safer to consider `max(length(a), length(b))`,
but we get into trouble again if one of the vectors is longer than the other,
since indexing the shorter one past its end gives `NA`. In fact
the distance only makes sense if both vectors have the same length. This
is always the case in this exam.
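Outside the exam, where equal lengths are not guaranteed, one option is to make the function refuse vectors of different lengths. The sketch below is a hypothetical variant, not part of the expected answer; the name `abs_distance_checked` is an assumption. The guard matters because the vectorized `a - b` recycles the shorter vector, and does so silently when one length is a multiple of the other.

```r
# Hypothetical variant, not part of the exam answer
abs_distance_checked <- function(a, b) {
  # Stop early instead of letting R recycle the shorter vector
  stopifnot(length(a) == length(b))
  sum(abs(a - b))
}

# Equal lengths: |1-3| + |2-2| + |3-1| = 4
print(abs_distance_checked(c(1, 2, 3), c(3, 2, 1)))

# Different lengths: the call fails, and we catch the error to show it
result <- tryCatch(abs_distance_checked(c(1, 2, 3, 4), c(1, 2)),
                   error = function(e) "lengths differ")
print(result)
```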

## 1.7 Calculate all distances

Calculate the distances between every vector in `gene_codon_frequency` and the vector `cell_codon_frequency`, and store them in a vector called `distance`. The result contains one entry for each gene.

This is again a typical pattern, in which the same function is applied to each element of a list, and we assign the result to a vector.

```
distance <- rep(NA, length(gene_codon_frequency))
for(i in 1:length(gene_codon_frequency))
  distance[i] <- abs_distance(gene_codon_frequency[[i]],
                              cell_codon_frequency)
```

As we discussed earlier, this can be done in one line with the
function `sapply()`. You should really learn it.

`distance <- sapply(gene_codon_frequency, abs_distance, cell_codon_frequency)`
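The `sapply()` call passes each element of the list as the first argument of `abs_distance()`, and the extra argument after the function name as the second. A minimal sketch with made-up frequencies shows this; `toy_gene_frequency` and `toy_cell_frequency` are hypothetical stand-ins for the exam objects.

```r
abs_distance <- function(a, b) sum(abs(a - b))

# Hypothetical stand-ins for the exam objects (made-up numbers)
toy_gene_frequency <- list(geneA = c(0.25, 0.75),
                           geneB = c(0.5, 0.5))
toy_cell_frequency <- c(0.5, 0.5)

# Each list element becomes `a`; toy_cell_frequency is passed on as `b`
toy_distance <- sapply(toy_gene_frequency, abs_distance, toy_cell_frequency)
print(toy_distance)  # named vector: geneA is 0.5, geneB is 0
```

Because the input list has names, the resulting vector has them too, which is convenient when looking up genes by distance later.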

## 1.8 Find the most different gene.

Write the code to find the name of the gene which has the greatest value in `distance`.

The greatest value in `distance` is found using

`max(distance)`

but we do not want that. We want the *position* of the greatest value

`which.max(distance)`

and then we use that position as an index for the vector `names(genes)`

`names(genes)[which.max(distance)]`
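The difference between the value and its position is easy to see on a small example. The sketch below uses a made-up named vector `toy_distance` (an assumption for illustration) and takes the names from the vector itself instead of from `genes`.

```r
# Made-up distances for three hypothetical genes
toy_distance <- c(geneA = 0.2, geneB = 0.9, geneC = 0.5)

print(max(toy_distance))        # the greatest value, 0.9
print(which.max(toy_distance))  # its position, 2
print(names(toy_distance)[which.max(toy_distance)])  # "geneB"
```

If several genes share the greatest value, `which.max()` returns only the first of them.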

## 1.9 (*Bonus*) Find the 6 genes most different from `cell_codon_frequency`

To get the most different genes, you need to sort the `distance` vector

`sort(distance, decreasing = TRUE)`

but this will give you all the genes, and we want only the first 6. We can use an index

`sort(distance, decreasing = TRUE)[1:6]`

or we can use the function `head()`, which (who would
guess) gives us the first 6 elements by default

`head(sort(distance, decreasing = TRUE))`

The problem is that we get the values, not the names. One solution is
to assign names to the `distance` vector, and then look at
the names of the top genes.

```
names(distance) <- names(genes)
names(head(sort(distance, decreasing = TRUE)))
```

Another possibility is to replace `sort()` by `order()`, which is a more general way to solve these
questions:

`names(genes)[head(order(distance, decreasing = TRUE))]`
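The two approaches can be compared side by side on a small example. In this sketch `toy_distance` and `toy_names` are made-up stand-ins for `distance` and `names(genes)`, and we ask for the top 2 instead of the top 6 by passing `n = 2` to `head()`.

```r
# Made-up distances for four hypothetical genes
toy_distance <- c(geneA = 0.2, geneB = 0.9, geneC = 0.5, geneD = 0.7)
toy_names <- names(toy_distance)

# sort() returns the sorted values, carrying their names along
top_by_sort <- names(head(sort(toy_distance, decreasing = TRUE), n = 2))

# order() returns the positions that would sort the vector
top_by_order <- toy_names[head(order(toy_distance, decreasing = TRUE), n = 2)]

print(top_by_sort)   # "geneB" "geneD"
print(top_by_order)  # the same two genes
```

The `order()` version is more general because the positions it returns can index any other vector or list aligned with `distance`, not only its names.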