Blog of Andrés Aravena
CMB2:

Homework 5

11 March 2020. Deadline: Thursday, 19 March, 12:30.

We need to do a lot of exercises to be ready for the midterm. Here you have several of them. Some can be answered in a short time, others require more thinking. Start thinking about all of them now. The deadline applies only to the short term questions. Long term questions should be answered before the midterm exam.

Please use the official template for answers.

Short term questions

Calculate the GC content for only part of the genome

Instead of the whole genome, we look only through a window. That is, we look only at a region of the genome, with a fixed size, starting at a given position. For example, we examine only the genome region starting at position 250000, and we look at only 100 letters. That is, only the letters in the positions given by seq(from=250000, length=100).
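To see how seq() picks the positions of a window, here is a small sketch on a made-up sequence. The vector toy is hypothetical and much shorter than a real genome; it only illustrates the indexing.

```r
# Hypothetical toy sequence, one letter per element
toy <- c("a", "t", "g", "c", "g", "g", "a", "t", "c", "c")

# Indices of a window that starts at position 3 and covers 4 letters
idx <- seq(from = 3, length.out = 4)

# Letters inside the window: positions 3, 4, 5 and 6
window <- toy[idx]

print(idx)
print(window)
```

The same idea applies to the genome: only the starting position and the window size change.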

The result should depend on:

  • the genomic sequence
  • the position of the window
  • the size of the window

Write a function called window_gc_content() that takes sequence, position, and size as input, and returns a single value with the window GC content. You can test this function with the genome of E. coli following these steps:

  • Download the genome of E. coli from NCBI or from the blog. Take note of the folder where the file is downloaded. Different web browsers may use different folders.

  • Load library(seqinr). If you do not have it installed, please install it.

  • Set your working directory to the folder where the file was downloaded.

  • Read the sequences with the command sequences <- read.fasta("NC_000913.fna"). Be careful: the file may have a different name on your computer.

  • Then you can test using the command

    window_gc_content(sequences[[1]], 250000, 100)
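One possible way to write this function is sketched below. It is not the only valid answer. It assumes that the window fits completely inside the sequence and that the GC content is the number of G and C letters divided by the window size. The vector toy is a hypothetical stand-in for the genome, so the block runs without downloading anything.

```r
# Sketch: GC content of a single window.
# sequence: vector of single letters
# position: index of the first letter of the window
# size:     number of letters in the window
window_gc_content <- function(sequence, position, size) {
  idx <- seq(from = position, length.out = size)
  # read.fasta() stores DNA in lowercase by default; tolower() also
  # accepts uppercase input
  window <- tolower(sequence[idx])
  n_gc <- sum(window == "g" | window == "c")
  return(n_gc / size)
}

# Hypothetical toy sequence used instead of the real genome
toy <- c("a", "t", "g", "c", "g", "g", "a", "t", "c", "c")
print(window_gc_content(toy, 1, 4))   # window is a, t, g, c
print(window_gc_content(toy, 3, 4))   # window is g, c, g, g
```

The first call should give 0.5 and the second one 1, since two of four and four of four letters are G or C.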

Using window_gc_content() in many places

We want to evaluate window_gc_content() at different positions of the genome. Specifically, we want to evaluate it at these positions:

positions <- seq(from=1, to=length(genome)-window_size, by= window_size)

Obviously, the result depends on genome and window_size. Please write a function that takes genome and window_size as inputs, and returns a vector with the GC content of the window at each of these positions.
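A minimal sketch of this idea follows. The name all_windows_gc_content is hypothetical, since the text does not name the function. The block also includes a compact window_gc_content() so that it runs on its own, and it assumes lowercase letters in the sequence.

```r
# Compact helper so this block is self-contained (one possible version)
window_gc_content <- function(sequence, position, size) {
  window <- sequence[seq(from = position, length.out = size)]
  sum(window %in% c("g", "c")) / size
}

# Hypothetical name: GC content of every window along the genome
all_windows_gc_content <- function(genome, window_size) {
  positions <- seq(from = 1, to = length(genome) - window_size, by = window_size)
  answer <- rep(NA, length(positions))
  for (i in seq_along(positions)) {
    answer[i] <- window_gc_content(genome, positions[i], window_size)
  }
  return(answer)
}

# Hypothetical toy sequence used instead of the real genome
toy <- c("a", "t", "g", "c", "g", "g", "a", "t", "c", "c")
print(all_windows_gc_content(toy, 2))   # 0 1 1 0
```

With this toy input the positions are 1, 3, 5 and 7, so the last two letters are not part of any window. That is a consequence of using to=length(genome)-window_size in the definition of positions.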

GC Skew

Write a function that takes a list of genes and calculates the ratio (nG-nC)/(nG+nC) for each gene, where nG and nC are the numbers of G and C letters in the gene. The function should be called gene_gc_skew and take only one input: a list called genes. What should be the output?
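As a hedged sketch, one reasonable choice is to return a numeric vector with one skew value per gene. The block below assumes that each gene is a vector of lowercase letters and that nG and nC count the letters "g" and "c". The list genes is made up for illustration.

```r
# Sketch: GC skew of each gene in a list
gene_gc_skew <- function(genes) {
  answer <- rep(NA, length(genes))
  for (i in seq_along(genes)) {
    nG <- sum(genes[[i]] == "g")
    nC <- sum(genes[[i]] == "c")
    # A gene without any G or C gives 0/0, which R reports as NaN
    answer[i] <- (nG - nC) / (nG + nC)
  }
  return(answer)
}

# Hypothetical toy genes
genes <- list(c("a", "t", "g", "g", "g", "c"),
              c("g", "c", "c", "c", "t", "a"))
print(gene_gc_skew(genes))   # 0.5 -0.5
```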

Long term questions

Algorithm design

In many important cases we have a vector x with growing values. That is, each value is bigger than or equal to the previous one, so

x[i+1] >= x[i]

for all values of the index i. It is easy to see that the position of the minimum value has to be 1. We also know that the position of the maximum value is the last position. What about the position of the half value?

The half value is the average of the minimum and the maximum. For example, if x is the vector c(1, 4, 4, 6, 10, 15) then the half value is (1+15)/2, that is 8.

The position of the half value of the vector x is the index of the first value that is bigger than or equal to the half value of x. In the example the position of the half value is 5, since x[5] is the first value that is bigger than or equal to 8.

Please write a function called position_of_half(), with one input called x. The function must return a single number, which is the index of the smallest value in x that is bigger than or equal to the average of minimum and maximum of x.

You can test your functions with the following code.

x <- 1:9
position_of_half(x)
position_of_half(x + 20)
position_of_half(x * x)
position_of_half(sqrt(x))

The answers should be 5, 5, 7, 4, respectively.
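One simple way to get these answers is sketched here. It walks through x from the left and stops at the first value that reaches the half value. It assumes that x is not empty and is already sorted, so the loop always stops at or before the last position. Faster approaches exist and are worth thinking about.

```r
# Sketch: index of the first value that is >= the half value of x
position_of_half <- function(x) {
  half <- (min(x) + max(x)) / 2
  i <- 1
  # x is sorted, so the first value that is not below half is the answer
  while (x[i] < half) {
    i <- i + 1
  }
  return(i)
}

x <- 1:9
print(position_of_half(x))                         # 5
print(position_of_half(x * x))                     # 7
print(position_of_half(c(1, 4, 4, 6, 10, 15)))     # 5
```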

Merge two sorted vectors

Please write a function called vector_merge(x, y) that receives two sorted vectors x and y and returns a new vector with all the elements of x and y in sorted order. The output vector has size length(x)+length(y).

You must assume that each of the input vectors is already sorted.

In your code you have to use three indices, i, j, and k, to point into x, y, and the output vector answer, respectively. On each step you have to compare x[i] and y[j]. If x[i] < y[j] then you make answer[k] <- x[i], otherwise make answer[k] <- y[j].

You have to increment i or j, and k carefully. To test your function, you can use this code:

x <- c("a", "d", "e", "h", "i", "k", "m", "s", "t", "u", "v", "w", "z")
y <- c("b", "c", "f", "g", "j", "l", "n", "o", "p", "q", "r", "x", "y")
vector_merge(x, y)

The output must be the alphabet in sorted order:

"a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
"n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

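A minimal sketch of the merge described above is shown next. It follows the three-index idea from the text. The two loops at the end, which copy whatever is left once one of the inputs is used up, are one way to handle the "carefully" part and are an assumption about how to finish the job, not the only option.

```r
# Sketch: merge two already sorted vectors into one sorted vector
vector_merge <- function(x, y) {
  # Empty output vector of the same kind as x, with room for everything
  answer <- vector(mode = mode(x), length = length(x) + length(y))
  i <- 1   # points into x
  j <- 1   # points into y
  k <- 1   # points into answer
  # While both inputs still have elements, take the smaller one
  while (i <= length(x) && j <= length(y)) {
    if (x[i] < y[j]) {
      answer[k] <- x[i]
      i <- i + 1
    } else {
      answer[k] <- y[j]
      j <- j + 1
    }
    k <- k + 1
  }
  # Copy the rest of x, if any
  while (i <= length(x)) {
    answer[k] <- x[i]
    i <- i + 1
    k <- k + 1
  }
  # Copy the rest of y, if any
  while (j <= length(y)) {
    answer[k] <- y[j]
    j <- j + 1
    k <- k + 1
  }
  return(answer)
}

print(vector_merge(c(1, 3, 5), c(2, 4)))   # 1 2 3 4 5
```

Each step moves exactly one of i or j and always moves k, so the function finishes after length(x)+length(y) steps.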