Blog of Andrés Aravena
CMB2:

Homework for class 8

24 March 2016

The library seqinr is needed in all the following exercises.

Finding all ORFs in bacteria

Prepare a RMarkdown document with the following parts:

  1. Write a function named find.codons() with 3 inputs: a genome (a vector of char), a strand (a char, either “+” or “-”) and a frame number (either 0, 1 or 2). The function should return a vector with all codons in that reading frame. You can use the function splitseq(). Verify that all codons have length 3.
  2. Apply this function on the genome of E.coli for strands “+” and “-” and for frame 0, 1 and 2. Store the six results on a list named codons with 6 elements.
  3. Write a function that, given a list of codons, finds the position of all start codons. You can use the function which. Apply this function to any of the elements of the list codons and store the result in a vector named start.
  4. Write a similar function to find the position of the stop codons. Store them in a vector named stop.
  5. (Optional) Write a function that, given the vectors start and stop, returns a vector of the lengths of all ORFs. Notice that there may be more than one start codon for each stop codon. Which one is the real start?

Combine a FASTA file and a GFF file to get the CDS sequences

Write a RMarkdown document that combines the following steps

  1. Read the genome of the bacteria from a FASTA file. For example you can use NC_000913.fna. Store the first element of the result in a vector named genome.

  2. Read the description of the features from a GFF file and store them in a data frame named features.

  3. Write a function that takes the features data frame and returns a new data frame with one row for each CDS and three columns: start, stop and strand. Store the result in a variable named CDS.

  4. Write a function that takes a number n, a data frame CDS and a vector genome and returns the nucleotidic sequence of the gene described in the line n of the data frame CDS.

  5. Write a function that combines the CDS data frame and the genome vector to produce a list with the nucleotidic sequences of all genes. Each gene sequence is a vector of char.

    You can use lapply on the function from the previous question. Store the result in a variable named CDS.sequences.

  6. Write a function that takes the CDS.sequences list and produces a list with the aminoacidic sequences of all proteins coded genes. Store the result in protein.sequences.

    You can assume that the genetic code is the bacterial one.

  7. Write the values of CDS.sequences into a FASTA file. See the function write.fasta().

  8. Write another FASTA file with the aminoacidic sequence of the coded proteins. See the translate() function.

The names of the genes in the FASTA output files can be a serial number. See the paste() function if you want to do something nicer.

Finding Binding Sites

  1. Write a function score.position() to evaluate the score of a position for a given position weight matrix. These matrices have 4 rows and 5 to 30 columns. Rows have names corresponding to nucleotides.

    The function should work with any position weight matrix. For testing you can download the matrix here. You can read it with

    PWM <- read.delim("PWM.txt", header=FALSE, row.names=1)

    Inputs:

    • pos: position in the genome
    • genome: vector of chars
    • mat: a position specific score matrix

    Output: the score

  2. Evaluate it on each position of E.coli genome.