The `seqinr` package is needed in all the following exercises.
Finding all ORFs in bacteria
Prepare an RMarkdown document with the following parts:
- Write a function named `find.codons()` with 3 inputs: a genome (a vector of char), a strand (a char, either “+” or “-”) and a frame number (either 0, 1 or 2). The function should return a vector with all codons in that reading frame. You can use the function `splitseq()`. Verify that all codons have length 3.
- Apply this function to the genome of E. coli for strands “+” and “-” and for frames 0, 1 and 2. Store the six results in a list named `codons` with 6 elements.
- Write a function that, given a list of codons, finds the positions of all start codons. You can use the function `which`. Apply this function to any of the elements of the list `codons` and store the result in a vector named `start`.
- Write a similar function to find the positions of the stop codons. Store them in a vector named `stop`.
- (Optional) Write a function that, given the vectors `start` and `stop`, returns a vector of the lengths of all ORFs. Notice that there may be more than one start codon for each stop codon. Which one is the real start?
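As a starting point, the first steps of the list above could be sketched as follows. This is only one possible approach, under several assumptions that are not part of the exercise: the toy sequence stands in for the E. coli genome, the “-” strand is read as the reverse complement of the input, only `atg` is treated as a start codon, and the helper names `find.starts()` and `find.stops()` are hypothetical.

```r
library(seqinr)

# Toy sequence (an assumption) standing in for the E. coli genome
genome <- s2c("atgaaaatgccctagcccatgtaa")

find.codons <- function(genome, strand, frame) {
  # Assumption: the "-" strand is the reverse complement of the input
  if (strand == "-") {
    genome <- rev(comp(genome))
  }
  codons <- splitseq(genome, frame = frame, word = 3)
  # Verify that every codon has exactly three letters
  stopifnot(all(nchar(codons) == 3))
  codons
}

# Six reading frames, stored in a list with names such as "+0" or "-2"
codons <- list()
for (strand in c("+", "-")) {
  for (frame in 0:2) {
    codons[[paste0(strand, frame)]] <- find.codons(genome, strand, frame)
  }
}

# Hypothetical helpers; only "atg" is counted as a start codon here
find.starts <- function(codons) which(codons == "atg")
find.stops <- function(codons) which(codons %in% c("taa", "tag", "tga"))

start <- find.starts(codons[["+0"]])
stop <- find.stops(codons[["+0"]])
```

With real data, `genome` would come from the FASTA file instead of `s2c()`. Bacteria also use alternative start codons, so the start-codon test may need to be extended.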
Combine a FASTA file and a GFF file to get the CDS sequences
Write an RMarkdown document that combines the following steps:
Read the genome of the bacterium from a FASTA file. For example, you can use `NC_000913.fna`. Store the first element of the result in a vector named `genome`.
Read the description of the features from a GFF file and store them in a data frame named `features`.
Write a function that takes the `features` data frame and returns a new data frame with one row for each CDS and three columns: start, stop and strand. Store the result in a variable named `CDS`.
Write a function that takes a number `n`, a data frame `CDS` and a vector `genome` and returns the nucleotide sequence of the gene described in row `n` of the data frame `CDS`.
Write a function that combines the `CDS` data frame and the `genome` vector to produce a list with the nucleotide sequences of all genes. Each gene sequence is a vector of char.
You can use `lapply` on the function from the previous question. Store the result in a variable named `CDS.sequences`.
Write a function that takes the `CDS.sequences` list and produces a list with the amino acid sequences of all proteins coded by the genes. Store the result in
You can assume that the genetic code is the bacterial one.
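The steps above can be tried on toy data before moving to the real files. The sketch below is hedged in several ways: the tiny genome and the `features` data frame are invented for illustration, the column names `V3`, `V4`, `V5` and `V7` assume the GFF file was read without a header (so that they hold the feature type, start, end and strand), and the names `get.CDS()`, `gene.sequence()` and `proteins` are hypothetical.

```r
library(seqinr)

# Toy genome (an assumption): one gene on each strand
genome <- s2c("ccatgaaatgagggttagggcatcc")

# Toy stand-in for a GFF table read without a header; only the
# columns used below are included (type, start, end, strand)
features <- data.frame(
  V3 = c("gene", "CDS", "CDS"),
  V4 = c(3, 3, 15),
  V5 = c(11, 11, 23),
  V7 = c("+", "+", "-"),
  stringsAsFactors = FALSE
)

# Keep only the CDS rows and rename the columns
get.CDS <- function(features) {
  cds <- features[features$V3 == "CDS", ]
  data.frame(start = cds$V4, stop = cds$V5, strand = cds$V7,
             stringsAsFactors = FALSE)
}
CDS <- get.CDS(features)

# Sequence of the gene in row n; genes on the "-" strand are
# returned as the reverse complement of the genome slice
gene.sequence <- function(n, CDS, genome) {
  s <- genome[CDS$start[n]:CDS$stop[n]]
  if (CDS$strand[n] == "-") {
    s <- rev(comp(s))
  }
  s
}

CDS.sequences <- lapply(seq_len(nrow(CDS)), gene.sequence,
                        CDS = CDS, genome = genome)

# numcode = 11 selects the bacterial genetic code in translate()
proteins <- lapply(CDS.sequences, translate, numcode = 11)
```

With the real files, `genome` and `features` would be read from the FASTA and GFF files, and the rest of the code should not need to change as long as the column positions match.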
Write the values of `CDS.sequences` into a FASTA file. See the function
Write another FASTA file with the amino acid sequences of the coded proteins. See the
The names of the genes in the FASTA output files can be a serial number. See the `paste()` function if you want to do something nicer.
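One way to write such a file is `write.fasta()` from `seqinr`; whether this is the function the exercise has in mind is an assumption. The sketch below uses two invented sequences, builds serial names with `paste()`, and writes to a temporary file so that nothing in the working directory is overwritten.

```r
library(seqinr)

# Toy list (an assumption) standing in for CDS.sequences
CDS.sequences <- list(s2c("atgaaatga"), s2c("atgccctaa"))

# Serial names built with paste(): "gene_1", "gene_2", ...
gene.names <- paste("gene", seq_along(CDS.sequences), sep = "_")

# Write to a temporary file; replace with a real path as needed
out.file <- tempfile(fileext = ".fna")
write.fasta(sequences = CDS.sequences, names = gene.names,
            file.out = out.file)

fasta.lines <- readLines(out.file)
```

The same call should work for the protein sequences, with a different output file.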
Finding Binding Sites
Write a function `score.position()` to evaluate the score of a position for a given position weight matrix. These matrices have 4 rows and 5 to 30 columns. Rows have names corresponding to nucleotides.
The function should work with any position weight matrix. For testing you can download the matrix here. You can read it with
PWM <- read.delim("PWM.txt", header=FALSE, row.names=1)
Inputs:
- `pos`: position in the genome
- `genome`: vector of chars
- `mat`: a position specific score matrix
Output: the score
Evaluate it at each position of the E. coli genome.
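A minimal sketch of such a function, on invented data, could look like the following. It assumes that the score of a position is the sum of the matrix entries selected by the nucleotides in the window starting at `pos`, and that the row names of the matrix use the same case as the genome letters. The toy matrix and toy genome are not part of the exercise.

```r
# Toy matrix (an assumption): 4 rows, 5 columns, scoring 1 for each
# letter of the motif "tataa" and 0 otherwise
mat <- matrix(0, nrow = 4, ncol = 5,
              dimnames = list(c("a", "c", "g", "t"), NULL))
mat["t", 1] <- 1
mat["a", 2] <- 1
mat["t", 3] <- 1
mat["a", 4] <- 1
mat["a", 5] <- 1

# Toy genome (an assumption) as a vector of chars
genome <- strsplit("ggtataacc", "")[[1]]

score.position <- function(pos, genome, mat) {
  mat <- as.matrix(mat)   # also accepts a data frame from read.delim()
  width <- ncol(mat)
  window <- genome[pos:(pos + width - 1)]
  # Pick one entry per column: the row of the observed nucleotide
  rows <- match(window, rownames(mat))
  sum(mat[cbind(rows, seq_len(width))])
}

# Score every position where a full window fits
positions <- seq_len(length(genome) - ncol(mat) + 1)
scores <- sapply(positions, score.position, genome = genome, mat = mat)
```

If the row names of the downloaded matrix are in upper case while the genome is in lower case, one of the two has to be converted first, for example with `tolower()`.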