Transfac is a database of transcription factor binding sites. Entries are stored in text format, as follows:
ID prodoric_MX000038
BF Streptococcus pyogenes
P0 A T G C
00 8 6 3 3 n
01 11 7 1 1 a
02 8 7 2 3 n
03 7 7 3 3 n
04 19 0 1 0 a
05 0 19 1 0 t
06 0 20 0 0 T
07 19 1 0 0 a
08 11 1 6 2 a
09 18 1 1 0 a
10 9 4 5 2 n
XX
//
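Each numbered row represents one position of the binding site motif. Each entry in a row is the number of times the corresponding nucleotide (in the column order given by the P0 line) is observed at that position among all experimentally verified binding sites. The letter at the end of each row is the consensus for that position.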
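The layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name parse_transfac is made up for illustration, and it assumes one entry per text, four count columns, and a header line starting with P0 (or PO, which some files use).

```python
# The example entry from above, one field group per line.
ENTRY = """\
ID prodoric_MX000038
BF Streptococcus pyogenes
P0 A T G C
00 8 6 3 3 n
01 11 7 1 1 a
02 8 7 2 3 n
03 7 7 3 3 n
04 19 0 1 0 a
05 0 19 1 0 t
06 0 20 0 0 T
07 19 1 0 0 a
08 11 1 6 2 a
09 18 1 1 0 a
10 9 4 5 2 n
XX
//
"""


def parse_transfac(text):
    """Return (matrix_id, columns, counts) for one Transfac-style entry.

    counts is a list with one dict per motif position, mapping each
    nucleotide named in the P0 line to its observed count.
    """
    matrix_id, columns, counts = None, [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag == "ID":
            matrix_id = fields[1]
        elif tag in ("P0", "PO"):
            # Column order is taken from the file, not assumed.
            columns = fields[1:5]
        elif tag.isdigit():
            # A numbered row: four counts, then the consensus letter (ignored).
            row = {nt: float(v) for nt, v in zip(columns, fields[1:5])}
            counts.append(row)
        elif tag == "//":
            break  # end of entry
    return matrix_id, columns, counts
```

Calling parse_transfac(ENTRY) gives the identifier, the column order A T G C, and 11 rows whose counts each add up to 20.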
Basically, we have 𝑛 binding site sequences (20 in this example) for a single transcription factor, all of them verified experimentally. We want to generalize them into a Position Specific Scoring Matrix (PSSM), which gives the score of each nucleotide at each position.
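One common way to turn counts into scores is a log-odds matrix with pseudocounts. The sketch below is an assumption, not necessarily the method taught in class: the pseudocount of 1, the uniform background, and the name counts_to_pssm are all choices made here for illustration.

```python
import math


def counts_to_pssm(counts, pseudocount=1.0, background=None):
    """Convert per-position counts into log2-odds scores.

    counts: list of dicts, one per motif position, nucleotide -> count.
    pseudocount: added to every count so that unseen nucleotides do not
        produce log(0) (assumed value, adjust as needed).
    background: nucleotide -> expected frequency; uniform if omitted.
    """
    pssm = []
    for row in counts:
        bg = background or {nt: 1.0 / len(row) for nt in row}
        total = sum(row.values()) + pseudocount * len(row)
        scores = {}
        for nt, count in row.items():
            frequency = (count + pseudocount) / total
            scores[nt] = math.log2(frequency / bg[nt])
        pssm.append(scores)
    return pssm
```

With this convention a positive score means the nucleotide is seen more often than the background predicts, and a negative score means it is seen less often. Because each row is processed independently, the function works for any number of rows.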
Please load this matrix into a spreadsheet and evaluate the score according to Class 28.
Next, write a program that does the same as the spreadsheet of the previous question. In this case the input may be different, and it may have a different number of rows than the example.
You can find other matrices to try at https://www.prodoric.de/matrix/. Be sure to download them in the Transfac format.
To test whether a given sequence is likely to be a binding site, we need to calculate its score. Write a program that reads a Position Specific Scoring Matrix and a genomic sequence (in FASTA format), and calculates the score of every position in the genome. The output should be a list of (position, score) pairs.
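The scan can be sketched as a sliding window: at each start position, add up the scores of the nucleotides found in the window. The code below is a minimal sketch under several assumptions: only the first FASTA record is read, only the forward strand is scanned, positions are reported 1-based, and windows containing letters missing from the matrix (such as N) are skipped. The names read_fasta and scan are illustrative.

```python
def read_fasta(lines):
    """Return the sequence of the first FASTA record as one uppercase string."""
    parts = []
    seen_header = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts here; stop reading
            seen_header = True
        elif line:
            parts.append(line.upper())
    return "".join(parts)


def scan(pssm, sequence):
    """Score every window of len(pssm) letters; return (position, score) pairs."""
    width = len(pssm)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Skip windows with letters the matrix has no score for.
        if any(nt not in row for nt, row in zip(window, pssm)):
            continue
        score = sum(row[nt] for nt, row in zip(window, pssm))
        results.append((start + 1, score))  # 1-based position
    return results
```

A file can be scanned with something like scan(pssm, read_fasta(open("genome.fasta"))), where genome.fasta is a placeholder file name. To cover the reverse strand as well, the same function could be run on the reverse complement of the sequence.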
(Bonus) How would you choose a threshold for the score, so we may show only significant positions and scores?
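One possible answer, offered only as a sketch: estimate how scores are distributed when there is no real binding site, then keep only scores that are rare under that distribution. The code below builds such a null distribution by shuffling the genome, which preserves its nucleotide composition, and takes a high quantile as the threshold. The quantile of 0.999, the single shuffle, and the function names are assumptions made for illustration; the sequence is assumed to contain only letters present in the matrix.

```python
import random


def window_scores(pssm, sequence):
    """Return the score of every window of len(pssm) letters."""
    width = len(pssm)
    return [
        sum(row[nt] for nt, row in zip(sequence[i:i + width], pssm))
        for i in range(len(sequence) - width + 1)
    ]


def empirical_threshold(pssm, sequence, quantile=0.999, seed=0):
    """Score a shuffled copy of the sequence and return the given quantile."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    letters = list(sequence)
    rng.shuffle(letters)  # same composition, no real binding sites
    null = sorted(window_scores(pssm, "".join(letters)))
    if not null:
        raise ValueError("sequence is shorter than the motif")
    index = min(len(null) - 1, int(quantile * len(null)))
    return null[index]
```

Positions whose score is above the returned value are then the ones worth reporting. Other choices are possible, for example a fixed fraction of the maximum achievable score, or a threshold calibrated on the scores of the known binding sites themselves.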