A **Sequence Logo** shows the frequency of each residue and the level of conservation. This is the logo for the multiple alignment of a *transcription factor binding site*. See **www.prodoric.de**

December 4th, 2017

We align sequences of some class. The alignment can show what characterizes the class.

From the empirical (position-specific) frequencies in the alignment we can estimate the probability \[\Pr(s_k=\text{"A"}\vert s\text{ is a Binding Site})\] that position \(k\) in sequence \(s\) is an ‘A’, given that \(s\) is a binding site.
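As a minimal sketch of this estimate, the following counts the bases in each column of an alignment and divides by the number of sequences. The function name `column_frequencies`, the optional `pseudocount` parameter, and the toy alignment are illustrative assumptions, not part of the original material.

```python
from collections import Counter

ALPHABET = "ACGT"

def column_frequencies(alignment, pseudocount=0.0):
    """Return one dict per column mapping each base to its relative frequency."""
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(ALPHABET)
    freqs = []
    for k in range(n_cols):
        # Count how often each base appears at position k.
        counts = Counter(seq[k] for seq in alignment)
        freqs.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return freqs

# Hypothetical toy alignment, not real binding-site data.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
freqs = column_frequencies(sites)
print(freqs[0]["T"])  # 1.0  (every sequence has T in the first column)
print(freqs[3]["A"])  # 0.75 (three of four sequences have A in the fourth column)
```

With only a few sequences these frequencies are rough estimates; a small positive `pseudocount` avoids probabilities that are exactly zero.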

Not all probability distributions are equally informative. If \[\begin{aligned}\Pr(s_k=\text{"A"}\vert s\in BS)&=\Pr(s_k=\text{"C"}\vert s\in BS)\\ &=\Pr(s_k=\text{"G"}\vert s\in BS)\\&=\Pr(s_k=\text{"T"}\vert s\in BS)\end{aligned}\] then the residue at position \(k\) is uniformly random and the position tells us nothing. At the other extreme, if \[\Pr(s_k=\text{"A"}\vert s\in BS)=1 \qquad \Pr(s_k\not=\text{"A"}\vert s\in BS)=0\] then the position is completely determined and we know everything about it.

Some probability distributions give more information. How much?

We measure information in bits

Since the DNA alphabet \(\cal A\) has four symbols, we have at most 2 bits per position. The information at position \(k\) is

\[2+\sum_{X\in\cal A}\Pr(s_k=X\vert s\in BS)\log_2 \Pr(s_k=X\vert s\in BS)\]
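The formula can be checked on the two extreme cases described earlier. This is a small sketch; the function name `information_content` is an assumption, and terms with probability zero are skipped, following the usual convention that \(0\log_2 0=0\).

```python
import math

def information_content(probs):
    """Information of one DNA column in bits: 2 + sum of p * log2(p)."""
    # Terms with p == 0 contribute nothing (0 * log2(0) is taken as 0).
    return 2.0 + sum(p * math.log2(p) for p in probs if p > 0)

print(information_content([0.25, 0.25, 0.25, 0.25]))  # 0.0 bits: uniform column
print(information_content([1.0, 0.0, 0.0, 0.0]))      # 2.0 bits: fully conserved column
print(information_content([0.5, 0.5, 0.0, 0.0]))      # 1.0 bit: two equally likely bases
```

In a sequence logo, this value sets the total height of the stack of letters at each position.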

Bayes' theorem says that if \(X\) is any given sequence, then the probability that it is a binding site is \[\Pr(BS\vert X)=\frac{\Pr(X\vert BS)\Pr(BS)}{\Pr(X)}\] If we assume that the positions are independent of each other given that the sequence is a binding site, we also have \[\Pr(X\vert BS)=\Pr(s_1=X_1\vert BS)\Pr(s_2=X_2\vert BS)\cdots\Pr(s_n=X_n\vert BS)\] This is called the *naive Bayes* model.

Taking logarithms, and assuming that \(\Pr(X)\) also factorizes position by position, we have \[\begin{aligned} \log\Pr(BS\vert X)&=\log\frac{\Pr(X\vert BS)}{\Pr(X)}+\log\Pr(BS)\\ &=\sum_{k=1}^n\log\frac{\Pr(s_k=X_k\vert BS)}{\Pr(s_k=X_k)}+\text{Const.} \end{aligned}\] where the constant is \(\log\Pr(BS)\), which does not depend on \(X\). Thus the sequences that are most probably binding sites are also the ones with a large value of the sum.

The score of any sequence \(X\) will be \[\text{Score}(X)=\sum_{k=1}^n\log\frac{\Pr(s_k=X_k\vert BS)}{\Pr(s_k=X_k)}\] So the total score is the sum of the scores of the individual columns.

We can write the score function as a \(4\times n\) matrix, with one row per nucleotide and one column per position
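A minimal sketch of such a matrix follows. It makes several assumptions that the text does not fix: the background \(\Pr(s_k=X_k)\) is taken as uniform (0.25 per base), logarithms are base 2, and a pseudocount of 1 is added so that unseen bases do not give \(\log 0\). The names `score_matrix` and `score` and the toy alignment are hypothetical.

```python
import math

ALPHABET = "ACGT"

def score_matrix(sites, background=None, pseudocount=1.0):
    """Build a 4 x n log-odds matrix: matrix[base][k] = log2(p_k(base) / background(base))."""
    if background is None:
        background = {b: 0.25 for b in ALPHABET}  # assumed uniform background
    n = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = {b: [0.0] * n for b in ALPHABET}
    for k in range(n):
        column = [s[k] for s in sites]
        for b in ALPHABET:
            p = (column.count(b) + pseudocount) / total
            matrix[b][k] = math.log2(p / background[b])
    return matrix

def score(matrix, x):
    """Score(X): sum over positions of the matrix entry for the observed base."""
    return sum(matrix[base][k] for k, base in enumerate(x))

# Hypothetical toy alignment, not real binding-site data.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
m = score_matrix(sites)
print(score(m, "TATAAT"))  # sequence matching the consensus: high score
print(score(m, "GCGCGC"))  # unrelated sequence: low score
```

Because the score is a sum of independent column terms, scoring a sequence is just one table lookup per position.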

Score matrices like this are used to find new binding sites
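One simple way to do this, sketched below, is to slide a window of the motif length along a sequence and report every window whose score reaches a threshold. The matrix values, the threshold, and the function name `scan` are made up for illustration. The sketch only scans the given strand and expects sequences containing only A, C, G and T. It is not how any particular tool does it.

```python
def scan(matrix, sequence, threshold):
    """Return (start, window, score) for every window scoring at least `threshold`."""
    n = len(next(iter(matrix.values())))  # motif length = number of columns
    hits = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        s = sum(matrix[base][k] for k, base in enumerate(window))
        if s >= threshold:
            hits.append((start, window, s))
    return hits

# Hypothetical 3-column log-odds matrix that favours the word "TAT".
toy_matrix = {
    "A": [-1.0, 1.5, -1.0],
    "C": [-1.0, -1.0, -1.0],
    "G": [-1.0, -1.0, -1.0],
    "T": [1.5, -1.0, 1.5],
}

hits = scan(toy_matrix, "GGTATCCTATG", threshold=4.0)
for start, window, s in hits:
    print(start, window, s)  # prints "2 TAT 4.5" and "7 TAT 4.5"
```

The choice of threshold trades false positives against missed sites.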

- Go to http://meme-suite.org/
- Use *MAST* to find all instances of *Fur* motifs on *S. pombe*
- What are the differences between *FIMO*, *MAST*, *MCAST* and *GLAM2Scan*?
- What are the differences between *MEME*, *DREME*, *MEME-ChIP*, *GLAM2* and *MoMo*?