December 4th, 2017

## Graphical representation of multiple alignment

A sequence logo shows the frequency of each residue at each position and the level of conservation. This is the logo for the multiple alignment of a transcription factor binding site. See www.prodoric.de

## Alignment shows what is shared among sequences

We align sequences of some class. The alignment can show what characterizes the class.

From the empirical (position-specific) frequencies we can estimate the probability $\Pr(s_k=\text{"A"}\vert s\text{ is a Binding Site})$ that position $$k$$ in sequence $$s$$ is an ‘A’, given that $$s$$ is a binding site.
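The estimation step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `column_frequencies` and the toy alignment `sites` are made up for the example, and the sites are assumed to be already aligned and of equal length.

```python
from collections import Counter

ALPHABET = "ACGT"

def column_frequencies(sites):
    """Return one dict per alignment column mapping residue -> relative frequency."""
    n = len(sites[0])
    if any(len(s) != n for s in sites):
        raise ValueError("all aligned sites must have the same length")
    freqs = []
    for k in range(n):
        # Count how often each residue appears in column k.
        counts = Counter(s[k] for s in sites)
        freqs.append({x: counts[x] / len(sites) for x in ALPHABET})
    return freqs

# Hypothetical toy alignment, not real binding sites.
sites = ["ACGT", "ACGA", "ACTT", "AAGT"]
freqs = column_frequencies(sites)

# Empirical estimate of Pr(s_k = "C" | s is a binding site) for the second column.
print(freqs[1]["C"])
```

With only a handful of sequences these estimates are rough, and a residue that never appears in a column gets probability zero.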

## Conservation is seen in the probabilities

Not all probability distributions are equally informative. If

\begin{aligned}\Pr(s_k=\text{"A"}\vert s\in BS)&=\Pr(s_k=\text{"C"}\vert s\in BS)\\ &=\Pr(s_k=\text{"G"}\vert s\in BS)\\&=\Pr(s_k=\text{"T"}\vert s\in BS)\end{aligned}

then the residue at position $$k$$ is completely random and the position tells us nothing. At the other extreme, if

$\Pr(s_k=\text{"A"}\vert s\in BS)=1 \qquad \Pr(s_k\neq\text{"A"}\vert s\in BS)=0$

then the position is perfectly conserved and we know the residue with certainty.

## Conservation level is measured in bits

We measure information in bits.

Since the DNA alphabet $$\cal A$$ has four symbols, we have at most $$\log_2 4=2$$ bits per position. The information content of position $$k$$ is

$2+\sum_{X\in\cal A}\Pr(s_k=X\vert s\in BS)\log_2 \Pr(s_k=X\vert s\in BS)$
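As a quick check of the formula, the following sketch evaluates it for three hand-picked columns. The function name `information_content` is chosen for this example, and terms with probability zero are skipped, following the usual convention that $$0\log_2 0=0$$.

```python
import math

def information_content(probs):
    """Information content in bits of one DNA column: 2 + sum_x p(x) * log2 p(x)."""
    # Zero-probability residues contribute nothing (0 * log2(0) is taken as 0).
    return 2.0 + sum(p * math.log2(p) for p in probs if p > 0)

uniform = information_content([0.25, 0.25, 0.25, 0.25])  # completely random column
conserved = information_content([1.0, 0.0, 0.0, 0.0])    # always the same residue
mixed = information_content([0.5, 0.5, 0.0, 0.0])        # two equally likely residues

print(uniform, conserved, mixed)  # 0.0 2.0 1.0
```

The random column carries 0 bits, the perfectly conserved column carries the full 2 bits, and a column split evenly between two residues carries 1 bit.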

## Using this result to look for more sequences

Basic probability theory (Bayes' theorem) says that if $$X$$ is any given sequence, then the probability that it is a binding site is $\Pr(BS\vert X)=\frac{\Pr(X\vert BS)\Pr(BS)}{\Pr(X)}$ Under the hypothesis that the positions are independent of each other given that the sequence is a binding site, we also have $\Pr(X\vert BS)=\Pr(s_1=X_1\vert BS)\Pr(s_2=X_2\vert BS)\cdots\Pr(s_n=X_n\vert BS)$ This is called the naive Bayes model.
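The two formulas can be combined in a small sketch. Everything numeric here is assumed for illustration: the table `site_probs`, the prior `prior=0.01`, and a uniform background for sequences that are not binding sites. The denominator $$\Pr(X)$$ is expanded with the law of total probability over the two cases "binding site" and "not a binding site", which is one possible way to evaluate it.

```python
import math

# Hypothetical position-specific probabilities Pr(s_k = x | BS) for sites of length 3.
site_probs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

def likelihood(x, probs):
    """Pr(X | BS) under the naive Bayes assumption: a product over positions."""
    if len(x) != len(probs):
        raise ValueError("sequence length must match the number of columns")
    return math.prod(col[c] for col, c in zip(probs, x))

def posterior(x, probs, prior, background=0.25):
    """Pr(BS | X) by Bayes' theorem.

    Assumption: sequences that are not binding sites are i.i.d. with the same
    probability `background` for every residue.
    """
    p_x_given_bs = likelihood(x, probs)
    p_x_given_not_bs = background ** len(x)
    p_x = p_x_given_bs * prior + p_x_given_not_bs * (1.0 - prior)
    return p_x_given_bs * prior / p_x

p_site = posterior("AGC", site_probs, prior=0.01)
p_other = posterior("CAT", site_probs, prior=0.01)
print(p_site, p_other)
```

A sequence that matches the frequent residues in each column gets a higher posterior than one that does not, even though the prior is the same for both.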

## Scoring

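Taking logarithms, we have

\begin{aligned} \log\Pr(BS\vert X)&=\log\frac{\Pr(X\vert BS)}{\Pr(X)}+\log\Pr(BS)\\ &=\sum_{k=1}^n\log\frac{\Pr(s_k=X_k\vert BS)}{\Pr(s_k=X_k)}+\text{Const.} \end{aligned}

where the second line uses the naive Bayes factorization of $$\Pr(X\vert BS)$$, assumes that $$\Pr(X)$$ factorizes over positions in the same way, and writes $$\log\Pr(BS)$$, which does not depend on $$X$$, as a constant. Thus the sequences that are most probably binding sites are those with a large value of this sum.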

## Position Specific Scoring Matrices

The score of any sequence $$X$$ will be $\text{Score}(X)=\sum_{k=1}^n\log\frac{\Pr(s_k=X_k\vert BS)}{\Pr(s_k=X_k)}$ So the total score is the sum of the scores of the individual columns.

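We can write the score function as a $$4\times n$$ matrix, with one row per nucleotide and one column per position. The entry in row $$x$$ and column $$k$$ is $$\log\frac{\Pr(s_k=x\vert BS)}{\Pr(s_k=x)}$$.

These position-specific scoring matrices are used to find new binding sites: we slide a window of length $$n$$ along a sequence and report the windows with a high score.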
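The following sketch builds such a matrix from a toy alignment and scans a sequence with it. Several choices are assumptions made for the example and are not part of the derivation: the function names, the uniform background probabilities, the use of base-2 logarithms, the threshold value, and the pseudocount added to every cell so that residues never seen in a column do not produce $$\log 0$$.

```python
import math

ALPHABET = "ACGT"

def build_pssm(sites, background=None, pseudocount=1.0):
    """Return a 4 x n matrix as {residue: [score in column 0, ..., column n-1]}."""
    if background is None:
        background = {x: 0.25 for x in ALPHABET}  # assumption: uniform background
    n = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pssm = {x: [] for x in ALPHABET}
    for k in range(n):
        column = [s[k] for s in sites]
        for x in ALPHABET:
            # Smoothed estimate of Pr(s_k = x | BS); the pseudocount avoids log(0).
            p = (column.count(x) + pseudocount) / total
            pssm[x].append(math.log2(p / background[x]))
    return pssm

def score(window, pssm):
    """Sum of the column scores for a window whose length equals the matrix width."""
    return sum(pssm[x][k] for k, x in enumerate(window))

def scan(sequence, pssm, threshold):
    """Return (offset, score) for every window scoring at least `threshold`."""
    n = len(pssm["A"])
    hits = []
    for i in range(len(sequence) - n + 1):
        s = score(sequence[i:i + n], pssm)
        if s >= threshold:
            hits.append((i, s))
    return hits

# Hypothetical toy alignment and target sequence.
sites = ["ACGT", "ACGA", "ACTT", "AAGT"]
pssm = build_pssm(sites)
hits = scan("TTACGTTT", pssm, threshold=2.0)
print(hits)
```

In this toy example the only window that passes the threshold starts at offset 2, where the target sequence contains `ACGT`. Choosing the threshold is a separate problem that the sketch does not address.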

## Exercise

• Go to http://meme-suite.org/
• Use MAST to find all instances of Fur motifs in S. pombe
• What are the differences between FIMO, MAST, MCAST and GLAM2Scan?
• What are the differences between MEME, DREME, MEME-ChIP, GLAM2 and MoMo?