15 November 2017

## We have a set of nucleotides

T, C, A, A, A, G, C, A, A, T, G, A, T, T, A, G, G, C, T, A, C, A, G, G, T, C, A, A, A, C, G, A, C, G, G, T, C, A, A, G, G, C, T, G, A, A, T, C, T, G, T, C, A, T, T, A, T, G, C, A, A, A, C, C, A, A, C, G, A, T, C, T, T, T, C, A, A, G, G, G, C, G, T, T, C, T, A, T, T, G, A, T, G, G, G, C, G, G, C, A, A, A, G, C, A, T, C, C, G, A, G, C, G, T, G, G, A, T, G, C, A, T, A, A, G, T, A, C, A, T, T, A, T, A, A, C, T, C, G, G, T, T, G, T, C, C, G, A, T and A

| A | C | G | T |
|---|---|---|---|
| 45 | 31 | 37 | 37 |
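These counts are easy to reproduce in code. A minimal sketch in Python, shown here on just the first ten nucleotides of the sequence above (the full sequence works the same way):

```python
from collections import Counter

# Count how many times each nucleotide appears.
# For brevity we use only the first ten nucleotides of the data above.
sequence = "TCAAAGCAAT"
counts = Counter(sequence)

for nucleotide in "ACGT":
    print(nucleotide, counts[nucleotide])
```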

## Abstraction (a.k.a. generalization)

• We have a vector $$\mathbf{x}=(x_1,\ldots, x_n)$$
• $$n$$ observations,
• each one with value $$x_i\in \Omega$$
• for $$i$$ taking values between 1 and $$n$$
• $$\Omega$$ is the sample space of $$\mathbf{x}$$
• i.e. the set of all possible outcomes
• Here $$\Omega=\{'A','C','G','T'\}$$

## The order of $$\mathbf{x}$$ is irrelevant

A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, C, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T and T

Each element $$x\in\Omega$$ appears $$N(x)$$ times, given by the table

| A | C | G | T |
|---|---|---|---|
| 45 | 31 | 37 | 37 |

## Alternative formula

With $$N(x)$$ given by the table, we have the empirical frequency $$p(x)=N(x)/n$$. It contains all the information about $$\mathbf{x}$$ that we need

| A | C | G | T |
|---|---|---|---|
| 0.3 | 0.2067 | 0.2467 | 0.2467 |
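A quick way to get this table from the counts, sketched in Python:

```python
# Empirical frequencies p(x) = N(x)/n from the counts table above.
counts = {"A": 45, "C": 31, "G": 37, "T": 37}
n = sum(counts.values())                     # n = 150 observations
freq = {x: N / n for x, N in counts.items()}

print({x: round(p, 4) for x, p in freq.items()})
```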

## Introduction to Probabilities

• A platonic ideal of data
• Populations as representations of infinite sets
• Distributions as descriptions of populations

## A vector of infinite size

• Now we imagine that we have a vector of infinite size
• The set of possible outcomes $$\Omega$$ does not change
• Instead of a vector we have a population
• The size $$n$$ goes to infinity, but $$p(x)$$ remains bounded
• Instead of an infinite vector we can imagine a machine or device that creates new elements

## Distributions

• As before, all information about the population is contained in the frequency $$p(x)=N(x)/n$$
• Each population has a corresponding distribution
• Usually the distribution is a formula
• The formula depends on the mechanics of the process
• It also depends on some parameters (constants)
• The parameters depend on the specific conditions

## Probability

For our purposes $$p(x)$$ can be used as a probability

To be formally correct we should write $\Pr(X=x)=p(x)$. The argument to $$\Pr$$ is an expression which can be true or false

## Examples: Coin

• Has 2 sides: $$\Omega=\{\text{'Head'}, \text{'Tail'}\}$$
• Distribution given by $$p(\text{'Head'})$$ and $$p(\text{'Tail'})$$
• Extra rule: $p(\text{'Head'}) + p(\text{'Tail'})= 1$. This is the mechanics of the coin
• If the coin is symmetric then $p(\text{'Head'}) = p(\text{'Tail'}) = 0.5$. This parameter depends on the specific coin

## Examples: Nucleotides

• Four possibilities: A, C, G, T
• Distribution given by $$p(\text{'A'}), p(\text{'C'}), p(\text{'G'})$$ and $$p(\text{'T'})$$
• Extra rule $p(\text{'A'})+ p(\text{'C'})+ p(\text{'G'})+p(\text{'T'})= 1$
• The exact values will depend on the organism

## How to know the distribution?

• The formula depends on the mechanics of the process
• We have to understand the physics of it
• e.g. DNA has 4 nucleotides
• It also depends on some parameters
• We need experimental data to find them
• Chargaff’s rules
• GC content

## Samples

When we do any experiment we get some of the elements of the population

A sample is a subset of the population

Two questions:

• What can we say about the sample?
• What can we say about the population?

## Two dice (🎲+🎲)

We throw two dice at the same time. What will be the sum of both numbers?

X <- 🎲
Y <- 🎲
Z <- X + Y
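The experiment is easy to simulate. A sketch in Python that repeats the two-dice throw many times and estimates the distribution of $$Z$$:

```python
import random

random.seed(1)                        # reproducible throws
throws = 100_000
counts = {z: 0 for z in range(2, 13)}
for _ in range(throws):
    x = random.randint(1, 6)          # first die
    y = random.randint(1, 6)          # second die
    counts[x + y] += 1

# The estimate for Pr(Z = 7) should be close to 6/36 ≈ 0.17
prob_seven = counts[7] / throws
```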

## Prob of $$Z$$ given that $$Y=3$$

$\Pr(Z\,\vert \, Y=3)$

## Information changes probabilities

If we do not know anything about $$Y$$, the probability $$\Pr(Z)$$ is

| 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.028 | 0.056 | 0.083 | 0.11 | 0.14 | 0.17 | 0.14 | 0.11 | 0.083 | 0.056 | 0.028 |

If we know that $$Y=3$$ then $$\Pr(Z\vert Y=3)$$ is

| 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|
| 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 |
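Both tables can be derived exactly by enumerating the 36 equally likely outcomes; a sketch:

```python
from fractions import Fraction
from collections import Counter

# All 36 equally likely outcomes of two dice.
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]

# Unconditional distribution of Z = X + Y.
prob_z = {z: Fraction(c, 36)
          for z, c in Counter(x + y for x, y in outcomes).items()}

# Conditional distribution given Y = 3: only the 6 outcomes with y == 3 remain.
given = [(x, y) for x, y in outcomes if y == 3]
prob_z_given = {z: Fraction(c, len(given))
                for z, c in Counter(x + y for x, y in given).items()}
```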

## Conditional probability

The probability of an event depends on our knowledge

Information about an event of $$Z$$ may change the probabilities of $$Y$$

$$\Pr(Y\vert Z)$$ may not be the same as $$\Pr(Y)$$

That is the general case. We should not expect that two events are a priori independent

## Independence

The variables $$Y$$ and $$Z$$ are independent if knowing any event of $$Z$$ does not change the probabilities of $$Y$$: $\Pr(Y\vert Z)=\Pr(Y)$. By symmetry, knowing events about $$Y$$ does not change the probabilities for $$Z$$: $\Pr(Z\vert Y)=\Pr(Z)$. We can write $$Y\perp Z$$

## Joint probability

If two experiments $$Y$$ and $$Z$$ are performed, we can study the probability of events on $$Y$$ and events on $$Z$$

The probability for both events is then $\Pr(Y=a, Z=b) = \Pr(Y=a\vert Z=b)\cdot\Pr(Z=b)$ or, in short, $\Pr(Y, Z) = \Pr(Y\vert Z)\cdot\Pr(Z)$

## Joint probability

The probability of an event on $$Y$$ and on $$Z$$ can be seen in two parts

• The probability that the event $$Z$$ happens, and
• The probability that $$Y$$ happens, given that $$Z$$ happened

$\Pr(Y, Z) = \Pr(Y\vert Z)\cdot\Pr(Z)$

## Independence and joint prob.

The joint probability is always $\Pr(Y, Z) = \Pr(Y\vert Z)\cdot\Pr(Z)$. If $$Y$$ and $$Z$$ are independent, then $\Pr(Y\vert Z)=\Pr(Y)$. Replacing the second equation into the first, we have $\Pr(Y, Z) = \Pr(Y)\cdot\Pr(Z)\quad\text{ if }Y\perp Z$
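A small numerical check with the two dice, where $$X\perp Y$$ but $$Y\not\perp Z$$ (the `prob` helper below is just for this illustration):

```python
from fractions import Fraction

# All 36 equally likely outcomes of two dice.
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]

def prob(event):
    # Probability that a predicate on (x, y) holds, outcomes equally likely.
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

# X and Y are independent: the joint probability factorizes.
p_joint_xy = prob(lambda o: o == (1, 1))
p_factored = prob(lambda o: o[0] == 1) * prob(lambda o: o[1] == 1)

# Y and Z = X + Y are not: Pr(Z=12, Y=3) = 0, yet Pr(Z=12)·Pr(Y=3) > 0.
p_joint_zy = prob(lambda o: o[0] + o[1] == 12 and o[1] == 3)
p_product = prob(lambda o: o[0] + o[1] == 12) * prob(lambda o: o[1] == 3)
```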

## Application: diagnosis

Imagine we have a test to determine if someone has HIV.

Let’s assume that:

• There are $$10^{5}$$ persons tested
• The test is correct 99% of the time, both for infected and for uninfected people (99% sensitivity and specificity)
• The prevalence of HIV in the population is 0.1%

## Let’s fill this matrix

|       | Test- | Test+ | Total |
|-------|-------|-------|-------|
| HIV-  | .     | .     | .     |
| HIV+  | .     | .     | .     |
| Total | .     | .     | .     |

|       | Test- | Test+ | Total  |
|-------|-------|-------|--------|
| HIV-  | .     | .     | .      |
| HIV+  | .     | .     | .      |
| Total | .     | .     | 100000 |

## 0.1% of them are $$\text{HIV}_+$$

|       | Test- | Test+ | Total  |
|-------|-------|-------|--------|
| HIV-  | .     | .     | 99900  |
| HIV+  | .     | .     | 100    |
| Total | .     | .     | 100000 |

$\Pr(\text{HIV}_+)=0.001$

## 99% are correctly diagnosed

|       | Test- | Test+ | Total  |
|-------|-------|-------|--------|
| HIV-  | .     | .     | 99900  |
| HIV+  | .     | 99    | 100    |
| Total | .     | .     | 100000 |

$\Pr(\text{test}_+, \text{HIV}_+)=\Pr(\text{test}_+ \vert \text{HIV}_+)\cdot\Pr(\text{HIV}_+)$ $\Pr(\text{test}_+ \vert \text{HIV}_+)=0.99$

## 99% are correctly diagnosed

|       | Test- | Test+ | Total  |
|-------|-------|-------|--------|
| HIV-  | 98901 | .     | 99900  |
| HIV+  | .     | 99    | 100    |
| Total | .     | .     | 100000 |

$\Pr(\text{test}_-, \text{HIV}_-)=\Pr(\text{test}_- \vert \text{HIV}_-)\cdot\Pr(\text{HIV}_-)$ $\Pr(\text{test}_- \vert \text{HIV}_-)=0.99$

## 1% are misdiagnosed

|       | Test- | Test+ | Total  |
|-------|-------|-------|--------|
| HIV-  | 98901 | 999   | 99900  |
| HIV+  | 1     | 99    | 100    |
| Total | .     | .     | 100000 |

$\Pr(\text{test}_-, \text{HIV}_+)=\Pr(\text{test}_- \vert \text{HIV}_+)\cdot\Pr(\text{HIV}_+)$ $\Pr(\text{test}_- \vert \text{HIV}_+)=0.01$

## Total people diagnosed: $$\text{test}_+$$

|       | Test- | Test+ | Total  |
|-------|-------|-------|--------|
| HIV-  | 98901 | 999   | 99900  |
| HIV+  | 1     | 99    | 100    |
| Total | 98902 | 1098  | 100000 |

$\Pr(\text{test}_+)= \Pr(\text{test}_+, \text{HIV}_+)+ \Pr(\text{test}_+, \text{HIV}_-)$

What is the probability of being sick given that the test is positive?

## True positive rate

|       | Test- | Test+ | Total  |
|-------|-------|-------|--------|
| HIV-  | 98901 | 999   | 99900  |
| HIV+  | 1     | 99    | 100    |
| Total | 98902 | 1098  | 100000 |

$\Pr(\text{test}_+, \text{HIV}_+)=\Pr(\text{HIV}_+ \vert \text{test}_+)\cdot\Pr(\text{test}_+)$ so $\Pr(\text{HIV}_+ \vert \text{test}_+)=\frac{\Pr(\text{test}_+, \text{HIV}_+)}{\Pr(\text{test}_+)}=\frac{99}{1098} \approx 9\%$
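The whole table can be rebuilt with a few lines of arithmetic; a sketch using the numbers assumed in these slides:

```python
# Rebuild the contingency table from the assumptions in the slides.
total = 100_000
prevalence = 0.001        # Pr(HIV+)
accuracy = 0.99           # Pr(test+ | HIV+) = Pr(test- | HIV-)

hiv_pos = round(total * prevalence)      # 100
hiv_neg = total - hiv_pos                # 99900
true_pos = round(hiv_pos * accuracy)     # 99
false_neg = hiv_pos - true_pos           # 1
true_neg = round(hiv_neg * accuracy)     # 98901
false_pos = hiv_neg - true_neg           # 999

test_pos = true_pos + false_pos          # 1098
ppv = true_pos / test_pos                # Pr(HIV+ | test+) ≈ 0.09
```

Even with a 99% accurate test, only about 9% of the people who test positive are actually infected, because the disease is rare.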

## Bayes Theorem

“An Essay towards solving a Problem in the Doctrine of Chances” is a work on the mathematical theory of probability by the Reverend Thomas Bayes, published in 1763, two years after its author’s death

Bayes’ theorem is now widely used in science and in other fields

## Bayes rule

Since $\Pr(Y, Z) = \Pr(Y\vert Z)\cdot\Pr(Z)$ and, by symmetry $\Pr(Y, Z) = \Pr(Z\vert Y)\cdot\Pr(Y)$ then $\Pr(Y\vert Z) = \frac{\Pr(Z\vert Y)\cdot\Pr(Y)}{\Pr(Z)}$

## What does it mean

It can be understood as $\Pr(Y\vert Z) = \frac{\Pr(Z\vert Y)}{\Pr(Z)}\cdot\Pr(Y)$ which is a rule to update our opinions

• $$\Pr(Y)$$ is the a priori probability
• $$\Pr(Y\vert Z)$$ is the a posteriori probability

Bayes says how to change $$\Pr(Y)$$ when we learn $$Z$$

“When the facts change, I change my opinion. What do you do, sir?”

John Maynard Keynes (1883 – 1946), English economist, “father” of macroeconomics

## Inversion rule

Another point of view is $\Pr(Y\vert Z) = \Pr(Z\vert Y)\cdot\frac{\Pr(Y)}{\Pr(Z)}$ which is a rule to invert the conditional probability

This is the view we will use now

## Formalization

We have two variables:

• The DNA sequence: $$\mathbf{X}=(X_1,\ldots,X_m)$$
• The presence or absence of a binding site: $$B_+$$ or $$B_-$$

We do an “experiment” and get a short DNA sequence $$\mathbf{x}=(s_1,\ldots,s_m)$$

We want $$\Pr(B_+\vert \mathbf{X}=\mathbf{x})$$

## How do we get it?

We want $$\Pr(B_+\vert \mathbf{X}=\mathbf{x})$$

Applying Bayes’ theorem we have $\Pr(B_+\vert \mathbf{X}=\mathbf{x})= \frac{\Pr(\mathbf{X}=\mathbf{x}\vert B_+)\cdot\Pr(B_+)}{\Pr(\mathbf{X}=\mathbf{x})}$ so we need to find each of these terms

## What do we have

We have a matrix $$\mathbf{M}$$ with the empirical frequencies of nucleotides in $$n$$ sequences

$$\mathbf{M}$$ has 4 rows (A, C, T, G) and $$m$$ columns

$$M_{ij}=$$ number of times nucleotide $$i$$ is at position $$j$$

The sum of each column of $$\mathbf{M}$$ is $$n$$

## Translating to probabilities

We assume that these sequences are outcomes of a probabilistic process

That is, the sequences follow some probability

We don’t know the exact probability

But we can approximate $\Pr(X_j=i\vert B_+)=M_{ij}/n$ for $$i\in\{A,C,T,G\}$$

## Independence hypothesis

We also assume that the probabilities of each $$X_j$$ are independent

In that case we have $\Pr(\mathbf{X}=\mathbf{x}\vert B_+)= \Pr(X_1=s_1\vert B_+) \cdots \Pr(X_m=s_m\vert B_+)$ or, in short, $\Pr(\mathbf{X}=\mathbf{x}\vert B_+)= \prod_{j=1}^m\Pr(X_j=s_j\vert B_+)$
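Under the independence hypothesis the product is straightforward to compute. A sketch with an invented count matrix $$\mathbf{M}$$ ($$m=3$$, $$n=10$$; the numbers are made up purely for illustration):

```python
# Invented counts: M[i][j] = times nucleotide i appears at position j,
# from n = 10 hypothetical aligned binding-site sequences.
n = 10
M = {
    "A": [8, 1, 2],
    "C": [1, 7, 2],
    "G": [0, 1, 5],
    "T": [1, 1, 1],
}

def prob_given_binding(x):
    # Pr(X = x | B+) as the product of per-position frequencies M[i][j]/n.
    p = 1.0
    for j, s in enumerate(x):
        p *= M[s][j] / n
    return p
```

For example, `prob_given_binding("ACG")` gives $$0.8\cdot 0.7\cdot 0.5 \approx 0.28$$.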

## A priori distribution

Using the same hypothesis of independence, we have $\Pr(\mathbf{X}=\mathbf{x})= \Pr(X_1=s_1) \cdots \Pr(X_m=s_m)$ or, in short, $\Pr(\mathbf{X}=\mathbf{x})= \prod_{j=1}^m\Pr(X_j=s_j)$. Usually $$\Pr(X_j=i)$$ is approximated by the frequency of each nucleotide in the complete genome, $\Pr(X_j=i)=\frac{N_i}{L}$ where $$N_i$$ is the number of times nucleotide $$i$$ appears in the genome and $$L$$ is the genome length

## What is missing?

We got “good” guesses of $$\Pr(\mathbf{X}=\mathbf{x}\vert B_+)$$ and $$\Pr(\mathbf{X}=\mathbf{x})$$

We need $$\Pr(B_+)$$

How do we get it?

There is no easy answer for that

## Anyway…

Let’s say $$\Pr(B_+)=K$$ and solve for it later

Applying Bayes’ theorem we have $\Pr(B_+\vert \mathbf{X}=\mathbf{x})=\prod_{j=1}^m \frac{\Pr(X_j=s_j\vert B_+)}{\Pr(X_j=s_j)}\cdot K$

Can it be simpler?

## Logarithms…

…are made to change multiplications into sums

$\log\Pr(B_+\vert \mathbf{X}=\mathbf{x})=\sum_{j=1}^m \log\frac{\Pr(X_j=s_j\vert B_+)}{\Pr(X_j=s_j)} + \log K$

## Score

For each sequence $$\mathbf{x}$$ we calculate the score

$\mathrm{Score}(\mathbf{x}) =\sum_{j=1}^m Q_{s_j,j} =\sum_{j=1}^m\log\frac{\Pr(X_j=s_j\vert B_+)}{\Pr(X_j=s_j)}+Const$

We prepare a matrix $$\mathbf{Q}$$ for each binding site: $Q_{i,j}=\log\frac{M_{ij}}{N_i}$ Each term $$\log\frac{M_{ij}/n}{N_i/L}$$ differs from $$Q_{i,j}$$ only by $$\log(L/n)$$, which is the same at every position and is absorbed into the constant
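A sketch of building $$\mathbf{Q}$$ and scoring a sequence, with invented counts; a pseudocount of 1 is added (a common trick, not part of these slides) so that a zero count does not produce $$\log 0$$:

```python
import math

n = 10                                        # hypothetical aligned sequences
M = {"A": [8, 1, 2], "C": [1, 7, 2], "G": [0, 1, 5], "T": [1, 1, 1]}
N = {"A": 30, "C": 20, "G": 25, "T": 25}      # invented genome-wide counts

# Q[i][j] = log((M[i][j] + 1) / N[i]); the +1 pseudocount avoids log(0).
m = 3
Q = {i: [math.log((M[i][j] + 1) / N[i]) for j in range(m)] for i in "ACGT"}

def score(x):
    # Score(x) = sum over positions j of Q[x_j][j].
    return sum(Q[s][j] for j, s in enumerate(x))
```

A sequence that matches the motif should score higher than one that does not, e.g. `score("ACG") > score("TTT")`.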

## Homework

Write a program (in any computer language) to calculate the score of each position of a genome
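A minimal sketch of one possible solution in Python, assuming a precomputed score matrix `Q` (the values below are invented for illustration):

```python
# Slide a window of length m along the genome and score every position.
# Q[i][j] would come from the binding-site counts; these values are invented.
Q = {"A": [0.5, -1.0], "C": [-1.0, 0.8], "G": [-0.5, -0.5], "T": [-1.0, -1.0]}
m = 2

def scan(genome):
    # Score of the window starting at each position k.
    return [sum(Q[genome[k + j]][j] for j in range(m))
            for k in range(len(genome) - m + 1)]

scores = scan("ACGT")   # windows: "AC", "CG", "GT"
```

Positions with a high score are candidate binding sites.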