May 18, 2018

This class is incomplete

This class is based on old material and is not completely updated.

It will be updated later.

My apologies for any confusing material.

Examples: Coin

  • Has 2 sides: ‘Head’ and ‘Tail’
  • These options are exhaustive and exclusive \[\Pr(\text{'Head'}|Z) + \Pr(\text{'Tail'}|Z)= 1\]
  • If the coin is symmetric then \[\Pr(\text{'Head'}|Z) = \Pr(\text{'Tail'}|Z) = 0.5\]

Examples: Nucleotides

  • Four possibilities: \(\text{'A'}, \text{'C'}, \text{'G'}\) and \(\text{'T'}.\)
  • These outcomes are exhaustive and exclusive \[\Pr(\text{'A'}|Z)+ \Pr(\text{'C'}|Z)+ \Pr(\text{'G'}|Z)+\Pr(\text{'T'}|Z)= 1\]
  • The values depend on the organism

How do we know the distribution?

  • The formula depends on the mechanics of the process
    • We have to understand the physics of it
    • e.g. DNA has 4 nucleotides
  • It also depends on some parameters
    • We need experimental data to find them
    • Chargaff’s rules
    • GC content (see the sketch below)
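
For example, a parameter like GC content can be estimated directly from data. A minimal sketch in R (the sequence here is a made-up toy example):

# GC content of a toy DNA sequence
dna <- strsplit("ATGCGCATTAGCGC", "")[[1]]            # vector of single letters
counts <- table(factor(dna, levels = c("A", "C", "G", "T")))
(counts["C"] + counts["G"]) / length(dna)             # fraction of G and C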

Two dice (🎲+🎲)

We throw two dice at the same time. What will be the sum of both numbers?

A <- sample(1:6, 1)   # A <- 🎲
B <- sample(1:6, 1)   # B <- 🎲
C <- A + B

Visually

[Figure: the 36 possible outcomes shown as dots; all dots have the same probability]

What is your bet?

More Information

We know somehow that \(B>3\)

Conditional probability

\[\Pr(C\,|\, (B>3)\wedge Z)\]

Prob of \(C\) given that \(B=3\)

\[\Pr(C\,|\, (B=3)\wedge Z)\]

Information changes probabilities

If \(Z\) does not say anything about \(B\), the probability \(\Pr(C\vert Z)\) is

C (sum)     2     3     4     5     6     7     8     9    10    11    12
Pr (%)    2.8   5.6   8.3    11    14    17    14    11   8.3   5.6   2.8

If we know that \(B=3\) then \(\Pr(C\vert B=3\wedge Z)\) is

C (sum)     4     5     6     7     8     9
Pr (%)     17    17    17    17    17    17
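
Both tables can be reproduced by enumerating the 36 equally likely outcomes. A minimal sketch in R:

# All 36 equally likely outcomes of two dice
outcomes <- expand.grid(A = 1:6, B = 1:6)
outcomes$C <- outcomes$A + outcomes$B

# Pr(C | Z): distribution of the sum, in percent
signif(100 * table(outcomes$C) / nrow(outcomes), 2)

# Pr(C | B = 3 and Z): keep only the outcomes where B is 3
given <- subset(outcomes, B == 3)
signif(100 * table(given$C) / nrow(given), 2)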

Probability

A probability associates each event with a number

In the case of simple events we have \[\Pr(X=x)=\Pr(x)\]

More complex events can be evaluated by decomposing them into simpler ones \[\Pr(X\text{ is purine})=\Pr(X=\text{'A'})+\Pr(X=\text{'G'})\]

Two dice, again

We throw two dice at the same time and now look at \(B\) and \(C\) together. All 36 outcomes have the same probability.

We can ask for the probability of \(C\) given that \(B=3\), and also for the probability of \(B\) given \(C\).

Conditional probability

The probability of an event depends on our knowledge

Information about \(C\) may change the probabilities of \(B\)

\(\Pr(B\vert C\wedge Z)\) may not be the same as \(\Pr(B\vert Z)\)

That is the general case: we should not expect two events to be independent a priori

Independence

The variables \(B\) and \(C\) are independent if knowing any event of \(C\) does not change the probabilities of \(B\) \[\Pr(B|C\wedge Z)=\Pr(B\vert Z)\] By symmetry, knowing events about \(B\) does not change the probabilities for \(C\) \[\Pr(C|B\wedge Z)=\Pr(C\vert Z)\] We can write \(B\perp C\)

Joint probability

If two experiments \(B\) and \(C\) are performed, we can study the probability of events on \(B\) and events on \(C\)

The probability for both events is then \[\Pr(B=a, C=b) = \Pr(B=a|C=b)\cdot\Pr(C=b)\] or in short \[\Pr(B, C) = \Pr(B|C)\cdot\Pr(C)\]

Joint probability

The probability of an event on \(B\) and on \(C\) can be seen in two parts

  • The probability that the event \(C\) happens, and
  • The probability that \(B\) happens, given that \(C\) happened \[\Pr(B, C) = \Pr(B|C)\cdot\Pr(C)\]

Independence and joint prob.

The joint probability is always \[\Pr(B, C\vert Z) = \Pr(B|C\wedge Z)\cdot\Pr(C\vert Z)\] If \(B\) and \(C\) are independent, then \[\Pr(B|C\wedge Z)=\Pr(B\vert Z)\] Replacing the second equation into the first, we have \[\Pr(B, C\vert Z) = \Pr(B\vert Z)\cdot\Pr(C\vert Z)\quad\text{ if }B\perp C\]
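
A quick numeric check with the two dice from before: the individual dice \(A\) and \(B\) are independent, but \(B\) and the sum \(C\) are not. A sketch in R:

# All 36 equally likely outcomes of two dice
outcomes <- expand.grid(A = 1:6, B = 1:6)
outcomes$C <- outcomes$A + outcomes$B

# A and B are independent: the joint probability factorizes
mean(outcomes$A == 1 & outcomes$B == 2)           # 1/36
mean(outcomes$A == 1) * mean(outcomes$B == 2)     # also 1/36

# B and C are not independent: Pr(B=3, C=12) is 0 ...
mean(outcomes$B == 3 & outcomes$C == 12)          # 0
# ... but Pr(B=3) * Pr(C=12) is not
mean(outcomes$B == 3) * mean(outcomes$C == 12)    # 1/216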

Application: diagnosis

Imagine we have a test to determine if someone has HIV.

Let’s assume that:

  • There are \(10^{5}\) persons tested
  • The test is 99% accurate: it gives the correct result for 99% of the sick and for 99% of the healthy
  • The incidence of HIV in the population is 0.1%

Let’s fill this matrix

        Test-   Test+   Total
HIV-        .       .       .
HIV+        .       .       .
Total       .       .       .

We start with the total population

        Test-   Test+   Total
HIV-        .       .       .
HIV+        .       .       .
Total       .       .  100000

0.1% of them are \(\text{HIV}_+\)

        Test-   Test+   Total
HIV-        .       .   99900
HIV+        .       .     100
Total       .       .  100000

\[\Pr(\text{HIV}_+)=0.001\]

99% of the \(\text{HIV}_+\) group are correctly diagnosed

        Test-   Test+   Total
HIV-        .       .   99900
HIV+        .      99     100
Total       .       .  100000

\[\Pr(\text{test}_+, \text{HIV}_+)=\Pr(\text{test}_+ \vert \text{HIV}_+)\cdot\Pr(\text{HIV}_+)\] \[\Pr(\text{test}_+ \vert \text{HIV}_+)=0.99\]

99% of the \(\text{HIV}_-\) group are correctly diagnosed

        Test-   Test+   Total
HIV-    98901       .   99900
HIV+        .      99     100
Total       .       .  100000

\[\Pr(\text{test}_-, \text{HIV}_-)=\Pr(\text{test}_- \vert \text{HIV}_-)\cdot\Pr(\text{HIV}_-)\] \[\Pr(\text{test}_- \vert \text{HIV}_-)=0.99\]

1% of each group are misdiagnosed

        Test-   Test+   Total
HIV-    98901     999   99900
HIV+        1      99     100
Total       .       .  100000

\[\Pr(\text{test}_-, \text{HIV}_+)=\Pr(\text{test}_- \vert \text{HIV}_+)\cdot\Pr(\text{HIV}_+)\] \[\Pr(\text{test}_- \vert \text{HIV}_+)=0.01\]

Total number of people with a positive test: \(\text{test}_+\)

        Test-   Test+   Total
HIV-    98901     999   99900
HIV+        1      99     100
Total   98902    1098  100000

\[\Pr(\text{test}_+)= \Pr(\text{test}_+, \text{HIV}_+)+ \Pr(\text{test}_+, \text{HIV}_-)\]

What is the probability of being sick given that the test is positive?

Positive predictive value

        Test-   Test+   Total
HIV-    98901     999   99900
HIV+        1      99     100
Total   98902    1098  100000

\[\Pr(\text{test}_+, \text{HIV}_+)=\Pr(\text{HIV}_+ \vert \text{test}_+)\cdot\Pr(\text{test}_+)\] \[\Pr(\text{HIV}_+ \vert \text{test}_+)=\frac{\Pr(\text{test}_+, \text{HIV}_+)}{\Pr(\text{test}_+)}=\frac{99}{1098} = 9\%\]
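
The whole table, and the final answer, can be computed from the three assumptions. A minimal sketch in R:

N          <- 1e5     # persons tested
prevalence <- 0.001   # Pr(HIV+)
accuracy   <- 0.99    # Pr(test+ | HIV+) and Pr(test- | HIV-)

hiv_pos <- N * prevalence       # 100 carriers
hiv_neg <- N - hiv_pos          # 99900 non-carriers

true_pos  <- hiv_pos * accuracy        # 99 detected
false_neg <- hiv_pos * (1 - accuracy)  # 1 missed
true_neg  <- hiv_neg * accuracy        # 98901 correctly cleared
false_pos <- hiv_neg * (1 - accuracy)  # 999 false alarms

# Pr(HIV+ | test+): the positive predictive value
true_pos / (true_pos + false_pos)      # 99 / 1098, about 9%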

Bayes Theorem

“An Essay towards solving a Problem in the Doctrine of Chances” is a work on the mathematical theory of probability by the Reverend Thomas Bayes, published in 1763, two years after its author’s death (from Wikipedia)

Since then, the use of Bayes’ theorem has spread through science and many other fields

Bayes rule

Since \[\Pr(B, C\vert Z) = \Pr(B|C\wedge Z)\cdot\Pr(C\vert Z)\] and, by symmetry \[\Pr(B, C\vert Z) = \Pr(C|B\wedge Z)\cdot\Pr(B\vert Z)\] then \[\Pr(B|C\wedge Z) = \frac{\Pr(C|B\wedge Z)\cdot\Pr(B\vert Z)}{\Pr(C\vert Z)}\]

One of the most famous theorems

What does it mean

It can be understood as \[\Pr(B|C\wedge Z) = \frac{\Pr(C|B\wedge Z)}{\Pr(C\vert Z)}\cdot\Pr(B\vert Z)\] which is a rule to update our opinions

  • \(\Pr(B\vert Z)\) is the a priori probability
  • \(\Pr(B|C\wedge Z)\) is the a posteriori probability

Bayes says how to change \(\Pr(B\vert Z)\) when we learn \(C\)

“When the facts change, I change my opinion. What do you do, sir?”

John Maynard Keynes (1883 – 1946), English economist, “father” of macroeconomics

Inversion rule

Another point of view is \[\Pr(B|C\wedge Z) = \Pr(C|B\wedge Z)\cdot\frac{\Pr(B\vert Z)}{\Pr(C\vert Z)}\] which is a rule to invert the conditional probability

This is the view we will use now

Detecting binding sites

Formalization

We have two variables:

  • The DNA sequence: \(\mathbf{X}=(X_1,\ldots,X_m)\)
  • The presence or absence of a binding site: \(B_+\) or \(B_-\)

We do an “experiment” and get a short DNA sequence \(\mathbf{x}=(s_1,\ldots,s_m)\)

We want \(\Pr(B_+|\mathbf{X}=\mathbf{x})\)

How do we get it?

We want \(\Pr(B_+|\mathbf{X}=\mathbf{x})\)

Applying Bayes’ theorem we have \[\Pr(B_+|\mathbf{X}=\mathbf{x})= \frac{\Pr(\mathbf{X}=\mathbf{x}|B_+)\cdot\Pr(B_+)}{\Pr(\mathbf{X}=\mathbf{x})}\] so we need to find each term on the right-hand side

What do we have

We have a matrix \(\mathbf{M}\) with the empirical frequencies of nucleotides in \(n\) sequences

\(\mathbf{M}\) has 4 rows (A, C, T, G) and \(m\) columns

\(M_{ij}=\) number of times nucleotide \(i\) is at position \(j\)

The sum of each column of \(\mathbf{M}\) is \(n\)

Translating to probabilities

We assume that these sequences are outcomes of a probabilistic process

That is, the sequences follow some probability

We don’t know the exact probability

But we can approximate \[\Pr(X_j=i|B_+)=M_{ij}/n\] for \(i\in\{A,C,T,G\}\)
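
As an illustration, here is how \(\mathbf{M}\) and these approximate probabilities could be computed in R from a handful of made-up aligned site sequences (the sequences are purely hypothetical):

# Hypothetical aligned binding-site sequences (n = 4, m = 6)
sites <- c("TATAAT", "TATTAT", "TAGAAT", "TACAAT")
n     <- length(sites)
chars <- do.call(rbind, strsplit(sites, ""))   # one row per sequence
nuc   <- c("A", "C", "T", "G")

# M: count of each nucleotide at each position (4 rows, m columns)
M <- apply(chars, 2, function(col) table(factor(col, levels = nuc)))

colSums(M)   # each column sums to n
M / n        # approximation of Pr(X_j = i | B+)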

Independence hypothesis

We also assume that the probabilities of each \(X_j\) are independent

In such case we have \[\Pr(\mathbf{X}=\mathbf{x}|B_+)= \Pr(X_1=s_1|B_+) \cdots \Pr(X_m=s_m|B_+)\] or, in short \[\Pr(\mathbf{X}=\mathbf{x}|B_+)= \prod_{j=1}^m\Pr(X_j=s_j|B_+)\]
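
Continuing the sketch above, the product over positions is one line in R:

# Likelihood of one observed sequence x under the independence hypothesis
x <- strsplit("TATAAT", "")[[1]]
freqs <- M / n                                   # from the sketch above
prod(freqs[cbind(match(x, nuc), seq_along(x))])  # Pr(X = x | B+)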

A priori distribution

Using the same hypothesis of independence, we have \[\Pr(\mathbf{X}=\mathbf{x})= \Pr(X_1=s_1) \cdots \Pr(X_m=s_m)\] or, in short \[\Pr(\mathbf{X}=\mathbf{x})= \prod_{j=1}^m\Pr(X_j=s_j)\] Usually \(\Pr(X_j=i)\) is approximated by the frequency of each nucleotide in the complete genome \[\Pr(X_j=i)=\frac{N_i}{L}\] where \(N_i\) is the number of times nucleotide \(i\) appears in the genome and \(L\) is the genome length

What is missing?

We got “good” guesses of \(\Pr(\mathbf{X}=\mathbf{x}|B_+)\) and \(\Pr(\mathbf{X}=\mathbf{x})\)

We need \(\Pr(B_+)\)

How do we get it?

There is no easy answer for that

Anyway…

Let’s say \(\Pr(B_+)=K\) and deal with it later

Applying Bayes’ theorem we have \[\Pr(B_+|\mathbf{X}=\mathbf{x})=\prod_{j=1}^m \frac{\Pr(X_j=s_j|B_+)}{\Pr(X_j=s_j)}\cdot K\]

Can it be simpler?

Logarithms…

…are made to change multiplications into sums

\[\log\Pr(B_+|\mathbf{X}=\mathbf{x})=\sum_{j=1}^m \log\frac{\Pr(X_j=s_j|B_+)}{\Pr(X_j=s_j)} + \log K\]

Score

For each sequence \(\mathbf{x}\) we calculate the score

\[\mathrm{Score}(\mathbf{x}) =\sum_{j=1}^m Q_{s_j,j} =\sum_{j=1}^m\log\frac{\Pr(X_j=s_j|B_+)}{\Pr(X_j=s_j)}+\text{Const}\]

We prepare a matrix \(\mathbf{Q}\) for each binding site \[Q_{i,j}=\log\frac{M_{ij}}{N_i}\] The factors \(n\) and \(L\) were dropped here: they contribute only the constant \(m\log(n/L)\), which is absorbed into \(\text{Const}\) above
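
Putting the pieces together, a minimal sketch of \(\mathbf{Q}\) and the score in R, continuing the example above (the background counts \(N_i\) are made up; note that zero counts in \(\mathbf{M}\) give \(\log 0=-\infty\), so in practice they are usually smoothed with pseudocounts):

# Hypothetical background counts N_i of each nucleotide in the genome
N <- c(A = 30000, C = 20000, T = 30000, G = 20000)

# Score matrix: Q[i, j] = log(M[i, j] / N[i])
Q <- log(M / N[nuc])

# Score of one candidate sequence
score <- function(seq) {
  s <- strsplit(seq, "")[[1]]
  sum(Q[cbind(match(s, nuc), seq_along(s))])
}
score("TATAAT")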

Homework

Write a program (in any computer language) to calculate the score of each position of a genome