December 4, 2018

Score’s units

Last week we saw that the score of an alignment changes scale if we multiply the substitution matrix by a constant \(\lambda\) such that

\[\lambda S_{ab} = \log\frac{q_{ab}}{p_a p_b}\]

The value of \(\lambda\) depends on the base of \(\log()\)

It is like changing between meters and inches

Score’s units

The score is a sum of log of probabilities. Thus, it has dimensions of information

In practice there are two options for the logarithm, which result in two units for the score

  • If the base is 2, then the score units are bits
  • If the base is \(e\) (natural log), then the score units are nats

NCBI uses a mix: the score is shown in bits (more details later), but \(\lambda\) is calculated for the natural logarithm
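Changing units is just a change of logarithm base. A minimal sketch, with a hypothetical score value:

```python
import math

# Converting a score between nats and bits: same quantity, different log base,
# exactly like converting between meters and inches (score value is hypothetical).
score_nats = 2.5
score_bits = score_nats / math.log(2)   # 1 nat = 1/ln(2) ≈ 1.4427 bits

# Round trip recovers the original value
assert abs(score_bits * math.log(2) - score_nats) < 1e-12
```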

A little theory

“Recall that mathematics is a tool to meditate, not compute.”

Nassim Nicholas Taleb (born 1960)
Lebanese-American essayist, scholar, statistician, former trader, and risk analyst

Two ways to calculate averages

We know that for a small numeric sample \(\mathbf{x}=(x_1,\ldots,x_n)\) the average is \[\bar{\mathbf{x}} = \frac{1}{n} \sum_i x_i = \sum_x x\, \frac{N(x)}{n} = \sum_x x \, p(x)\] where \(N(x)\) is the number of times the value \(x\) is in the sample \(\mathbf{x}\)

The number \(p(x)=N(x)/n\) is the proportion of values \(x\) in the sample
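Both ways of computing the average can be checked on a small sample (the numbers are hypothetical):

```python
from collections import Counter

x = [2, 3, 3, 5, 5, 5, 2]   # hypothetical sample
n = len(x)

# First way: sum all values and divide by n
mean_direct = sum(x) / n

# Second way: weight each distinct value by its proportion p(x) = N(x)/n
counts = Counter(x)                          # N(x)
p = {v: c / n for v, c in counts.items()}    # p(x)
mean_weighted = sum(v * p[v] for v in p)

assert abs(mean_direct - mean_weighted) < 1e-12
```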

Expected Value

When we work with “infinite populations”, i.e. when we use probabilities, we can generalize the idea of average.

In probability, instead of “average”, we say expected value and we write \[\mathbb E(X) = \sum_{x} x \, \Pr(X=x)\]

As we see, the expected value depends on the probabilities

Two scenarios

We are always comparing two scenarios, with different probabilities

  • \(\mathcal H_1:\) the sequences have a common ancestor, \(q_{ab}\)
  • \(\mathcal H_0:\) the sequences are independent, \(p_a p_b\)

We want to see which scenario is more plausible, given the sequences

Example

The Turkish word for “dice” is zar

The Spanish word for “random” is azar

Do these two words derive from a common ancestor?

Or is it just a coincidence that they are similar?

Which score matrices are valid?

To be valid, the matrix \(S_{a,b}\) must have

  • at least one positive entry
    • so the alignment score can be positive
  • negative expected value under the null hypothesis
    • so random alignments tend to have negative score \[\mathbb E(S\vert\mathcal H_0)=\sum_{a,b}p_a\,p_b\,S_{a,b}<0\]

These conditions guarantee that we can find a good \(\lambda\)
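Both conditions can be verified numerically, and in the Karlin-Altschul theory the good \(\lambda\) is the unique positive solution of \(\sum_{a,b} p_a p_b\, e^{\lambda S_{a,b}} = 1\). A sketch with a toy match/mismatch matrix and uniform background (hypothetical values; BLAST computes this for real matrices):

```python
import math

# Toy matrix over the 4 DNA letters, uniform background p_a = 1/4
# (hypothetical values; real matrices like BLOSUM62 satisfy the same conditions).
p = 0.25
S = [[1.0 if a == b else -1.0 for b in range(4)] for a in range(4)]

# Condition 1: at least one positive entry
assert any(s > 0 for row in S for s in row)

# Condition 2: negative expected score under the null hypothesis
expected_score = sum(p * p * S[a][b] for a in range(4) for b in range(4))
assert expected_score < 0            # here 4/16 - 12/16 = -0.5

# Find lambda > 0 solving sum_ab p_a p_b exp(lambda * S_ab) = 1 by bisection
def f(lam):
    return sum(p * p * math.exp(lam * S[a][b])
               for a in range(4) for b in range(4)) - 1.0

lo, hi = 1e-9, 10.0                  # f(lo) < 0, f(hi) > 0
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
lam = (lo + hi) / 2                  # for this matrix, lam = ln(3)
```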

Information Gain

The value \[H= \mathbb E(\lambda S_{a,b}|\mathcal H_1) =\sum q_{a,b}\lambda S_{a,b} =\sum q_{a,b}\ln\frac{q_{a,b}}{p_a p_b} >0 \] is called the Information Gain of \(S_{a,b}\)

This is a property that characterizes the matrix \(S_{a,b}\). It can be used to compare different matrices

This is the average score per position in the correct alignment of two related sequences
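The Information Gain can be computed directly from its definition. A sketch with a toy DNA model, uniform background, and hypothetical \(\lambda S\) values:

```python
import math

# Toy model (hypothetical): uniform background p = 1/4, and lambda*S equal to
# +ln(3) for a match, -ln(3) for a mismatch, so q_ab = p_a p_b exp(lambda*S_ab).
p = 0.25
lam_S = [[math.log(3) if a == b else -math.log(3) for b in range(4)]
         for a in range(4)]
q = [[p * p * math.exp(lam_S[a][b]) for b in range(4)] for a in range(4)]

# q must be a probability distribution (this is what defines a valid lambda)
assert abs(sum(map(sum, q)) - 1.0) < 1e-12

# H = sum q * lambda*S = sum q ln(q/(p*p)): average score per aligned position
H = sum(q[a][b] * lam_S[a][b] for a in range(4) for b in range(4))
assert H > 0
```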

Higher scores support \(\mathcal H_1\) vs. \(\mathcal H_0\)

High positive scores are good

But how large is large enough?

For a few applications, such as primer design, we can define a threshold score based on theory

In most cases, we need to do the statistical analysis

What is a good alignment?

If we have two sequences \(q\) and \(s\), then a local alignment \(A\) is a pair of subsequences \(a\subset q\) and \(b\subset s\), possibly with gaps in some positions

The score of the alignment \(A\) is \[\mathrm{Score}(A)=\sum_i S_{a_i, b_i}\]

Given a threshold \(s\) we say that \(A\) is a High Score Pairing (HSP) if \[\mathrm{Score}(A)>s\]
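A minimal sketch of scoring an ungapped alignment and testing the HSP condition (toy +1/−1 scores; the sequences and threshold are hypothetical):

```python
# Score an ungapped alignment position by position: Score(A) = sum_i S[a_i][b_i]
# (toy match/mismatch scores; sequences and threshold are hypothetical).
def score(a, b, match=1, mismatch=-1):
    assert len(a) == len(b)
    return sum(match if x == y else mismatch for x, y in zip(a, b))

s_threshold = 3
A = ("GATTACA", "GACTACA")   # aligned subsequences of q and s
sc = score(*A)               # 6 matches, 1 mismatch -> 5
is_hsp = sc > s_threshold    # A is a High Scoring Pair
```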

How many High Scoring Pairs?

Under the null hypothesis the sequences \(q\) and \(s\) are random

Therefore the scores of their alignments are random variables

The same happens with the number of high scoring pairs

Let’s call \(N_s\) the number of alignments with score greater than \(s\); all of these notations mean the same thing \[N_s=\text{Size of}(\{A\text{ alignment}: \mathrm{Score}(A)>s\})\] \[N_s=\text{Cardinality}(\{A\text{ alignment}: \mathrm{Score}(A)>s\})\] \[N_s=\vert\{A\text{ alignment}: \mathrm{Score}(A)>s\}\vert\]

How many High Scoring Pairs?

Since \(N_s\) is a random variable, we can ask for its expected value \[\mathbb E(N_s)?\] This number is called the E-value of the score \(s\)

We can also ask for its probability distribution \[\Pr(N_s=x)?\] It is easier to answer the first question (the E-value) and then find out the second

Calculating the E-value

The basic idea is to decompose the E-value into two parts: \[\mathbb E(N_s)=\vert\{\text{positions where an alignment can start}\}\vert\cdot\Pr(\mathrm{Score}(A)>s)\] To simplify notation, let’s call \(\mathcal P\) the set of all positions where an alignment can start \[\mathbb E(N_s)=\vert\mathcal P\vert\cdot\Pr(\mathrm{Score}(A)>s)\]

Remember that, if we draw a matrix with one sequence on the rows and the other on the columns, then alignments are diagonals

As a first approximation, \(\vert \mathcal P\vert\approx m\,n\)

Alignments cannot overlap

Alignments are diagonals in the matrix

If there is a high scoring diagonal, then another HSP cannot start at any of the positions of that diagonal

Therefore, there is a factor \(K<1\) that limits the number of starting points

This number \(K\) depends on the scoring matrix \(S_{a,b}\)

The formula is very complicated. The computer calculates it for us

Consequences for the E-value

With this adjustment, we have \(\vert \mathcal P\vert= K mn\)

Therefore the E-value is \(\mathbb E(N_s)=Kmn\cdot\Pr(\mathrm{Score}(A)>s)\)

Even before finding the probability, we see that the E-value depends on

  • \(m,\) the size of the query sequence
  • \(n,\) the size of the database
  • \(K,\) which depends on the scoring matrix \(S_{a,b}\)

Therefore the E-value is not a property of the sequence.
It depends on the context.

Probabilities for the Score

In general the probability distribution for the score \[\Pr(\mathrm{Score}(A)=s)\] is not easy to calculate, but this \[\Pr(\mathrm{Score}(A)>s)\] is easier

Intuition

First, let’s assume that the score is 1 for a correct match and \(-\infty\) for a mismatch. Each match happens with probability \(p\)

To have \(\mathrm{Score}(A)>s\) we need to have at least \(s\) independent matches, each one with probability \(p\): \[\Pr(\mathrm{Score}(A)>s)=p^s=e^{\ln(p)\cdot s}\] We rewrite it noticing that \(-\ln(p)\) is the surprise of each match

Since \(\lambda S_{a,b}\) is the relative surprise of each pairing, it can play the role of \(-\ln(p)\), and we get \[\Pr(\mathrm{Score}(A)>s)=e^{-\lambda s}\]
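The match-run intuition can be checked with a small simulation: count length-\(s\) exact diagonal matches between random DNA sequences and compare with \(p^s\) times the number of start positions (all sizes are hypothetical example values):

```python
import random

# Monte-Carlo check of E(N_s) ≈ (#start positions) · p^s for exact word matches
# between two random DNA sequences (sizes and trial count are hypothetical).
random.seed(42)
p, s = 0.25, 4            # match probability for uniform DNA, word length
m = n = 200
trials = 20

total = 0
for _ in range(trials):
    a = [random.randrange(4) for _ in range(m)]
    b = [random.randrange(4) for _ in range(n)]
    # every diagonal start (i, j) where s consecutive positions match
    for i in range(m - s + 1):
        for j in range(n - s + 1):
            if all(a[i + k] == b[j + k] for k in range(s)):
                total += 1

observed = total / trials
expected = (m - s + 1) * (n - s + 1) * p ** s   # = 38809/256 ≈ 151.6
```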

E-value formula

The Karlin-Altschul formula for the E-value of a score \(s\) is \[\mathbb E(N_s) =\vert\mathcal P\vert\cdot\Pr(\mathrm{Score}(A)>s) =Kmn\,e^{-\lambda s} \] If we define the bit-score as \[s'=(\lambda s - \ln K)/\ln(2)\] then the formula can be rewritten as \[\mathbb E(N_s)=mn\cdot 2^{-s'}\]
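The two forms of the formula are algebraically identical, which a quick numeric check confirms (\(K\), \(\lambda\), \(m\), \(n\), and \(s\) are hypothetical example values; BLAST reports the real ones):

```python
import math

# Karlin-Altschul E-value in raw-score and bit-score form
# (K, lambda, m, n, s are hypothetical example values).
K, lam = 0.13, 0.32
m, n = 300, 10_000_000
s = 58                                            # raw score

E_raw = K * m * n * math.exp(-lam * s)            # E = K m n e^{-lambda s}
s_bits = (lam * s - math.log(K)) / math.log(2)    # bit-score s'
E_bits = m * n * 2 ** (-s_bits)                   # E = m n 2^{-s'}

# Same number, two notations
assert abs(E_raw - E_bits) < 1e-9 * E_raw
```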

Information

Conditional probability

The probability of an event depends on our knowledge

Information about an event of \(Z\) may change the probabilities of \(Y\)

\(\Pr(Y|Z)\) may not be the same as \(\Pr(Y)\)

That is the general case. We should not expect that two events are a priori independent

Independence

The variables \(Y\) and \(Z\) are independent if knowing any event of \(Z\) does not change the probabilities of \(Y\) \[\Pr(Y|Z)=\Pr(Y)\] By symmetry, knowing events about \(Y\) does not change the probabilities for \(Z\) \[\Pr(Z|Y)=\Pr(Z)\] We can write \(Y\perp Z\)

Joint probability

If two experiments \(Y\) and \(Z\) are performed, we can study the probability of events on \(Y\) and events on \(Z\)

The probability for both events is then \[\Pr(Y=a, Z=b) = \Pr(Y=a|Z=b)\cdot\Pr(Z=b)\] or in short \[\Pr(Y, Z) = \Pr(Y|Z)\cdot\Pr(Z)\]

Joint probability

The probability of an event on \(Y\) and on \(Z\) can be seen in two parts

  • The probability that the event \(Z\) happens, and
  • The probability that \(Y\) happens, given that \(Z\) happened \[\Pr(Y, Z) = \Pr(Y|Z)\cdot\Pr(Z)\]

Independence and joint prob.

The joint probability is always \[\Pr(Y, Z) = \Pr(Y|Z)\cdot\Pr(Z)\] If \(Y\) and \(Z\) are independent, then \[\Pr(Y|Z)=\Pr(Y)\] Replacing the second equation into the first, we have \[\Pr(Y, Z) = \Pr(Y)\cdot\Pr(Z)\quad\text{ if }Y\perp Z\]
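These identities can be checked with exact arithmetic on a toy pair of independent variables (a fair coin and a fair die; the experiment is hypothetical):

```python
from fractions import Fraction

# Joint probability of a fair coin Y and a fair die Z, assumed independent
# (hypothetical toy experiment).
pr_y = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
pr_z = {z: Fraction(1, 6) for z in range(1, 7)}

# Under independence, Pr(Y, Z) = Pr(Y) * Pr(Z)
joint = {(y, z): pr_y[y] * pr_z[z] for y in pr_y for z in pr_z}
assert sum(joint.values()) == 1

# Recover the conditional from the joint: Pr(Y|Z) = Pr(Y, Z) / Pr(Z)
cond = {(y, z): joint[(y, z)] / pr_z[z] for (y, z) in joint}

# Independence: Pr(Y|Z) = Pr(Y) for every event of Z
assert all(cond[(y, z)] == pr_y[y] for (y, z) in joint)
```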

Information

When two variables are not independent, i.e. \[\Pr(Y\vert X)\neq\Pr(Y)\] then they share some information.

Knowing something about \(X\) tells us something about \(Y\).

How much does it tell us?

Can we put a value on this information?

One coin

  • I toss a (fair) coin
  • I see the result
  • You do not know the result yet

I know something that you do not know: the result of a binary fair choice

If I tell you the result, I am giving you some information. We call this one bit of information.

Two coins

  • I toss two fair and independent coins
  • I see the result
  • You do not know the result yet

Now I know twice as much information as before: 2 bits

In general, tossing N coins will give N bits

One unfair coin

This time the coin is really biased, and you know it.

It always falls on Heads

I see the result, but you already know.

Therefore now the information is 0 bits

Rules for Information

The realization of a random variable \(X\) will give us some information

This information depends only on the probability distribution of \(X\)

  • It is never negative: we cannot “unlearn”
  • The information of several independent events is the sum of each information
  • N independent fair coins give N bits
  • If there is a value of \(X\) with probability 1, then the information is 0
  • Two variables with similar distribution have similar information

As a function

We will represent the information of the random variable \(X\) by the function \(H(X)\).

  • \(H(X)\geq 0\)
  • \(H(X,Y) = H(X) + H(Y)\)
  • \(H(\)fair coin\()=1\)
  • \(H(\)biased coin\()=0\)

Formula for Information

Claude Shannon defined the information of a message \(x\) as \[- \log_2 (\Pr(X=x))\] Notice that, since \(\Pr(X=x)\leq 1\), this formula is always positive or zero.

Formula for Information

The information of a random variable \(X\) is the average information over all possible messages \[H(X) = - \sum_X \Pr(X) \log_2 (\Pr(X))\] with the convention that \(0\cdot\log_2 (0) = 0\).

Notice that, since \(\Pr(X)\leq 1\), this formula is always positive or zero.

Validation of Rules

  • For a fair coin, \(\Pr(X=\) heads\()=\Pr(X=\) tails\()=1/2\). Therefore \[H(X) = -\frac{1}{2}\log_2(\frac{1}{2})-\frac{1}{2}\log_2(\frac{1}{2})\] \[H(X) = -\log_2(2^{-1}) = 1\]
  • For a biased coin, \(\Pr(X=\) heads\()=1\), \(\Pr(X=\) tails\()=0\). So \[H(X) = - 1\log_2(1) - 0 \log_2(0) = 0\]
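The formula can be implemented directly and checked against both cases:

```python
import math

# Shannon information H(X) = -sum_x Pr(x) log2(Pr(x)),
# with the convention 0 * log2(0) = 0 (terms with p = 0 are skipped).
def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

assert H([0.5, 0.5]) == 1.0    # fair coin: 1 bit
assert H([1.0, 0.0]) == 0.0    # coin that always falls heads: 0 bits
```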

Information of independent events

Following the definition, we have \[H(X,Y) = -\sum_X \sum_Y \Pr(X,Y) \log_2(\Pr(X,Y))\] If \(X\) and \(Y\) are independent, then \(\Pr(X,Y)=\Pr(X)\Pr(Y)\). So \[H(X,Y) = -\sum_X \sum_Y \Pr(X)\Pr(Y) \log_2(\Pr(X)\Pr(Y))\]

Information of independent events

The logarithm of a product is the sum of the logarithms, so \[H(X,Y) = -\sum_X \sum_Y \Pr(X)\Pr(Y) \log_2(\Pr(X)) -\sum_X \sum_Y \Pr(X)\Pr(Y) \log_2(\Pr(Y))\] The sums can be reorganized as \[H(X,Y) = -\sum_Y \Pr(Y) \sum_X \Pr(X) \log_2(\Pr(X)) -\sum_X \Pr(X)\sum_Y \Pr(Y) \log_2(\Pr(Y))\] (Just using the associative and distributive laws)

Information of independent events

Since we know that \(\sum_X \Pr(X)=\sum_Y \Pr(Y)=1\), we have \[H(X,Y) = -\sum_X \Pr(X) \log_2(\Pr(X)) - \sum_Y \Pr(Y) \log_2(\Pr(Y))\] therefore \[H(X,Y) = H(X) + H(Y)\]

It is easy to see that this result can be generalized to several independent random variables.
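A quick numeric check of the additivity, with a fair coin and a biased coin (the bias value is hypothetical):

```python
import math

# Shannon information, with terms of probability 0 skipped
def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# X: fair coin; Y: biased coin with hypothetical Pr(heads) = 0.9, independent
px = [0.5, 0.5]
py = [0.9, 0.1]
joint = [a * b for a in px for b in py]   # Pr(X,Y) = Pr(X) Pr(Y)

# H(X, Y) = H(X) + H(Y) for independent variables
assert abs(H(joint) - (H(px) + H(py))) < 1e-12
```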