# Bioinformatics

## Math is not sums, calculations, and formulae. It is pulling things apart to understand how things work

Colin Wright, juggler,
inventor of the mathematical notation of juggling

## Essential Math for Biology

In my opinion, all biologists should know something about

• Set theory
• Logic
• Probabilities
• Graphs (Networks)

The rest depends on each case

(maybe calculus and linear algebra)

# Graphs

## Graphs

A combination of nodes and edges

Nodes are the elements of any set we choose

Edges are pairs of nodes $E⊂N×N$

Nodes are also called vertices

## Mathematical Language

At least two groups of people have worked with networks: engineers and mathematicians

They use different words for the same objects

• Network: graph

• Node: vertex

• Arrow: arc

## There are two kinds of graphs

In directed graphs edges have direction

Edge (a, b) is different from edge (b, a)

In undirected graphs the edges have no direction

Edge {a, b} is the same as edge {b, a}

Directed edges are also called arcs
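The difference between the two kinds of edges can be sketched in Python. This is only an illustration: representing arcs as tuples and undirected edges as `frozenset` objects is one possible choice, not something the slides prescribe.

```python
# Directed edges (arcs) are ordered pairs: a tuple keeps the order.
arc_ab = ("a", "b")
arc_ba = ("b", "a")
directed_same = arc_ab == arc_ba        # False: (a, b) differs from (b, a)

# Undirected edges are unordered pairs: a frozenset ignores the order.
edge_ab = frozenset({"a", "b"})
edge_ba = frozenset({"b", "a"})
undirected_same = edge_ab == edge_ba    # True: {a, b} is the same as {b, a}

print(directed_same, undirected_same)
```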

## Degree of a node

The degree of a node is the number of edges connected to it

In other words, it is the number of neighboring nodes

If the graph is directed, we can also talk about

• in-degree: Number of arcs coming into a given node
• out-degree: Number of arcs going out of a given node
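As a minimal sketch, the three kinds of degree can be counted from a list of arcs. The small graph below is hypothetical and only serves to illustrate the definitions.

```python
from collections import Counter

# Hypothetical directed graph, written as a list of arcs (source, target).
arcs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

# out-degree: arcs leaving a node; in-degree: arcs arriving at a node.
out_degree = Counter(source for source, _ in arcs)
in_degree = Counter(target for _, target in arcs)

# Ignoring direction, the degree counts every arc that touches the node.
degree = Counter()
for source, target in arcs:
    degree[source] += 1
    degree[target] += 1

for node in sorted(degree):
    print(node, "in:", in_degree[node], "out:", out_degree[node], "degree:", degree[node])
```

A `Counter` returns 0 for nodes that never appear, so node `a` correctly gets in-degree 0 and node `d` gets out-degree 0.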

## Extra properties

Depending on the problem, we may add other properties to the nodes and edges. For instance

• Nodes often have names. They may also have color or size
• for a node $$i$$ we write $\text{name}(i)\quad\text{color}(i)\quad\text{size}(i)$
• Edges often have length or weight or cost or capacity
• for an edge between nodes $$i$$ and $$j$$ we write $\text{length}(i,j)\quad\text{weight}(i,j)\quad\text{cost}(i,j)\quad\text{capacity}(i,j)$
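One simple way to store such properties is a dictionary per property, indexed by node or by edge. The names and values below are made up for illustration.

```python
# Hypothetical node properties: one dictionary per property.
color = {"a": "red", "b": "blue", "c": "red"}
size = {"a": 3, "b": 5, "c": 1}

# Hypothetical edge property, for undirected edges: the key is an
# unordered pair, so length(i, j) and length(j, i) are the same entry.
length_table = {
    frozenset({"a", "b"}): 2.5,
    frozenset({"b", "c"}): 4.0,
}

def length(i, j):
    """Return length(i, j) for the undirected edge between i and j."""
    return length_table[frozenset({i, j})]

print(color["a"], size["b"], length("b", "a"))
```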

## We have seen graphs already

Binary trees are connected graphs without cycles

• The leaves have degree 1
• Internal nodes have degree 3
• The root is the only node with degree 2

In an unrooted binary tree, all nodes have either degree 1 (leaves) or degree 3 (internal nodes)

## Distance between nodes

The length of each edge $$(i,j)$$ is $$\text{len}(i,j)$$

We can calculate the distance between any pair of nodes

$D(i,u) = \begin{cases} \text{len}(i,u)&\text{if }i\text{ is neighbor of }u\\ \min_{j} (\text{len}(i,j) + D(j,u))&\text{otherwise } \end{cases}$

The minimum is taken over all neighbors $$j$$ of $$i$$

## Example

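|   | a  | b  | c  | d  | e  | f  |
|---|----|----|----|----|----|----|
| a | 0  | 13 | 11 | 15 | 21 | 22 |
| b | 13 | 0  | 2  | 6  | 12 | 13 |
| c | 11 | 2  | 0  | 4  | 10 | 11 |
| d | 15 | 6  | 4  | 0  | 6  | 7  |
| e | 21 | 12 | 10 | 6  | 0  | 13 |
| f | 22 | 13 | 11 | 7  | 13 | 0  |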
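The table can be reproduced from a small graph. The sketch below assumes five edges (a–c, b–c, c–d, d–e, d–f) whose lengths were chosen so that the distances match the table; the slides do not list the edges, so this graph is an assumption. The distances are computed with Dijkstra's algorithm, a standard method that gives the same minimum as the recursive definition when all lengths are positive.

```python
import heapq

# Assumed edge lengths (not given in the slides): a tree-shaped graph
# whose shortest-path distances match the example table.
edges = {("a", "c"): 11, ("b", "c"): 2, ("c", "d"): 4, ("d", "e"): 6, ("d", "f"): 7}

# Build an undirected adjacency list: neighbors[i][j] = len(i, j).
neighbors = {}
for (i, j), edge_len in edges.items():
    neighbors.setdefault(i, {})[j] = edge_len
    neighbors.setdefault(j, {})[i] = edge_len

def distances_from(start):
    """Shortest distance from start to every node (Dijkstra's algorithm)."""
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node]:
            continue  # stale queue entry, a shorter route was already found
        for other, edge_len in neighbors[node].items():
            candidate = d + edge_len
            if candidate < dist.get(other, float("inf")):
                dist[other] = candidate
                heapq.heappush(queue, (candidate, other))
    return dist

nodes = sorted(neighbors)
D = {i: distances_from(i) for i in nodes}
for i in nodes:
    print(i, [D[i][j] for j in nodes])
```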

## Not all distances correspond to a nice graph

Let’s change only one value

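|   | a  | b  | c  | d  | e  | f  |
|---|----|----|----|----|----|----|
| a | 0  | 13 | 9  | 15 | 21 | 22 |
| b | 13 | 0  | 2  | 6  | 12 | 13 |
| c | 9  | 2  | 0  | 4  | 10 | 11 |
| d | 15 | 6  | 4  | 0  | 6  | 7  |
| e | 21 | 12 | 10 | 6  | 0  | 13 |
| f | 22 | 13 | 11 | 7  | 13 | 0  |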

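The table is still symmetric with zeros on the diagonal, but it cannot be drawn as a nice graph: $$D(a,b)=13$$ is now larger than $$D(a,c)+D(c,b)=9+2=11$$, so the value for $$(a,b)$$ is no longer the length of the shortest route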
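One way to see why the modified table cannot be drawn nicely is to look for triples of nodes where a detour is shorter than the tabulated value. The sketch below checks every triple; the function name `triangle_violations` is only illustrative.

```python
from itertools import permutations

nodes = "abcdef"
# The modified table, with D(a, c) changed to 9.
rows = [
    [0, 13, 9, 15, 21, 22],
    [13, 0, 2, 6, 12, 13],
    [9, 2, 0, 4, 10, 11],
    [15, 6, 4, 0, 6, 7],
    [21, 12, 10, 6, 0, 13],
    [22, 13, 11, 7, 13, 0],
]
D = {(i, j): rows[r][c] for r, i in enumerate(nodes) for c, j in enumerate(nodes)}

def triangle_violations(D, nodes):
    """Triples (i, j, k) where the detour through k is shorter than D(i, j)."""
    return [
        (i, j, k)
        for i, j, k in permutations(nodes, 3)
        if D[i, j] > D[i, k] + D[k, j]
    ]

violations = triangle_violations(D, nodes)
for i, j, k in violations:
    print(f"D({i},{j}) = {D[i, j]} > D({i},{k}) + D({k},{j}) = {D[i, k] + D[k, j]}")
```

Every reported triple involves the pair $$(a,c)$$, which is the only value that was changed.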

## Why neighbor joining formula

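Let $$a$$ and $$b$$ be two siblings in a nice tree, let $$c$$ be their parent, and let $$e$$ be any other node

$\begin{aligned} D(a,b) &= D(a,c) + D(c,b)\qquad(\text{eq. }1)\\ D(a,e) &= D(a,c) + D(c,e)\qquad(\text{eq. }2)\\ D(b,e) &= D(b,c) + D(c,e)\qquad(\text{eq. }3) \end{aligned}$

Adding eq. 2 and eq. 3 and subtracting eq. 1 leaves $$2\,D(c,e)$$ on the right-hand side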

## Result

$D(c,e) =\frac{D(a,e)+D(b,e)-D(a,b)}{2}$

So if we only know the distances between leaves $$a$$, $$b$$ and $$e$$, and we add internal node $$c$$, this is how we find the distance $$D(c,e)$$

Neighbor Joining is trying to make a nice tree
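The formula can be checked numerically against the first example table. The sketch below treats $$e$$ and $$f$$ as siblings whose parent is $$d$$, and uses $$a$$ as the other node; this reading of the tree is an assumption, since the drawing is not reproduced here.

```python
def distance_to_parent(sibling1_to_other, sibling2_to_other, sibling_to_sibling):
    """D(c, e) = (D(a, e) + D(b, e) - D(a, b)) / 2 for siblings a, b with parent c."""
    return (sibling1_to_other + sibling2_to_other - sibling_to_sibling) / 2

# Values from the first example table: D(e, a) = 21, D(f, a) = 22, D(e, f) = 13.
parent_to_a = distance_to_parent(21, 22, 13)
print(parent_to_a)  # 15.0, the same as D(a, d) in that table
```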

# Probabilities

## An event is a set of outcomes

The set of all possible outcomes is often called Ω

An event 𝐴 can be seen as the set of all outcomes that make the event true

For example,

Fever={Temperature>37.5°C}

## Evaluating rational beliefs

An event will become either true or false after an experiment

For example, a die either shows a 4 or it does not

We want to give a value to our rational belief that the event will become true after the experiment

The numeric value is called Probability

## Probabilities as Areas

It is useful to think that the probability of an event is the area in the drawing

The total area of Ω is 1

Usually we do not know the shape of 𝐴

## Probabilities depend on our knowledge

Our rational beliefs depend on our knowledge

If we represent our knowledge (or hypothesis) by 𝑍, then the probability of an event 𝐴 is written as

$ℙ(A|Z)$

We read this as “the probability of event 𝐴, given that we know 𝑍”

For example, “the probability that we get a 4, given that the die is symmetrical”

## Important idea

The order is relevant

$ℙ(A|Z)≠ℙ(Z|A)$

There are two events, 𝐴 and 𝑍

The one written after | is what we assume to be true

The one written before | is what we are asking for

One we know, the other we do not

## Visually

Now outcomes are limited only to the 𝑍 region

$$ℙ(A|Z)$$ is the area of the part of 𝐴 that lies inside 𝑍, measured with respect to the area of 𝑍 instead of Ω

The shape of 𝑍 is often unknown
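When all outcomes are equally likely, "area" is simply a count of outcomes, and the idea can be sketched in a few lines. The example assumes a symmetrical six-sided die, and the events `A` and `Z` are chosen only for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # all outcomes of one roll of the die
A = {4}                      # event: "we get a 4"
Z = {2, 4, 6}                # knowledge: "the result is even"

def prob(event, given):
    """P(event | given), assuming every outcome in `given` is equally likely."""
    return Fraction(len(event & given), len(given))

p_A = prob(A, omega)         # area of A relative to the whole of omega
p_A_given_Z = prob(A, Z)     # area of A inside Z, relative to Z
p_Z_given_A = prob(Z, A)     # the other order gives a different number

print(p_A, p_A_given_Z, p_Z_given_A)
```

The last two values differ, which illustrates that the order in $$ℙ(A|Z)$$ matters.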

## Degrees of belief

If, given our knowledge 𝑍, the event 𝐵 is more plausible than the event 𝐴, then $ℙ(A|Z)≤ℙ(B|Z)$

For example, given that the die is symmetrical, the probability that we get either 4, 5 or 6 is greater than the probability that we get a 4

$ℙ(\{4\}|Z)≤ℙ(\{4,5,6\}|Z)$

## Degrees of belief

On the other hand, if we get new information, the probabilities may change

The same event 𝐴 may be more plausible under a new hypothesis 𝑌 than under the initial hypothesis 𝑍

Then $ℙ(A|Z)≤ℙ(A|Y)$

## Probability rules based on these two ideas

It has been proven that probabilities must be like this

1. A probability is a number between 0 and 1 inclusive $ℙ(A) ≥ 0\textrm{ and } ℙ(A)≤1$

2. The probability of a sure event is 1 $ℙ(\textrm{True}) = 1$

3. The probability of an impossible event is 0 $ℙ(\textrm{False}) = 0$

## Complex events

We are interested in non-trivial events, which are usually combinations of smaller events

For example, we may ask “what is the probability that, in a group of 𝑛 people, at least two people have the same birthday?”

Fortunately, any complex event can be decomposed into simpler events, combined with and, or and not connectors

Exercise: decompose the birthday event into simpler ones

## Probability of not 𝐴

If the event 𝐴 becomes more and more plausible, then the opposite event not 𝐴 becomes less and less plausible

It can be shown that we always have $ℙ(\textrm{not } A) = 1-ℙ(A)$

## Joint Probability

The probability of 𝐴 and 𝐵 happening simultaneously must be connected to the probability of each one

It can be shown that there are only two ways to calculate it

• Start with the prob. of $$A$$ and then of $$B$$ given that $$A$$ is true $ℙ(A,B)=ℙ(A)⋅ℙ(B|A)$
• Start with the prob. of $$B$$ and then of $$A$$ given that $$B$$ is true $ℙ(A,B)=ℙ(B)⋅ℙ(A|B)$

## It must be a multiplication

It can be proven that the only way to combine $$ℙ(A)$$ and $$ℙ(B|A)$$ to get $$ℙ(A,B)$$ is to multiply them.

Both ways of calculating the joint probability are valid, since $$ℙ(A,B)=ℙ(B,A).$$ The order in which we write the two events is irrelevant.
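A small numerical check can make this concrete. The sketch below assumes a symmetrical six-sided die and two illustrative events, and verifies that both orders of the product give the same joint probability.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # outcomes of a symmetrical die (assumed)
A = {2, 4, 6}                # event: "the result is even"
B = {4, 5, 6}                # event: "the result is greater than 3"

def prob(event, given=omega):
    """P(event | given), assuming every outcome in `given` is equally likely."""
    return Fraction(len(event & given), len(given))

joint = prob(A & B)              # P(A, B) counted directly
via_A = prob(A) * prob(B, A)     # P(A) * P(B | A)
via_B = prob(B) * prob(A, B)     # P(B) * P(A | B)

print(joint, via_A, via_B)
```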

# Example

## Example: diagnosis

As part of the strategy to control COVID-19, many governments carry out random sampling of the population, looking for asymptomatic cases.

Imagine that you are randomly chosen for a COVID-19 test. The test result is “positive”, that is, it says that you have the virus. You also know that the test sometimes fails, giving either a false positive or a false negative. The question is: what is the probability that you have COVID-19, given that the test said “positive”?

## Context

Let’s assume that:

• There are 100000 people tested
• The test has a precision of 99%
• The prevalence of COVID in the population is 0.1%
• The people to test are chosen randomly from the population

Since this context will be the same in all cases, we will not write it explicitly

## Let’s fill this matrix

|        | Test- | Test+ | Total |
|--------|-------|-------|-------|
| COVID- | .     | .     | .     |
| COVID+ | .     | .     | .     |
| Total  | .     | .     | .     |

We show COVID reality in the rows and test results in the columns

|        | Test- | Test+ | Total  |
|--------|-------|-------|--------|
| COVID- | .     | .     | .      |
| COVID+ | .     | .     | .      |
| Total  | .     | .     | 100000 |

We will fill this matrix in the following slides

A large population size helps us to see small values

## 0.1% of them are COVID positive

|        | Test- | Test+ | Total  |
|--------|-------|-------|--------|
| COVID- | .     | .     | 99900  |
| COVID+ | .     | .     | 100    |
| Total  | .     | .     | 100000 |

Prevalence is the percentage of the population that has COVID.
In other words, it is the probability of (COVID+)

$\begin{aligned} ℙ(\text{COVID+}) &= 0.1\% = 0.001\\ ℙ(\text{COVID-}) &= 99.9\% = 0.999 \end{aligned}$

## 99% are correctly diagnosed

|        | Test- | Test+ | Total  |
|--------|-------|-------|--------|
| COVID- | .     | .     | 99900  |
| COVID+ | .     | 99    | 100    |
| Total  | .     | .     | 100000 |

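Here, precision is the probability of a correct diagnosis. For people who have COVID, this value is usually called the sensitivity of the test

$ℙ(\text{test+} \vert \text{COVID+})=0.99$

We fill the box corresponding to (test+, COVID+)

$ℙ(\text{test+}, \text{COVID+})=ℙ(\text{test+} \vert \text{COVID+})\cdot ℙ(\text{COVID+}) = 0.99 \cdot 0.001 = 0.00099$

That is 99 of the 100000 people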

## 99% are correctly diagnosed

|        | Test- | Test+ | Total  |
|--------|-------|-------|--------|
| COVID- | 98901 | .     | 99900  |
| COVID+ | .     | 99    | 100    |
| Total  | .     | .     | 100000 |

In this case the precision for negative cases (usually called the specificity of the test) is the same

$ℙ(\text{test-} | \text{COVID-})=0.99$

We fill the box corresponding to (test-, COVID-)

$ℙ(\text{test-}, \text{COVID-})=ℙ(\text{test-} | \text{COVID-})⋅ℙ(\text{COVID-}) = 0.99 ⋅ 0.999 = 0.98901$

That is 98901 of the 100000 people

## 1% are misdiagnosed

|        | Test- | Test+ | Total  |
|--------|-------|-------|--------|
| COVID- | 98901 | 999   | 99900  |
| COVID+ | 1     | 99    | 100    |
| Total  | .     | .     | 100000 |

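A misdiagnosis is the negation of a correct diagnosis

$ℙ(\text{test-} | \text{COVID+})=1-ℙ(\text{test+} | \text{COVID+})=0.01$

We combine them in the same way as before

$ℙ(\text{test-}, \text{COVID+})=ℙ(\text{test-} | \text{COVID+})⋅ ℙ(\text{COVID+}) = 0.01 ⋅ 0.001 = 0.00001$

That is 1 person. In the same way, the box (test+, COVID-) gets $$0.01 ⋅ 0.999 = 0.00999,$$ that is 999 people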

## Total people diagnosed

|        | Test- | Test+ | Total  |
|--------|-------|-------|--------|
| COVID- | 98901 | 999   | 99900  |
| COVID+ | 1     | 99    | 100    |
| Total  | 98902 | 1098  | 100000 |

We sum and fill the empty boxes

1098 people got a positive test, but only 99 of them have COVID

$ℙ(\text{COVID+} | \text{test+})=\frac{99}{1098} = 9.02\%$
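The whole table can be rebuilt in a few lines of Python. This is a minimal sketch using the numbers stated in the context (100000 people, 0.1% prevalence, 99% precision for both positive and negative cases); the variable names are illustrative, and exact fractions are used to avoid rounding errors.

```python
from fractions import Fraction

population = 100_000
prevalence = Fraction(1, 1000)   # P(COVID+) = 0.1%
precision = Fraction(99, 100)    # P(test+ | COVID+) = P(test- | COVID-) = 99%

# Row totals: how many people have or do not have COVID.
covid_pos = population * prevalence
covid_neg = population - covid_pos

# Fill each row: correct diagnoses first, the rest are misdiagnosed.
true_pos = covid_pos * precision    # (test+, COVID+)
false_neg = covid_pos - true_pos    # (test-, COVID+)
true_neg = covid_neg * precision    # (test-, COVID-)
false_pos = covid_neg - true_neg    # (test+, COVID-)

# Column total for positive tests, then the probability we were asked for.
test_pos = true_pos + false_pos
posterior = true_pos / test_pos     # P(COVID+ | test+)

print(int(true_neg), int(false_pos), int(false_neg), int(true_pos))
print(f"P(COVID+ | test+) = {float(posterior):.2%}")
```

Changing `prevalence` shows how strongly the answer depends on how common the disease is, even when the precision of the test stays the same.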

## Summary

• The order matters: $$ℙ(A|Z)≠ℙ(Z|A)$$
• To get the probability of $$A$$ and $$B$$ together we find the probability of $$A$$ and then of $$B$$ given that $$A$$ is true $ℙ(A,B)=ℙ(A)⋅ℙ(B|A)$
• Make sure that you ask the correct question. A test can be “precise” and still give many false positives