# Bioinformatics

## Divide and conquer

One common strategy to understand a complex thing is to

Separate it into smaller parts

In other words, decomposition, usually called

Analysis

## For example, we can separate the world into

• Friends, enemies, people we don’t know

• Continents

• Countries

• Species

## An “Ancient Chinese Encyclopedia”

According to some authors, animals are classified as

• those belonging to the emperor
• those that are embalmed
• tame or trained ones
• suckling pigs
• mermaids and sirens
• those that are fabulous
• stray dogs
• those included in the present classification

• frenzied ones
• innumerable ones
• those drawn with a very fine camelhair brush
• other ones
• those that have recently broken a water pitcher
• those that from a long way off look like flies

Jorge Luis Borges (1942), “The Analytical Language of John Wilkins”, describing an encyclopedia titled “Celestial Emporium of Benevolent Knowledge”

## Making a good classification

Let’s say we have a large set of things

Maybe “all living organisms”

Typically we separate this set into two or more subsets

Each organism must be in one and only one subset

## Classification (set theory)

The set $$U$$ of all things is separated into several subsets $$K_1,…, K_n$$ called equivalence classes

$U=K_1 ∪ K_2 ∪ … ∪ K_n$

This means that every organism must belong to some equivalence class

Everything is classified

## Classification (set theory)

All equivalence classes are disjoint

$K_i ∩ K_j = ∅\quad\text{if }i≠j$

This means that every organism belongs only to one equivalence class

There is only one classification for each thing

$$x$$ is either in $$K_i$$ or $$K_j$$ but not in both
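Both conditions hold automatically when each thing receives exactly one label and we group by label. The sketch below illustrates this with a hypothetical toy set; the organisms, the `kind` labels and the `classify` helper are assumptions made for the example.

```python
from collections import defaultdict

def classify(items, key):
    """Group items into classes by the label that key() assigns to each one."""
    classes = defaultdict(set)
    for item in items:
        classes[key(item)].add(item)
    return dict(classes)

# Hypothetical toy data: each organism has exactly one label.
kind = {"dog": "animal", "cat": "animal", "oak": "plant", "yeast": "fungus"}
U = set(kind)

K = classify(U, key=kind.get)

# Union of all classes: equal to U when everything is classified.
covered = set().union(*K.values())
# Sum of class sizes: equal to len(U) when no organism is counted twice.
total = sum(len(k) for k in K.values())
```

Here `covered == U` and `total == len(U)`, so the classes cover $$U$$ and do not overlap.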

## Formal definition of partition

We have a set $$U$$ that we want to analyze

We have $$n$$ subsets of $$U,$$ each one called $$K_i$$

If $$K_1∪K_2∪…∪K_n=U,$$ we say that $$\{K_i\}$$ covers $$U$$

If $$K_i∩K_j=∅$$ whenever $$i≠j,$$ we say that $$\{K_i\}$$ is disjoint

If $$\{K_i\}$$ satisfies both conditions, we say that it is a partition

Examples:

• Continents

• Countries

• Species
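The two conditions can be checked directly. This is a minimal sketch with Python sets; the function names `covers`, `is_disjoint` and `is_partition` are chosen for this example, and the sets are toy data.

```python
from itertools import combinations

def covers(subsets, universe):
    """True if the union of all subsets equals the universe."""
    return set().union(*subsets) == set(universe)

def is_disjoint(subsets):
    """True if no two different subsets share an element."""
    return all(a.isdisjoint(b) for a, b in combinations(subsets, 2))

def is_partition(subsets, universe):
    """True if the subsets cover the universe and are pairwise disjoint."""
    return covers(subsets, universe) and is_disjoint(subsets)

U = {1, 2, 3, 4, 5, 6}
good = [{1, 2}, {3, 4}, {5, 6}]
overlapping = [{1, 2, 3}, {3, 4}, {5, 6}]   # 3 appears in two subsets
incomplete = [{1, 2}, {3, 4}]               # 5 and 6 are not classified
```

Only `good` passes `is_partition`. The usual mathematical definition also asks that no $$K_i$$ is empty; the sketch checks only the two conditions stated above.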

## Recursive partitioning

We can repeat the process again and again

Each class $$K_i$$ can be split into $$m_i$$ subsets called $$P_j$$

$K_i=P_1 ∪ P_2 ∪ … ∪ P_{m_i}$

$P_j ∩ P_k = ∅\quad\text{if }j≠k$

and so on

This recursive partitioning is called taxonomy

## Hierarchical classification

In a taxonomy each equivalence class is divided into smaller equivalence classes. For example

• Some animals are vertebrates
• Some vertebrates are mammals
• Some mammals are primates

There is a hierarchy of classes, with different levels

Classes of the same level are disjoint

A class at a lower level is a subset of exactly one class at the level above
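A hierarchy like this can be sketched as nested dictionaries, where each key is a class and its value holds the subclasses. The class names below are a hypothetical toy example, and `levels` and `path_to` are helper names chosen for this sketch.

```python
# Hypothetical toy taxonomy: key = class, value = its subclasses.
taxonomy = {
    "animals": {
        "vertebrates": {
            "mammals": {"primates": {}, "rodents": {}},
            "birds": {},
        },
        "invertebrates": {},
    }
}

def levels(tree, depth=0, out=None):
    """Collect the class names found at each depth of the hierarchy."""
    if out is None:
        out = {}
    for name, children in tree.items():
        out.setdefault(depth, []).append(name)
        levels(children, depth + 1, out)
    return out

def path_to(tree, target, path=()):
    """Return the chain of classes from the top level down to target."""
    for name, children in tree.items():
        here = path + (name,)
        if name == target:
            return here
        found = path_to(children, target, here)
        if found:
            return found
    return None
```

For instance, `path_to(taxonomy, "primates")` gives `("animals", "vertebrates", "mammals", "primates")`: each class in the chain contains the next one.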

## Example of a non-biological taxonomy

Bloom’s taxonomy

• a set of three hierarchical models used to classify educational learning objectives into levels of complexity and specificity

• cognitive

• affective

• psychomotor (sensory)

## Another example

Dewey Decimal Classification for libraries

000 – Computer science, information & general works
100 – Philosophy & psychology
200 – Religion
300 – Social sciences
400 – Language
500 – Pure Science
600 – Technology
700 – Arts & recreation
800 – Literature
900 – History & geography
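The first level of this classification is a simple rule: the hundreds digit of the number decides the class. A minimal sketch, using the labels listed above; the function name `dewey_main_class` is an assumption made for this example.

```python
DEWEY_MAIN_CLASSES = {
    0: "Computer science, information & general works",
    1: "Philosophy & psychology",
    2: "Religion",
    3: "Social sciences",
    4: "Language",
    5: "Pure Science",
    6: "Technology",
    7: "Arts & recreation",
    8: "Literature",
    9: "History & geography",
}

def dewey_main_class(number):
    """Return the main class for a Dewey number such as '599.8'."""
    hundreds = int(float(number)) // 100
    if hundreds not in DEWEY_MAIN_CLASSES:
        raise ValueError(f"not a Dewey number between 000 and 999: {number!r}")
    return DEWEY_MAIN_CLASSES[hundreds]
```

Every number from 000 to 999 falls in exactly one main class, so the ten classes form a partition of the numbers.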

## Taxonomy in Biology

• Classification of organisms
• Group together organisms sharing “the same” characteristics
• Initially based on phenotypical characteristics
• Today it also uses genotypical information
• Hierarchy of groups
• Depending on the characteristics, we get different groups
• More attributes result in more groups
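The last two points can be illustrated with a small sketch: grouping the same organisms by more attributes never gives fewer groups. The organisms and their attributes are a hypothetical, simplified toy table, and `group_by` is a helper name chosen for this example.

```python
from collections import defaultdict

# Hypothetical, simplified attribute table.
organisms = {
    "dog":   {"backbone": True,  "fur": True,  "flies": False},
    "bat":   {"backbone": True,  "fur": True,  "flies": True},
    "eagle": {"backbone": True,  "fur": False, "flies": True},
    "bee":   {"backbone": False, "fur": False, "flies": True},
    "snail": {"backbone": False, "fur": False, "flies": False},
}

def group_by(attributes):
    """Group organisms that share the same values for the given attributes."""
    groups = defaultdict(set)
    for name, traits in organisms.items():
        key = tuple(traits[a] for a in attributes)
        groups[key].add(name)
    return dict(groups)

one = group_by(["backbone"])
two = group_by(["backbone", "fur"])
three = group_by(["backbone", "fur", "flies"])
```

With this table, one attribute gives 2 groups, two attributes give 3 groups, and three attributes give 5 groups.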

## Taxonomical Hierarchy in Biology

Traditionally, the main hierarchy levels (a.k.a. ranks) are named

• domain
• kingdom
• phylum
• class
• order
• family
• genus
• species

Today there are more intermediate ranks

## Binomial nomenclature

(literally “system of two names”)

Each organism is labeled with two words: genus and species

• Genus describes what it is in general
• same root as the “genres” used to classify movies
• Species describes what is special about it

This is a good approach for any definition
“X is like Y, but with difference Z”
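A two-word name is easy to handle in a program. The sketch below splits a name into its two parts, following the usual convention that the genus is capitalized and the species word is lowercase. The names `Binomial` and `parse_binomial` are assumptions made for this example.

```python
from typing import NamedTuple

class Binomial(NamedTuple):
    genus: str
    species: str

    def abbreviated(self):
        """Short form with the genus reduced to its initial."""
        return f"{self.genus[0]}. {self.species}"

def parse_binomial(name):
    """Split a two-word name into genus and species, normalizing the case."""
    parts = name.split()
    if len(parts) != 2:
        raise ValueError(f"expected exactly two words, got {name!r}")
    genus, species = parts
    return Binomial(genus.capitalize(), species.lower())
```

For example, `parse_binomial("Escherichia coli").abbreviated()` returns `"E. coli"`.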

## Tree representation

Hierarchical classifications are often represented by trees

A tree has a root, branches, internal nodes and leaves

Edges (branches) connect nodes

Each node (except the root) has exactly one parent node

A node can have several descendants. If a node has no descendants, we call it a leaf
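Since every node except the root has exactly one parent, a tree can be stored as a table of parents. This is a minimal sketch with hypothetical node names; `children`, `leaves` and `path_to_root` are helper names chosen for this example.

```python
# Hypothetical toy tree: node -> parent (the root has no parent).
parent = {
    "animals": None,
    "vertebrates": "animals",
    "invertebrates": "animals",
    "mammals": "vertebrates",
    "birds": "vertebrates",
    "primates": "mammals",
}

def children(node):
    """Direct descendants of a node."""
    return sorted(n for n, p in parent.items() if p == node)

def leaves():
    """Nodes without descendants."""
    return sorted(n for n in parent if not children(n))

def path_to_root(node):
    """Follow parent links from a node up to the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path
```

Here `leaves()` returns `["birds", "invertebrates", "primates"]`, and `path_to_root("primates")` visits `primates`, `mammals`, `vertebrates` and `animals`.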

## Taxonomy is not phylogeny

Taxonomic trees are similar to phylogenetic trees

But “genus” is not “common ancestor”

Each node in a phylogenetic tree is a species

Moreover, an organism has more than one ancestor

## NCBI taxonomy

There is no “official” taxonomy

People are still figuring out many cases

NCBI has a taxonomy tree that is often used in practice

This tree changes over time

## NCBI taxonomy is a tree

Each node has

• a unique id called taxid
• an official scientific name (as Linnaeus designed)
• any alternative names (aliases) by which the organism is known
• The taxid of the parent
• zero or more descendants

Using NCBI taxid prevents many errors
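The node description above maps naturally to a small record. The sketch below is an illustration and not NCBI's own data format: the class `TaxNode` and the functions `lineage` and `children` are assumptions made for this example. The taxids are given as commonly listed by NCBI and should be checked against a current copy of the database. The parent links are simplified, since the real tree has more intermediate ranks. The root is assumed to be its own parent.

```python
from dataclasses import dataclass, field

@dataclass
class TaxNode:
    taxid: int
    name: str                 # official scientific name
    parent: int               # taxid of the parent node
    aliases: list = field(default_factory=list)

# Toy excerpt; intermediate ranks of the real tree are omitted.
nodes = {n.taxid: n for n in [
    TaxNode(1, "root", 1),    # assumed convention: the root is its own parent
    TaxNode(40674, "Mammalia", 1),
    TaxNode(9443, "Primates", 40674),
    TaxNode(9604, "Hominidae", 9443),
    TaxNode(9605, "Homo", 9604),
    TaxNode(9606, "Homo sapiens", 9605, ["human"]),
]}

def lineage(taxid):
    """Names from the root down to the given taxid."""
    chain = []
    while True:
        node = nodes[taxid]
        chain.append(node.name)
        if node.parent == taxid:   # reached the root
            break
        taxid = node.parent
    return chain[::-1]

def children(taxid):
    """Taxids of the direct descendants of a node."""
    return sorted(n.taxid for n in nodes.values()
                  if n.parent == taxid and n.taxid != taxid)
```

Looking up an organism by taxid instead of by name avoids problems with spelling variants and aliases: `nodes[9606]` is unambiguous, while "human", "Homo sapiens" and "H. sapiens" are three different strings.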