Class 21: Genetic algorithms

May 9, 2018

Genome Assembly

The protagonists of the story

Genome: Only one, length G, let’s say \(10^7\)

G <- 1E7

reads: N of them, let’s say \(10^4\), each of length L

N <- 1E4
L <- 100

Reads are randomly taken from the genome

Each read has a start and end position on the genome
We take start randomly, sort it, then we calculate end

start <- sample.int(G, size=N, replace=TRUE)
start <- sort(start)
end <- start + L

Distribution of gap length

gap_size <- function(s1, e1, s2, e2) {
  return(s2-e1)
}

gaps <- rep(NA, N-1)
for(i in 1:(N-1)) {
  gaps[i] <- gap_size(start[i],   end[i],
                      start[i+1], end[i+1])
}

Let’s pack everything on one function

gap_sizes <- function(G, L, N) {
  start <- sample.int(G, N, replace=TRUE)
  start <- sort(start)
  end <- start + L
  gaps <- rep(NA, N-1)
  for(i in 1:(N-1)) {
    gaps[i] <- start[i+1] - end[i]
  }
  return(gaps)
}

Homework: can we avoid the for() loop?

Negative `gaps` are `overlaps`

Two reads connect when they overlap

We only see overlaps over a threshold T

The best T has to be big enough to guarantee that the reads do not overlap by chance

Bigger values of T reduce the probability of “overlap by chance”

If two reads overlap, they are in the same Contig

A group of contiguous sequences is called Contig

The goal of the assembly process is to find one contig.

The sequence of this contig will be the genome

Most of times we do get several contigs

Number of contigs depends on G, L, N and T

num_contigs <- function(G,L,N,T) {
  num_gaps <- sum(gap_sizes(G,L,N) > -T)
  return(num_gaps)
}

How many reads shall you pay?

num.reads <- seq(from=10, to=5E5, by=2E4)
replicas <- 1:10
NN <- rep(NA, length(num.reads)*length(replicas))
n.contigs <-  rep(NA, length(num.reads)*length(replicas))
k <- 1
for(size in num.reads) {
    for(i in replicas) {
        n.contigs[k] <- num_contigs(G, L, size, 20)
        NN[k] <- size
        k <- k+1
    }
}

The result is this

plot(n.contigs ~ NN, type="b")

Simpler version

The values do not change very much between simulations (why?)

We do not need replicas here

num.reads <- seq(from=10, to=5E5, by=2E4)
n.contigs <-  rep(NA, length(num.reads))
for(k in 1:length(num.reads)) {
    n.contigs[k] <- num_contigs(G, L, num.reads[k], 20)
}

Result again

plot(n.contigs ~ num.reads, type="b")

What if reads are longer?

L <- 600

Conclusion

Larger reads are better
More reads are better
First, the number of contigs increases
- each read is in different parts of the genome
Later, contigs start to join
With enough reads we can recover the complete genome.

Genetic Algorithms

Usage of “random” in scientific computing

We have used “random” for:

understanding the meaning of your experiments
prepare your next experiment

Now we will see a third usage: Genetic Algorithms

Genetic Algorithms are not about genetics

This is a strategy to find “optimal solutions” to complex problems

Optimal solutions means minimum or maximum

If the problem is small we can just use which.min or which.max

But if the problem is big we cannot test every combination

Genetic algorithms can solve this based on a model of evolution

Imagine mountains and valleys

There are some initial individuals there

pop<-data.frame(x=sample.int(100, 9, replace = TRUE),
                y=sample.int(100, 9, replace = TRUE))

Each one is at different elevation

id	elevation	ranking
1	5.3446109	7
2	2.5057566	3
3	2.1018444	2
4	2.5254296	4
5	0.9899444	1
6	6.1518677	9
7	3.9584632	6
8	5.5782467	8
9	2.8489728	5

Survival rate depend on elevation

We want to replace the individuals that are too uphill.
Probability of replacement is proportional to elevation

survival <- rep(NA, 9)
for(i in 1:9) {
    p <- (ranking[i]-1)/8
    survival[i] <- sample(c(FALSE,TRUE), size=1,
                          prob=c(p,1-p))
}
survival

[1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

The survivors reproduce. The child mutate

parents <- pop[survival,]
for(i in 1:9) {
    if(!survival[i]){
        j <- sample.int(nrow(parents), size=1)
        pop[i,] <- parents[j,]
        k <- sample.int(ncol(parents), size=1)
        pop[i,k] <- pop[i,k] + sample(-10:10, 1)
        pop[i,k] <- min(max(pop[i,k], 1), 100)
    }
}

min(max(...)) is just to keep it between 1 and 100

We have a new generation

All the process is repeated

We can measure how elevation changes

Idea of Genetic algorithms

We want to find the minimum of some complex “fitness” function
We start with a “population” of candidate solutions
Each “individual” encodes a solution in a “chromosome”
We evaluate “fitness” on each “individual”
Survival probability depends on fitness ranking
New individuals are copies of survivors
New individuals get small mutations.

Toy problem

Our population has only vectors with ‘0’ and ‘1’

We want to find a vector (size m) with all elements equal to 1

In this case the solution is trivial, but we will use it to learn

How do we proceed?

What is the fitness function?
How do we make an initial population?
How do we choose the survivors
How do we make the next generation?