May 9, 2018

Genome Assembly

The protagonists of the story

  • Genome: Only one, length G, let’s say \(10^7\)
G <- 1E7
  • reads: N of them, let’s say \(10^4\), each of length L
N <- 1E4
L <- 100

Reads are randomly taken from the genome

  • Each read has a start and end position on the genome
  • We take start randomly, sort it, then we calculate end
start <- sample.int(G, size=N, replace=TRUE)
start <- sort(start)
end <- start + L

Distribution of gap length

gap_size <- function(s1, e1, s2, e2) {
  return(s2-e1)
}

gaps <- rep(NA, N-1)
for(i in 1:(N-1)) {
  gaps[i] <- gap_size(start[i],   end[i],
                      start[i+1], end[i+1])
}

Let’s pack everything on one function

gap_sizes <- function(G, L, N) {
  start <- sample.int(G, N, replace=TRUE)
  start <- sort(start)
  end <- start + L
  gaps <- rep(NA, N-1)
  for(i in 1:(N-1)) {
    gaps[i] <- start[i+1] - end[i]
  }
  return(gaps)
}

Homework: can we avoid the for() loop?

Negative gaps are overlaps

Two reads connect when they overlap

We only see overlaps over a threshold T

The best T has to be big enough to guarantee that the reads do not overlap by chance

Bigger values of T reduce the probability of “overlap by chance”

If two reads overlap, they are in the same Contig

A group of contiguous sequences is called Contig

The goal of the assembly process is to find one contig.

The sequence of this contig will be the genome

Most of times we do get several contigs

Number of contigs depends on G, L, N and T

num_contigs <- function(G,L,N,T) {
  num_gaps <- sum(gap_sizes(G,L,N) > -T)
  return(num_gaps)
}

How many reads shall you pay?

num.reads <- seq(from=10, to=5E5, by=2E4)
replicas <- 1:10
NN <- rep(NA, length(num.reads)*length(replicas))
n.contigs <-  rep(NA, length(num.reads)*length(replicas))
k <- 1
for(size in num.reads) {
    for(i in replicas) {
        n.contigs[k] <- num_contigs(G, L, size, 20)
        NN[k] <- size
        k <- k+1
    }
}

The result is this

plot(n.contigs ~ NN, type="b")

Simpler version

The values do not change very much between simulations (why?)

We do not need replicas here

num.reads <- seq(from=10, to=5E5, by=2E4)
n.contigs <-  rep(NA, length(num.reads))
for(k in 1:length(num.reads)) {
    n.contigs[k] <- num_contigs(G, L, num.reads[k], 20)
}

Result again

plot(n.contigs ~ num.reads, type="b")

What if reads are longer?

L <- 600

Conclusion

  • Larger reads are better
  • More reads are better
  • First, the number of contigs increases
    • each read is in different parts of the genome
  • Later, contigs start to join
  • With enough reads we can recover the complete genome.

Genetic Algorithms

Usage of “random” in scientific computing

We have used “random” for:

  • understanding the meaning of your experiments
  • prepare your next experiment

Now we will see a third usage: Genetic Algorithms

Genetic Algorithms are not about genetics

This is a strategy to find “optimal solutions” to complex problems

Optimal solutions means minimum or maximum

If the problem is small we can just use which.min or which.max

But if the problem is big we cannot test every combination

Genetic algorithms can solve this based on a model of evolution

Imagine mountains and valleys

There are some initial individuals there

pop<-data.frame(x=sample.int(100, 9, replace = TRUE),
                y=sample.int(100, 9, replace = TRUE))

Each one is at different elevation

id elevation ranking
1 5.3446109 7
2 2.5057566 3
3 2.1018444 2
4 2.5254296 4
5 0.9899444 1
6 6.1518677 9
7 3.9584632 6
8 5.5782467 8
9 2.8489728 5

Survival rate depend on elevation

We want to replace the individuals that are too uphill.
Probability of replacement is proportional to elevation

survival <- rep(NA, 9)
for(i in 1:9) {
    p <- (ranking[i]-1)/8
    survival[i] <- sample(c(FALSE,TRUE), size=1,
                          prob=c(p,1-p))
}
survival
[1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

The survivors reproduce. The child mutate

parents <- pop[survival,]
for(i in 1:9) {
    if(!survival[i]){
        j <- sample.int(nrow(parents), size=1)
        pop[i,] <- parents[j,]
        k <- sample.int(ncol(parents), size=1)
        pop[i,k] <- pop[i,k] + sample(-10:10, 1)
        pop[i,k] <- min(max(pop[i,k], 1), 100)
    }
}

min(max(...)) is just to keep it between 1 and 100

We have a new generation

All the process is repeated

We can measure how elevation changes

Idea of Genetic algorithms

  • We want to find the minimum of some complex “fitness” function
  • We start with a “population” of candidate solutions
  • Each “individual” encodes a solution in a “chromosome”
  • We evaluate “fitness” on each “individual”
  • Survival probability depends on fitness ranking
  • New individuals are copies of survivors
  • New individuals get small mutations.

Toy problem

Our population has only vectors with ‘0’ and ‘1’

We want to find a vector (size m) with all elements equal to 1

In this case the solution is trivial, but we will use it to learn

How do we proceed?

  • What is the fitness function?
  • How do we make an initial population?
  • How do we choose the survivors
  • How do we make the next generation?