May 9, 2018

## The protagonists of the story

• Genome: Only one, length G, let’s say $$10^7$$
G <- 1E7
• reads: N of them, let’s say $$10^4$$, each of length L
N <- 1E4
L <- 100

## Reads are randomly taken from the genome

• Each read has a start and end position on the genome
• We take start randomly, sort it, then we calculate end
start <- sample.int(G, size=N, replace=TRUE)
start <- sort(start)
end <- start + L

## Distribution of gap length

gap_size <- function(s1, e1, s2, e2) {
return(s2-e1)
}

gaps <- rep(NA, N-1)
for(i in 1:(N-1)) {
gaps[i] <- gap_size(start[i],   end[i],
start[i+1], end[i+1])
}

## Let’s pack everything on one function

gap_sizes <- function(G, L, N) {
start <- sample.int(G, N, replace=TRUE)
start <- sort(start)
end <- start + L
gaps <- rep(NA, N-1)
for(i in 1:(N-1)) {
gaps[i] <- start[i+1] - end[i]
}
return(gaps)
}

Homework: can we avoid the for() loop?

## Two reads connect when they overlap

We only see overlaps over a threshold T

The best T has to be big enough to guarantee that the reads do not overlap by chance

Bigger values of T reduce the probability of “overlap by chance”

## If two reads overlap, they are in the same Contig

A group of contiguous sequences is called Contig

The goal of the assembly process is to find one contig.

The sequence of this contig will be the genome

Most of times we do get several contigs

## Number of contigs depends on G, L, N and T

num_contigs <- function(G,L,N,T) {
num_gaps <- sum(gap_sizes(G,L,N) > -T)
return(num_gaps)
}

## How many reads shall you pay?

num.reads <- seq(from=10, to=5E5, by=2E4)
replicas <- 1:10
k <- 1
for(i in replicas) {
n.contigs[k] <- num_contigs(G, L, size, 20)
NN[k] <- size
k <- k+1
}
}

## The result is this

plot(n.contigs ~ NN, type="b")

## Simpler version

The values do not change very much between simulations (why?)

We do not need replicas here

num.reads <- seq(from=10, to=5E5, by=2E4)
n.contigs[k] <- num_contigs(G, L, num.reads[k], 20)
}

## Result again

plot(n.contigs ~ num.reads, type="b")

## What if reads are longer?

L <- 600

## Conclusion

• First, the number of contigs increases
• each read is in different parts of the genome
• Later, contigs start to join
• With enough reads we can recover the complete genome.

## Usage of “random” in scientific computing

We have used “random” for:

• understanding the meaning of your experiments

Now we will see a third usage: Genetic Algorithms

## Genetic Algorithms are not about genetics

This is a strategy to find “optimal solutions” to complex problems

Optimal solutions means minimum or maximum

If the problem is small we can just use which.min or which.max

But if the problem is big we cannot test every combination

Genetic algorithms can solve this based on a model of evolution

## There are some initial individuals there

pop<-data.frame(x=sample.int(100, 9, replace = TRUE),
y=sample.int(100, 9, replace = TRUE))

## Each one is at different elevation

id elevation ranking
1 5.3446109 7
2 2.5057566 3
3 2.1018444 2
4 2.5254296 4
5 0.9899444 1
6 6.1518677 9
7 3.9584632 6
8 5.5782467 8
9 2.8489728 5

## Survival rate depend on elevation

We want to replace the individuals that are too uphill.
Probability of replacement is proportional to elevation

survival <- rep(NA, 9)
for(i in 1:9) {
p <- (ranking[i]-1)/8
survival[i] <- sample(c(FALSE,TRUE), size=1,
prob=c(p,1-p))
}
survival
[1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

## The survivors reproduce. The child mutate

parents <- pop[survival,]
for(i in 1:9) {
if(!survival[i]){
j <- sample.int(nrow(parents), size=1)
pop[i,] <- parents[j,]
k <- sample.int(ncol(parents), size=1)
pop[i,k] <- pop[i,k] + sample(-10:10, 1)
pop[i,k] <- min(max(pop[i,k], 1), 100)
}
}

min(max(...)) is just to keep it between 1 and 100

## Idea of Genetic algorithms

• We want to find the minimum of some complex “fitness” function
• Each “individual” encodes a solution in a “chromosome”
• We evaluate “fitness” on each “individual”
• Survival probability depends on fitness ranking
• New individuals are copies of survivors
• New individuals get small mutations.

## Toy problem

Our population has only vectors with ‘0’ and ‘1’

We want to find a vector (size m) with all elements equal to 1

In this case the solution is trivial, but we will use it to learn

## How do we proceed?

• What is the fitness function?
• How do we make an initial population?
• How do we choose the survivors
• How do we make the next generation?