May 9, 2018

- Genome: Only one, length
`G`

, let’s say \(10^7\)

G <- 1E7

- reads:
`N`

of them, let’s say \(10^4\), each of length`L`

N <- 1E4 L <- 100

- Each read has a
`start`

and`end`

position on the genome - We take
`start`

randomly, sort it, then we calculate`end`

start <- sample.int(G, size=N, replace=TRUE) start <- sort(start) end <- start + L

gap_size <- function(s1, e1, s2, e2) { return(s2-e1) } gaps <- rep(NA, N-1) for(i in 1:(N-1)) { gaps[i] <- gap_size(start[i], end[i], start[i+1], end[i+1]) }

gap_sizes <- function(G, L, N) { start <- sample.int(G, N, replace=TRUE) start <- sort(start) end <- start + L gaps <- rep(NA, N-1) for(i in 1:(N-1)) { gaps[i] <- start[i+1] - end[i] } return(gaps) }

Homework: can we avoid the `for()`

loop?

`gaps`

are `overlaps`

We only see overlaps over a threshold `T`

The best `T`

has to be big enough to guarantee that the reads do not overlap by chance

Bigger values of `T`

reduce the probability of “overlap by chance”

A group of contiguous sequences is called **Contig**

The goal of the assembly process is to find **one** contig.

The sequence of this contig will be the genome

Most of times we do get several contigs

num_contigs <- function(G,L,N,T) { num_gaps <- sum(gap_sizes(G,L,N) > -T) return(num_gaps) }

num.reads <- seq(from=10, to=5E5, by=2E4) replicas <- 1:10 NN <- rep(NA, length(num.reads)*length(replicas)) n.contigs <- rep(NA, length(num.reads)*length(replicas)) k <- 1 for(size in num.reads) { for(i in replicas) { n.contigs[k] <- num_contigs(G, L, size, 20) NN[k] <- size k <- k+1 } }

plot(n.contigs ~ NN, type="b")

The values do not change very much between simulations (why?)

We do not need replicas here

num.reads <- seq(from=10, to=5E5, by=2E4) n.contigs <- rep(NA, length(num.reads)) for(k in 1:length(num.reads)) { n.contigs[k] <- num_contigs(G, L, num.reads[k], 20) }

plot(n.contigs ~ num.reads, type="b")

L <- 600

- Larger reads are better
- More reads are better
- First, the number of contigs increases
- each read is in different parts of the genome

- Later, contigs start to join
- With enough reads we can recover the complete genome.

We have used “random” for:

- understanding the meaning of your experiments
- prepare your next experiment

Now we will see a third usage: *Genetic Algorithms*

This is a strategy to find “optimal solutions” to complex problems

*Optimal solutions* means **minimum** or **maximum**

If the problem is small we can just use `which.min`

or `which.max`

But if the problem is big we cannot test **every** combination

**Genetic algorithms** can solve this based on a model of evolution

pop<-data.frame(x=sample.int(100, 9, replace = TRUE), y=sample.int(100, 9, replace = TRUE))

id | elevation | ranking |
---|---|---|

1 | 5.3446109 | 7 |

2 | 2.5057566 | 3 |

3 | 2.1018444 | 2 |

4 | 2.5254296 | 4 |

5 | 0.9899444 | 1 |

6 | 6.1518677 | 9 |

7 | 3.9584632 | 6 |

8 | 5.5782467 | 8 |

9 | 2.8489728 | 5 |

We want to replace the individuals that are too uphill.

Probability of replacement is proportional to elevation

survival <- rep(NA, 9) for(i in 1:9) { p <- (ranking[i]-1)/8 survival[i] <- sample(c(FALSE,TRUE), size=1, prob=c(p,1-p)) } survival

[1] FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE

parents <- pop[survival,] for(i in 1:9) { if(!survival[i]){ j <- sample.int(nrow(parents), size=1) pop[i,] <- parents[j,] k <- sample.int(ncol(parents), size=1) pop[i,k] <- pop[i,k] + sample(-10:10, 1) pop[i,k] <- min(max(pop[i,k], 1), 100) } }

`min(max(...))`

is just to keep it between 1 and 100

- We want to find the minimum of some complex “fitness” function
- We start with a “population” of candidate solutions
- Each “individual” encodes a solution in a “chromosome”
- We evaluate “fitness” on each “individual”
- Survival probability depends on fitness ranking
- New individuals are copies of survivors
- New individuals get small mutations.

Our population has only vectors with ‘0’ and ‘1’

We want to find a vector (size `m`

) with all elements equal to 1

In this case the solution is *trivial*, but we will use it to learn

- What is the fitness function?
- How do we make an initial population?
- How do we choose the survivors
- How do we make the next generation?