May 11, 2018

Genetic Algorithms are not about genetics

This is a strategy to find “optimal solutions” to complex problems

An optimal solution means a minimum or a maximum

If the problem is small we can just use which.min or which.max
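
For example, with a small vector we can evaluate every candidate and just ask for the position of the best one (a trivial illustration):

x <- c(3, 7, 2, 5)
which.max(x)   # 2: the largest value is in position 2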

But if the problem is big we cannot test every combination

Genetic algorithms can solve this based on a model of evolution

Idea of Genetic algorithms

  • We want to find the minimum of some complex “fitness” function
  • We start with a “population” of candidate solutions
  • Each “individual” encodes a solution in a “chromosome”
  • We evaluate the “fitness” of each “individual”
  • Survival probability depends on fitness ranking
  • New individuals are copies of survivors
  • New individuals get small mutations

Toy problem

Our population contains only vectors with ‘0’ and ‘1’ elements

We want to find a vector with all elements equal to 1

All vectors have length m, so there are \(2^m\) combinations

In this case the solution is trivial, but we will use it to learn

How do we proceed?

  • What is the fitness function?
  • How do we make an initial population?
  • How do we choose the survivors?
  • How do we make the next generation?

Fitness function

We want to maximize the number of 1 elements

The fitness function for a vector v is

fitness <- function(v) {
    return(sum(v))
}
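
A quick check (using the definition above):

fitness(c(1, 0, 1, 1))   # 3: three elements are equal to 1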

Initial population

Let’s start with N individuals

initial_pop <- function(N, m) {
    pop <- list()
    for(i in 1:N) {
        pop[[i]] <- sample( c(0,1),
                            size=m,
                            replace=TRUE)
    }
    return(pop)
}
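
A small usage sketch (the vectors will differ on each run, because sample() is random):

pop <- initial_pop(N = 5, m = 8)
pop[[1]]   # one random vector of 0s and 1s, of length 8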

Score all individuals

We apply the fitness function to each vector in the pop list

score_all <- function(pop) {
    score <- rep(NA, length(pop))
    for(i in 1:length(pop)) {
      score[i] <- fitness(pop[[i]])
    }
    return(score)
}
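
By the way, the same loop can be written in one line with sapply; this is just an equivalent alternative, not required for the rest of the slides:

score <- sapply(pop, fitness)   # same result as score_all(pop)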

Ranking individuals

This is the easy step

ranking <- rank(score, ties.method = "random")

Lower scores get lower rankings

The value of score is not important, only the order
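
For example, with a small score vector (ties.method = "random" breaks ties in a random order, so tied scores get distinct ranks):

score <- c(5, 2, 5, 9)
rank(score, ties.method = "random")   # e.g. 3 1 2 4 (the two 5s get ranks 2 and 3 in random order)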

Choose who survives

who_survives <- function(pop, ranking) {
 survival <- rep(NA, length(pop))
 range <- max(ranking)-min(ranking)
 for(i in 1:length(pop)) {
  # high rank has high probability
  p <- (ranking[i]-min(ranking))/range
  survival[i] <- sample(c(TRUE, FALSE),
                    size=1, prob=c(p,1-p))
 }
 return(survival)
}

Be careful with the probabilities

Here higher rankings are “better” than lower rankings, so survival probability increases with ranking

In other cases the “better” individuals are the ones with lower ranking, so the probability of being replaced increases with ranking

You have to be careful about the order of the TRUE and FALSE values when sampling who survives

In fact I made this mistake when I presented the class. These slides are corrected
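
As a quick check of the formula used in who_survives, with four individuals the survival probabilities span the whole range (a small illustration, not part of the original code):

ranking <- c(1, 2, 3, 4)
(ranking - min(ranking)) / (max(ranking) - min(ranking))
# 0.0000000 0.3333333 0.6666667 1.0000000
# the top-ranked individual always survives, the bottom-ranked never does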

Prepare the next generation

next_gen <- function(pop, survival) {
    parents <- pop[survival]
    for(i in seq_along(pop)) {
      if(!survival[i]){
        j <- sample.int(length(parents), size=1)
        pop[[i]] <- parents[[j]] # cloning
        k <- sample.int(length(pop[[i]]), size=1)
        pop[[i]][k] <- 1-pop[[i]][k] # mutation
      }
    }
    return(pop)
}

All together

score <- score_all(pop)
ranking <- rank(score, ties.method = "random")
survival <- who_survives(pop, ranking)
pop <- next_gen(pop, survival)

Running several generations

N <- 100
m <- 60
num_generations <- 300
best_score <- mean_score <- rep(NA, num_generations)
pop <- initial_pop(N, m)
for(i in 1:num_generations) {
    score <- score_all(pop)
    best_score[i] <- max(score)
    mean_score[i] <- mean(score)
    ranking <- rank(score, ties.method = "random")
    survival <- who_survives(pop, ranking)
    pop <- next_gen(pop, survival)
}

Plot
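
A minimal sketch of how the evolution of the scores can be plotted from best_score and mean_score (base graphics; the styling is only a suggestion):

plot(best_score, type = "l", ylim = c(0, m),
     xlab = "generation", ylab = "score")
lines(mean_score, lty = 2)
legend("bottomright", legend = c("best", "mean"), lty = c(1, 2))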

Advanced example

Finding the best protein

In this example we will find a “protein” that minimizes some “energy” function

You can imagine that you want to bind the protein to a particular molecule

In this case I will give you the fitness function: you have to minimize the energy

Energy function

prot_energy <- function(b) {
    z <- "01O0EO04O12O05O13O0FO03O14O01O16O09O0FO01O1"
    z <- paste0(z,"2O01O16O05O0EO01O04O15O01O12O14O05")
    z <- strtoi(strsplit(z,"O")[[1]],16)
    z <- z - sapply(toupper(b), utf8ToInt)
    return(sum(abs(z+64)))
}

Don’t worry about the meaning. Use it as a black box.
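
We can still evaluate it like any other function. For example (assuming a random protein of 26 letters, as in the rest of this example; the exact value is irrelevant, lower is better):

b <- sample(LETTERS, size = 26, replace = TRUE)
prot_energy(b)   # a non-negative number; lower energy is better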

Create population

To make life easier, we will use all the English letters as “amino-acids”. In other words, we will use LETTERS

For this example we will only use proteins of size 26

There are \(26^{26} = 6.1561196\times 10^{36}\) combinations

Testing 1E6 combinations every second, we would need \(1.9507566\times 10^{17}\) million years to test them all
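
These numbers come from a direct calculation, shown here only to make the arithmetic explicit:

26^26                                 # about 6.156e36 candidate proteins
26^26 / 1e6 / (60*60*24*365) / 1e6    # about 1.95e17 million years at 1E6 tests per second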

Making it faster

So far we have only imitated bacterial evolution: mutation

In nature there is also another strategy: crossover

Crossover

crossover <- function(pop, survival) {
 parents <- pop[survival]
 for(i in seq_along(pop)) {
  if(!survival[i]) {
   mom_dad <- sample.int(length(parents), size=2)
   mom <- mom_dad[1]
   dad <- mom_dad[2]
   m <- length(pop[[i]]) # vector length
   co <- sample.int(m-1, size=1) # cross-over place
   pop[[i]] <- c(parents[[mom]][1:co],parents[[dad]][(co+1):m])
   if(sample(c(FALSE,TRUE), 1)) { # maybe mutation
    mutation_site <- sample.int(m, size=1)
    pop[[i]][mutation_site] <- sample(LETTERS, 1)
   }
  }
 }
 return(pop)
}

Crossover detail

# individual `i` will be replaced
mom_dad <- sample.int(length(parents), size=2)
mom <- mom_dad[1]
dad <- mom_dad[2]
co <- sample.int(m-1, size=1) # cross-over place
pop[[i]] <- c(parents[[mom]][1:co],
              parents[[dad]][(co+1):m])
if(sample(c(FALSE,TRUE), 1)) {
  mutation_site <- sample.int(m, size=1)
  pop[[i]][mutation_site] <- sample(LETTERS, 1)
}
# rest of the function

Initial population

Similar to the previous one, except that we choose LETTERS instead of 0 or 1

initial_pop <- function(N, m) {
  pop <- list()
  for(i in 1:N) {
    pop[[i]] <- sample(LETTERS, size=m, replace=TRUE)
  }
  return(pop)
}

Scoring all individuals

We only replace fitness by the function we want to minimize: prot_energy

score_all <- function(pop) {
    score <- rep(NA, length(pop))
    for(i in 1:length(pop)) {
        score[i] <- prot_energy(pop[[i]])
    }
    return(score)
}

Who survives

Now the individuals with small ranking have to survive, so we write c(FALSE, TRUE) instead of c(TRUE, FALSE)

who_survives <- function(pop, ranking) {
    survival <- rep(NA, length(pop))
    range <- max(ranking)-min(ranking)
    for(i in 1:length(pop)) {
        p <- (ranking[i]-min(ranking))/range
        survival[i] <- sample(c(FALSE, TRUE),
                              size=1, prob=c(p,1-p))
    }
    return(survival)
}

Results
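
To reproduce this part, the whole protein run can be assembled exactly like the toy example, with crossover instead of next_gen and min() instead of max() for the best score. This is a sketch; the population size and number of generations below are only illustrative:

N <- 100
m <- 26
num_generations <- 300
best_score <- mean_score <- rep(NA, num_generations)
pop <- initial_pop(N, m)
for(i in 1:num_generations) {
    score <- score_all(pop)
    best_score[i] <- min(score)   # lower energy is better
    mean_score[i] <- mean(score)
    ranking <- rank(score, ties.method = "random")
    survival <- who_survives(pop, ranking)
    pop <- crossover(pop, survival)
}
plot(best_score, type = "l", xlab = "generation", ylab = "energy")
lines(mean_score, lty = 2)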

In conclusion

Crossover speeds up the optimization process

Optimal solutions are found faster

The population “adapts” faster