May 16, 2018
1 Empiric
2 Theoretical
3 Simulation Based
4 Data Based
First part was about deterministic simulations
This part is not in the exam. It will be on the makeup
This is what we did after midterm
We never see the population: it is too big
We only see samples: small size n
We study the relationship between two things:
We can simulate experiments with a given probability and see what can be the frequencies of the outcomes
This helps us to design our experiment
Events are functions returning TRUE or FALSE, like
We can use simulations to estimate the probability of an event
If we did an experiment or a simulation \(n\) times, and we saw an event \(m\) times, what did we learn about the population?
We cannot know the exact probability \(p\) but we can find a range for it
\[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]
This part is very practical
This is a specific technique to solve complex optimization problems
The important part is to choose the good fitness function
We will see more later
Imagine you are the main researcher in the ecology of a lake
You need to know how many fish are there in the lake
You catch some fish, put a mark on them and return them to the lake
You catch some fish again, how many of them are marked?
Let’s call L to the number of fish in the lake
We catch a group of n and we mark them
We catch another group of n and find that m of them have mark
What can be the value of m?
We catch n fish and we mark them 1 to n
The second time we catch n, how many have a number 1 to n?
Assuming that fishing is random, our catch is
sample.int(L, size=n)
We care how many of them are less or equal to n
sum(sample.int(L, size=n) <= n)
n and LThe catch size n plays two roles here
Fist, it fixes the probability for a marked fish: n/L
Second, it is the sample size that gives m TRUE events
What is the relationship between frequency and probability?
Unless we catch all fish in the lake, we cannot know L exactly
So we have to estimate n/L based on m and n using the formula
\[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]
m <- 25 n <- 500 sigma <- sqrt( (m+2)/(n+4)^2 * (1-(m+2)/(n+4)) ) p_est <- c((m+2)/(n+4)-2*sigma, (m+2)/(n+4)+2*sigma) p_est
[1] 0.03351169 0.07363117
We are 95% sure that n/L is on this interval
We got an interval for n/L. It is easy to see that \[L = \frac{n}{p\_est}\] Thus we can estimate the number of fish as
n/p_est
[1] 14920.167 6790.603
In all our genetic algorithms we have
initial_popscore_allrankingwho_survivesnext_generationcross_overinitial_popinitial_pop <- function(N, m, values) {
pop <- list()
for(i in 1:N) {
pop[[i]] <- sample(values, size=m, replace=TRUE)
}
return(pop)
}
you have to say which values are valid
score_allscore_all <- function(pop, fitness) {
score <- rep(NA, length(pop))
for(i in 1:length(pop)) {
score[i] <- fitness(pop[[i]])
}
return(score)
}
you have to give a fitness function
who_surviveswho_survives <- function(pop, ranking, min) {
survival <- rep(NA, length(pop))
range <- max(ranking)-min(ranking)
for(i in 1:length(pop)) {
p <- (ranking[i]-min(ranking))/range
survival[i] <- sample(c(!min, min),
size=1, prob=c(p,1-p))
}
return(survival)
}
min is TRUE if you look for minimum, FALSE for maximum
next_gennext_gen <- function(pop, survival, mutation) {
parents <- pop[survival]
for(i in 1:N) {
if(!survival[i]){
j <- sample.int(length(parents), size=1)
pop[[i]] <- parents[[j]] # cloning
pop[[i]] <- mutation(pop[[i]]) # mutation
}
}
return(pop)
}
Give a proper mutation function
crossovercrossover <- function(pop, survival, mutation) {
parents <- pop[survival]
for(i in 1:N) {
if(!survival[i]) {
mom_dad <- sample.int(length(parents), size=2)
m <- length(pop[[i]]) # vector length
co <- sample.int(m-1, size=1) # cross-over place
pop[[i]] <- c(parents[[mom_dad[1]]][1:co],
parents[[mom_dad[2]]][(co+1):m])
if(sample(c(FALSE,TRUE), 1)) { # maybe mutation
pop[[i]] <- mutation(pop[[i]]) # mutation
}
}
}
return(pop)
}
We know that \(\sqrt{2}\) is irrational. Nevertheless, we want to find two integer numbers x and y such that \[\frac{x}{y}\approx \sqrt{2}\] In that case we will have \[\frac{x^2}{y^2}-2\approx 0\] Since the result cannot be 0, we will look for an approximate solution
We want to minimize the absolute value \[\left\vert\frac{x^2}{y^2}-2\right\vert \qquad x,y\in\mathbb N\]
To make it interesting, we want \(x>100\) and \(y>100\)
What would be a good fitness function?
fitness can include logic conditionsfitness <- function(v) {
x <- v[1]
y <- v[2]
too_small <- x<=100 | y<=100
return(abs(x^2/y^2-2) + too_small*1e6)
}
mutation functionAll mutation functions are similar.
This one can be adapted easily to other cases
mutation <- function(v) {
k <- sample.int(length(v), size=1) # do not change
v[k] <- v[k] + sample(-10:10, size=1) # change this
return(v) # do not change
}
m and valuesThe last thing we need to decide is the vectors’ size m and the values that each component can take.
In this case
m <- 2 values <- 101:1000
Now we are ready to proceed
N <- 1000
num_generations <- 300
best_score <- rep(NA, num_generations)
pop <- initial_pop(N, m, values)
for(i in 1:num_generations) {
score <- score_all(pop, fitness)
best_score[i] <- max(score)
ranking <- rank(score, ties.method = "random")
survival <- who_survives(pop, ranking, min=TRUE)
pop <- next_gen(pop, survival, mutation)
}
plot(best_score)
min(score)
[1] 6.007287e-06
ranking==1pop[ranking==1]
[[1]] [1] 816 577
This is a list with one vector. Let’s see if it works
x <- pop[ranking==1][[1]][1] y <- pop[ranking==1][[1]][2] x/y
[1] 1.414211
Not bad