May 16, 2018

## Our needs

### according to A. Maslow ## Levels of the pyramid

• Physiological needs: air, food, water, sleep
• Safety Needs: Security, Order, and Stability
• Love and Belonging: family and friends
• Esteem: comfortable about success, be recognized
• Cognitive: intellectual stimulation, exploration
• Aesthetic: harmony, order and beauty
• Self-realization: morality, creativity, problem solving, acceptance, spontaneity
• Transcendence: connection to “oneself, to significant others, to human beings in general, to other species, to nature, and to the cosmos”

## Let’s do Science 1 Empiric

• Observation of isolated facts
• Description of relationships
• e.g. Botany, naming stars

2 Theoretical

• Abstract models and theories
• Usually expressed in mathematical formulas
• Correct predictions validate the models
• e.g. Mendel’s laws of inheritance, Newton’s law of Gravity

3 Simulation Based

• Models that cannot be expressed in formulas
• Formulas that cannot be solved
• e.g. Protein structure prediction, Genetic Algorithms

4 Data Based

• Discovering patterns hidden in data
• Huge volumes of data
• Complex interactions
• e.g. Bioinformatics, Data mining

## Deterministic Simulations

First part was about deterministic simulations

• Same input gives same output
• See the consequences of natural laws
• See the behavior of a system

This part is not in the exam. It will be on the makeup

## Non-deterministic Simulations

This is what we did after midterm

• Output can be different every time
• There are some rules
• We see samples
• We want to understand populations

## Key idea: Population v/s Sample

We never see the population: it is too big

We only see samples: small size n

We study the relationship between two things:

• Probability: Proportion of each outcome on the population
• Frequency: Proportion of each outcome on the sample

## If we know the probability

We can simulate experiments with a given probability and see what can be the frequencies of the outcomes

This helps us to design our experiment

## Testing for an event

Events are functions returning TRUE or FALSE, like

• is the dice an even number?
• can we assemble the full genome?
• will you pass this course?

We can use simulations to estimate the probability of an event

## If we know the frequency

If we did an experiment or a simulation $$n$$ times, and we saw an event $$m$$ times, what did we learn about the population?

We cannot know the exact probability $$p$$ but we can find a range for it

$\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}$

## Last part: Genetic Algorithms

This part is very practical

This is a specific technique to solve complex optimization problems

The important part is to choose the good fitness function

We will see more later

## How many fish are in the lake?

Imagine you are the main researcher in the ecology of a lake

You need to know how many fish are there in the lake

You catch some fish, put a mark on them and return them to the lake

You catch some fish again, how many of them are marked?

## Let’s simulate

Let’s call L to the number of fish in the lake

We catch a group of n and we mark them

We catch another group of n and find that m of them have mark

What can be the value of m?

## Fishing

We catch n fish and we mark them 1 to n

The second time we catch n, how many have a number 1 to n?

Assuming that fishing is random, our catch is

sample.int(L, size=n)

We care how many of them are less or equal to n

sum(sample.int(L, size=n) <= n)

## Result for $$n=100$$ and $$L=10^{4}$$ ## Result for $$n=200$$ and $$L=10^{4}$$ ## Result for $$n=500$$ and $$L=10^{4}$$ ## Result for $$n=500$$ and $$L=10^{5}$$ ## The result depends on n and L

The catch size n plays two roles here

Fist, it fixes the probability for a marked fish: n/L

Second, it is the sample size that gives m TRUE events

What is the relationship between frequency and probability?

## We cannot measure probability

Unless we catch all fish in the lake, we cannot know L exactly

So we have to estimate n/L based on m and n using the formula

$\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}$

## Lets put some numbers

m <- 25
n <- 500
sigma <- sqrt( (m+2)/(n+4)^2 * (1-(m+2)/(n+4)) )
p_est <- c((m+2)/(n+4)-2*sigma, (m+2)/(n+4)+2*sigma)
p_est
 0.03351169 0.07363117

We are 95% sure that n/L is on this interval

## So, what is the total number?

We got an interval for n/L. It is easy to see that $L = \frac{n}{p\_est}$ Thus we can estimate the number of fish as

n/p_est
 14920.167  6790.603

## They all are the same

In all our genetic algorithms we have

• initial_pop
• score_all
• ranking
• who_survives
• next_generation
• cross_over

## Generic initial_pop

initial_pop <- function(N, m, values) {
pop <- list()
for(i in 1:N) {
pop[[i]] <- sample(values, size=m, replace=TRUE)
}
return(pop)
}

you have to say which values are valid

## Generic score_all

score_all <- function(pop, fitness) {
score <- rep(NA, length(pop))
for(i in 1:length(pop)) {
score[i] <- fitness(pop[[i]])
}
return(score)
}

you have to give a fitness function

## Generic who_survives

who_survives <- function(pop, ranking, min) {
survival <- rep(NA, length(pop))
range <- max(ranking)-min(ranking)
for(i in 1:length(pop)) {
p <- (ranking[i]-min(ranking))/range
survival[i] <- sample(c(!min, min),
size=1, prob=c(p,1-p))
}
return(survival)
}

min is TRUE if you look for minimum, FALSE for maximum

## Generic next_gen

next_gen <- function(pop, survival, mutation) {
parents <- pop[survival]
for(i in 1:N) {
if(!survival[i]){
j <- sample.int(length(parents), size=1)
pop[[i]] <- parents[[j]] # cloning
pop[[i]] <- mutation(pop[[i]]) # mutation
}
}
return(pop)
}

Give a proper mutation function

## Generic crossover

crossover <- function(pop, survival, mutation) {
parents <- pop[survival]
for(i in 1:N) {
if(!survival[i]) {
m <- length(pop[[i]]) # vector length
co <- sample.int(m-1, size=1) # cross-over place
if(sample(c(FALSE,TRUE), 1)) { # maybe mutation
pop[[i]] <- mutation(pop[[i]]) # mutation
}
}
}
return(pop)
}

## Practice

We know that $$\sqrt{2}$$ is irrational. Nevertheless, we want to find two integer numbers x and y such that $\frac{x}{y}\approx \sqrt{2}$ In that case we will have $\frac{x^2}{y^2}-2\approx 0$ Since the result cannot be 0, we will look for an approximate solution

## This can be solved with Genetic Algorithms

We want to minimize the absolute value $\left\vert\frac{x^2}{y^2}-2\right\vert \qquad x,y\in\mathbb N$

To make it interesting, we want $$x>100$$ and $$y>100$$

What would be a good fitness function?

## fitness can include logic conditions

fitness <- function(v) {
x <- v
y <- v
too_small <- x<=100 | y<=100
return(abs(x^2/y^2-2) + too_small*1e6)
}

## We need a mutation function

All mutation functions are similar.

This one can be adapted easily to other cases

mutation <- function(v) {
k <- sample.int(length(v), size=1) # do not change
v[k] <- v[k] + sample(-10:10, size=1) # change this
return(v) # do not change
}

## Choose m and values

The last thing we need to decide is the vectors’ size m and the values that each component can take.

In this case

m <- 2
values <- 101:1000

Now we are ready to proceed

## Running the genetic algorithm

N <- 1000
num_generations <- 300
best_score <- rep(NA, num_generations)
pop <- initial_pop(N, m, values)
for(i in 1:num_generations) {
score <- score_all(pop, fitness)
best_score[i] <- max(score)
ranking <- rank(score, ties.method = "random")
survival <- who_survives(pop, ranking, min=TRUE)
pop <- next_gen(pop, survival, mutation)
}

## Results

plot(best_score) min(score)
 6.007287e-06

## The best solution has ranking==1

pop[ranking==1]
[]
 816 577

This is a list with one vector. Let’s see if it works

x <- pop[ranking==1][]
y <- pop[ranking==1][]
x/y
 1.414211