May 16, 2018

The big picture

Usage of time

Each day has 24 hours

(Timeline of a typical day: Sleep, Wake up, Travel, Work, Lunch, Work, Travel, Home, Dinner, Your own life.)

We had better choose well what to do with our time

Our needs

according to A. Maslow

Levels of the pyramid

  • Physiological needs: air, food, water, sleep
  • Safety needs: security, order, and stability
  • Love and Belonging: family and friends
  • Esteem: feeling good about one’s achievements, being recognized
  • Cognitive: intellectual stimulation, exploration
  • Aesthetic: harmony, order and beauty
  • Self-actualization: morality, creativity, problem solving, acceptance, spontaneity
  • Transcendence: connection to “oneself, to significant others, to human beings in general, to other species, to nature, and to the cosmos”

A job just for “safety” is boring 😒☹️😩😵

Let’s do Science

Four Paradigms of Science

1 Empirical

  • Observation of isolated facts
  • Description of relationships
  • e.g. Botany, naming stars

2 Theoretical

  • Abstract models and theories
  • Usually expressed in mathematical formulas
  • Correct predictions validate the models
  • e.g. Mendel’s laws of inheritance, Newton’s law of Gravity


3 Simulation Based

  • Models that cannot be expressed in formulas
  • Formulas that cannot be solved
  • e.g. Protein structure prediction, Genetic Algorithms

4 Data Based

  • Discovering patterns hidden in data
  • Huge volumes of data
  • Complex interactions
  • e.g. Bioinformatics, Data mining

Summary of this Course

Deterministic Simulations

The first part of the course was about deterministic simulations

  • Same input gives same output
  • See the consequences of natural laws
  • See the behavior of a system

This part is not on the exam; it will be on the makeup exam

Non-deterministic Simulations

This is what we did after the midterm

  • Output can be different every time
  • There are some rules
  • We see samples
  • We want to understand populations

Key idea: Population vs. Sample

We never see the population: it is too big

We only see samples: small size n

We study the relationship between two things:

  • Probability: Proportion of each outcome in the population
  • Frequency: Proportion of each outcome in the sample

If we know the probability

We can simulate experiments with a given probability and see what frequencies of the outcomes we can expect

This helps us to design our experiment

Testing for an event

Events are functions returning TRUE or FALSE, like

  • did the die show an even number?
  • can we assemble the full genome?
  • will you pass this course?

We can use simulations to estimate the probability of an event
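
For example, a minimal sketch in R (the event and the number of repetitions are just illustrative choices):

# estimate the probability that a fair die shows an even number
n_reps <- 10000
rolls <- sample.int(6, size=n_reps, replace=TRUE)
is_even <- (rolls %% 2 == 0)        # TRUE/FALSE: the event for each roll
sum(is_even)/n_reps                 # observed frequency, close to the true probability 0.5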

If we know the frequency

If we did an experiment or a simulation \(n\) times, and we saw an event \(m\) times, what did we learn about the population?

We cannot know the exact probability \(p\) but we can find a range for it

\[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]
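
The same rule can be wrapped in a small helper (prob_interval is just an illustrative name, not something defined in the notes):

prob_interval <- function(m, n) {
    p_tilde <- (m+2)/(n+4)                     # adjusted frequency
    sigma <- sqrt(p_tilde*(1-p_tilde)/(n+4))   # its standard error
    c(p_tilde - 2*sigma, p_tilde + 2*sigma)    # the 95% interval
}

For example, prob_interval(25, 500) reproduces the interval computed in the fish example below.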

Last part: Genetic Algorithms

This part is very practical

This is a specific technique to solve complex optimization problems

The important part is to choose a good fitness function

We will see more later

Practice

How many fish are in the lake?

Imagine you are the lead researcher studying the ecology of a lake

You need to know how many fish there are in the lake

You catch some fish, put a mark on them and return them to the lake

You catch some fish again, how many of them are marked?

Let’s simulate

Let’s call L the number of fish in the lake

We catch a group of n fish and mark them

We catch another group of n and find that m of them have a mark

What can be the value of m?

Fishing

We catch n fish and mark them 1 to n

The second time we catch n fish, how many of them have a number from 1 to n?

Assuming that fishing is random, our catch is

sample.int(L, size=n)

We care about how many of them are less than or equal to n

sum(sample.int(L, size=n) <= n)
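
To see how much m varies, we can repeat the catch many times; a minimal sketch (the 1000 repetitions are an arbitrary choice):

L <- 1e4                            # assumed number of fish, for the simulation only
n <- 100                            # catch size
m_sim <- replicate(1000, sum(sample.int(L, size=n) <= n))
hist(m_sim)                         # how the number of marked fish varies between catches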

(Histograms of the simulated \(m\) for \(n=100, 200, 500\) with \(L=10^{4}\), and for \(n=500\) with \(L=10^{5}\).)

The result depends on n and L

The catch size n plays two roles here

First, it fixes the probability of catching a marked fish: n/L

Second, it is the sample size that gives m TRUE events

What is the relationship between frequency and probability?

We cannot measure probability

Unless we catch all fish in the lake, we cannot know L exactly

So we have to estimate n/L based on m and n using the formula

\[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]

Let’s put in some numbers

m <- 25     # marked fish found in the second catch
n <- 500    # size of each catch
sigma <- sqrt( (m+2)/(n+4)^2 * (1-(m+2)/(n+4)) )
p_est <- c((m+2)/(n+4)-2*sigma, (m+2)/(n+4)+2*sigma)
p_est
[1] 0.03351169 0.07363117

We are 95% sure that n/L is in this interval

So, what is the total number?

We got an interval for \(p = n/L\). It is easy to see that \[L = \frac{n}{p}\] Note that dividing by the lower bound of \(p\) gives the upper bound of \(L\), so the limits below come out in reverse order. Thus we can estimate the number of fish as

n/p_est
[1] 14920.167  6790.603

Genetic Algorithms

They are all the same

In all our genetic algorithms we have

  • initial_pop
  • score_all
  • ranking
  • who_survives
  • next_gen
  • crossover

Generic initial_pop

initial_pop <- function(N, m, values) {
  pop <- list()
  for(i in 1:N) {
    pop[[i]] <- sample(values, size=m, replace=TRUE)
  }
  return(pop)
}

You have to specify which values are valid
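
For example, a tiny call just to see the shape of the output (the numbers are arbitrary):

pop <- initial_pop(N=5, m=2, values=101:1000)   # a tiny population, just to look at it
pop[[1]]                                        # one member: two integers between 101 and 1000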

Generic score_all

score_all <- function(pop, fitness) {
    score <- rep(NA, length(pop))
    for(i in 1:length(pop)) {
        score[i] <- fitness(pop[[i]])
    }
    return(score)
}

You have to provide a fitness function
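
A quick check with a toy fitness function (toy_fitness is only for illustration):

toy_fitness <- function(v) sum(v)        # toy fitness, only for this example
pop <- initial_pop(N=5, m=2, values=101:1000)
score_all(pop, toy_fitness)              # one score per member of the population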

Generic who_survives

who_survives <- function(pop, ranking, min) {
    survival <- rep(NA, length(pop))
    range <- max(ranking)-min(ranking)
    for(i in 1:length(pop)) {
        p <- (ranking[i]-min(ranking))/range   # normalized ranking, between 0 and 1
        survival[i] <- sample(c(!min, min),    # better-ranked members survive more often
                              size=1, prob=c(p,1-p))
    }
    return(survival)
}

min is TRUE if you are looking for a minimum, FALSE for a maximum
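
Continuing the toy example above, a sketch of how the pieces fit together:

score <- score_all(pop, toy_fitness)
ranking <- rank(score, ties.method="random")   # rank 1 = smallest score
who_survives(pop, ranking, min=TRUE)           # TRUE means the member stays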

Generic next_gen

next_gen <- function(pop, survival, mutation) {
    parents <- pop[survival]
    for(i in 1:length(pop)) {
        if(!survival[i]){
            j <- sample.int(length(parents), size=1)
            pop[[i]] <- parents[[j]] # cloning
            pop[[i]] <- mutation(pop[[i]]) # mutation
        }
    }
    return(pop)
}

Give a proper mutation function
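
Continuing the same toy example, with a throw-away mutation function (the real one for the practice problem comes later):

toy_mutation <- function(v) {
    v + sample(-1:1, size=1)                   # toy mutation: shift the whole vector a little
}
survival <- who_survives(pop, ranking, min=TRUE)
pop <- next_gen(pop, survival, toy_mutation)   # dead members are replaced by mutated clones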

Generic crossover

crossover <- function(pop, survival, mutation) {
    parents <- pop[survival]
    for(i in 1:length(pop)) {
        if(!survival[i]) {
            mom_dad <- sample.int(length(parents), size=2)
            m <- length(pop[[i]])          # vector length
            co <- sample.int(m-1, size=1)  # cross-over place
            pop[[i]] <- c(parents[[mom_dad[1]]][1:co],
                          parents[[mom_dad[2]]][(co+1):m])
            if(sample(c(FALSE,TRUE), 1)) { # maybe mutation
                pop[[i]] <- mutation(pop[[i]]) # mutation
            }
        }
    }
    return(pop)
}
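
To see what the cut-and-paste inside the loop does, a hand-made example with two small parent vectors (the values are arbitrary):

mom <- c(1, 2, 3, 4)
dad <- c(5, 6, 7, 8)
co <- 2                                 # cross-over place
c(mom[1:co], dad[(co+1):4])             # the child: 1 2 7 8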

Practice

We know that \(\sqrt{2}\) is irrational. Nevertheless, we want to find two integer numbers x and y such that \[\frac{x}{y}\approx \sqrt{2}\] In that case we will have \[\frac{x^2}{y^2}-2\approx 0\] Since the result cannot be 0, we will look for an approximate solution

This can be solved with Genetic Algorithms

We want to minimize the absolute value \[\left\vert\frac{x^2}{y^2}-2\right\vert \qquad x,y\in\mathbb N\]

To make it interesting, we want \(x>100\) and \(y>100\)

What would be a good fitness function?

The fitness function can include logical conditions

fitness <- function(v) {
    x <- v[1]
    y <- v[2]
    too_small <- x<=100 | y<=100             # is the constraint violated?
    return(abs(x^2/y^2-2) + too_small*1e6)   # big penalty when it is
}
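
For instance, two illustrative calls (the first pair is the solution found at the end of this section, the second violates the constraint):

fitness(c(816, 577))    # about 6e-06: a very good pair
fitness(c(50, 200))     # about 1e6: penalized because x <= 100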

We need a mutation function

All mutation functions are similar.

This one can be adapted easily to other cases

mutation <- function(v) {
    k <- sample.int(length(v), size=1) # do not change
    v[k] <- v[k] + sample(-10:10, size=1) # change this
    return(v) # do not change
}
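
For example (the output is random, so your result will differ):

mutation(c(816, 577))   # one of the two coordinates is shifted by something in -10..10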

Choose m and values

The last thing we need to decide is the vectors’ size m and the values that each component can take.

In this case

m <- 2
values <- 101:1000

Now we are ready to proceed

Running the genetic algorithm

N <- 1000
num_generations <- 300
best_score <- rep(NA, num_generations)
pop <- initial_pop(N, m, values)
for(i in 1:num_generations) {
    score <- score_all(pop, fitness)
    best_score[i] <- min(score)    # we are minimizing, so the best score is the smallest
    ranking <- rank(score, ties.method = "random")
    survival <- who_survives(pop, ranking, min=TRUE)
    pop <- next_gen(pop, survival, mutation)
}

Results

plot(best_score)

min(score)
[1] 6.007287e-06

The best solution has ranking==1

pop[ranking==1]
[[1]]
[1] 816 577

This is a list with one vector. Let’s see if it works

x <- pop[ranking==1][[1]][1]
y <- pop[ranking==1][[1]][2]
x/y
[1] 1.414211

Not bad