May 16, 2018

The big picture

Usage of time

Each day has 24 hours

(Timeline of a typical day: Sleep, Wake up, Travel, Work, Lunch, Work, Travel, Home, Dinner, Your own life.)

We had better choose well what to do with our time

Our needs

according to A. Maslow

Levels of the pyramid

  • Physiological needs: air, food, water, sleep
  • Safety needs: security, order, and stability
  • Love and Belonging: family and friends
  • Esteem: feeling good about one’s achievements, being recognized
  • Cognitive: intellectual stimulation, exploration
  • Aesthetic: harmony, order and beauty
  • Self-actualization: morality, creativity, problem solving, acceptance, spontaneity
  • Transcendence: connection to “oneself, to significant others, to human beings in general, to other species, to nature, and to the cosmos”

A job just for “safety” is boring 😒☹️😩😵

Let’s do Science

Four Paradigms of Science

1 Empirical

  • Observation of isolated facts
  • Description of relationships
  • e.g. Botany, naming stars

2 Theoretical

  • Abstract models and theories
  • Usually expressed in mathematical formulas
  • Correct predictions validate the models
  • e.g. Mendel’s laws of inheritance, Newton’s law of Gravity


3 Simulation Based

  • Models that cannot be expressed in formulas
  • Formulas that cannot be solved
  • e.g. Protein structure prediction, Genetic Algorithms

4 Data Based

  • Discovering patterns hidden in data
  • Huge volumes of data
  • Complex interactions
  • e.g. Bioinformatics, Data mining

Summary of this Course

Deterministic Simulations

The first part of the course was about deterministic simulations

  • Same input gives same output
  • See the consequences of natural laws
  • See the behavior of a system

This part is not on the exam; it will be on the makeup exam

Non-deterministic Simulations

This is what we did after the midterm

  • Output can be different every time
  • There are some rules
  • We see samples
  • We want to understand populations

Key idea: Population vs. Sample

We never see the population: it is too big

We only see samples: small size n

We study the relationship between two things:

  • Probability: Proportion of each outcome in the population
  • Frequency: Proportion of each outcome in the sample

If we know the probability

We can simulate experiments with a given probability and see what frequencies of the outcomes we can expect

This helps us to design our experiment

Testing for an event

Events are functions returning TRUE or FALSE, like

  • did the die show an even number?
  • can we assemble the full genome?
  • will you pass this course?

We can use simulations to estimate the probability of an event
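
For example, a minimal sketch in R (the event and the number of repetitions are just illustrative choices):

# estimate the probability that a fair die shows an even number
n_reps <- 10000
rolls <- sample.int(6, size=n_reps, replace=TRUE)
is_even <- (rolls %% 2 == 0)        # TRUE/FALSE: the event for each roll
sum(is_even)/n_reps                 # observed frequency, close to the true probability 0.5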

If we know the frequency

If we did an experiment or a simulation \(n\) times, and we saw an event \(m\) times, what did we learn about the population?

We cannot know the exact probability \(p\) but we can find a range for it

\[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]
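
The same rule can be wrapped in a small helper (prob_interval is just an illustrative name, not something defined in the notes):

prob_interval <- function(m, n) {
    p_tilde <- (m+2)/(n+4)                     # adjusted frequency
    sigma <- sqrt(p_tilde*(1-p_tilde)/(n+4))   # its standard error
    c(p_tilde - 2*sigma, p_tilde + 2*sigma)    # the 95% interval
}

For example, prob_interval(25, 500) reproduces the interval computed in the fish example below.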

Last part: Genetic Algorithms

This part is very practical

This is a specific technique to solve complex optimization problems

The important part is to choose a good fitness function

We will see more later

Practice

How many fish are in the lake?

Imagine you are the lead researcher studying the ecology of a lake

You need to know how many fish there are in the lake

You catch some fish, put a mark on them and return them to the lake

You catch some fish again, how many of them are marked?

Let’s simulate

Let’s call L the number of fish in the lake

We catch a group of n fish and mark them

We catch another group of n and find that m of them have a mark

What can be the value of m?

Fishing

We catch n fish and mark them 1 to n

The second time we catch n fish, how many of them have a number from 1 to n?

Assuming that fishing is random, our catch is

sample.int(L, size=n)

We care about how many of them are less than or equal to n

sum(sample.int(L, size=n) <= n)
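
To see how much m varies, we can repeat the catch many times; a minimal sketch (the 1000 repetitions are an arbitrary choice):

L <- 1e4                            # assumed number of fish, for the simulation only
n <- 100                            # catch size
m_sim <- replicate(1000, sum(sample.int(L, size=n) <= n))
hist(m_sim)                         # how the number of marked fish varies between catches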

(Histograms of the simulated \(m\) for \(n=100, 200, 500\) with \(L=10^{4}\), and for \(n=500\) with \(L=10^{5}\).)

The result depends on n and L

The catch size n plays two roles here

First, it fixes the probability of catching a marked fish: n/L

Second, it is the sample size that gives m TRUE events

What is the relationship between frequency and probability?

We cannot measure probability

Unless we catch all fish in the lake, we cannot know L exactly

So we have to estimate n/L based on m and n using the formula

\[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]

Let’s put in some numbers

m <- 25     # marked fish found in the second catch
n <- 500    # size of each catch
sigma <- sqrt( (m+2)/(n+4)^2 * (1-(m+2)/(n+4)) )
p_est <- c((m+2)/(n+4)-2*sigma, (m+2)/(n+4)+2*sigma)
p_est
[1] 0.03351169 0.07363117

We are 95% sure that n/L is in this interval

So, what is the total number?

We got an interval for \(p = n/L\). It is easy to see that \[L = \frac{n}{p}\] Note that dividing by the lower bound of \(p\) gives the upper bound of \(L\), so the limits below come out in reverse order. Thus we can estimate the number of fish as

n/p_est
[1] 14920.167  6790.603

Genetic Algorithms

They are all the same

In all our genetic algorithms we have

  • initial_pop
  • score_all
  • ranking
  • who_survives
  • next_gen
  • crossover

Generic initial_pop

initial_pop <- function(N, m, values) {
  pop <- list()
  for(i in 1:N) {
    pop[[i]] <- sample(values, size=m, replace=TRUE)
  }
  return(pop)
}

You have to specify which values are valid
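
For example, a tiny call just to see the shape of the output (the numbers are arbitrary):

pop <- initial_pop(N=5, m=2, values=101:1000)   # a tiny population, just to look at it
pop[[1]]                                        # one member: two integers between 101 and 1000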

Generic score_all

score_all <- function(pop, fitness) {
    score <- rep(NA, length(pop))
    for(i in 1:length(pop)) {
        score[i] <- fitness(pop[[i]])
    }
    return(score)
}

You have to provide a fitness function
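
A quick check with a toy fitness function (toy_fitness is only for illustration):

toy_fitness <- function(v) sum(v)        # toy fitness, only for this example
pop <- initial_pop(N=5, m=2, values=101:1000)
score_all(pop, toy_fitness)              # one score per member of the population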

Generic who_survives

who_survives <- function(pop, ranking, min) {
    survival <- rep(NA, length(pop))
    range <- max(ranking)-min(ranking)
    for(i in 1:length(pop)) {
        p <- (ranking[i]-min(ranking))/range   # normalized ranking, between 0 and 1
        survival[i] <- sample(c(!min, min),    # better-ranked members survive more often
                              size=1, prob=c(p,1-p))
    }
    return(survival)
}

min is TRUE if you are looking for a minimum, FALSE for a maximum
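
Continuing the toy example above, a sketch of how the pieces fit together:

score <- score_all(pop, toy_fitness)
ranking <- rank(score, ties.method="random")   # rank 1 = smallest score
who_survives(pop, ranking, min=TRUE)           # TRUE means the member stays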

Generic next_gen

next_gen <- function(pop, survival, mutation) {
    parents <- pop[survival]
    for(i in 1:length(pop)) {
        if(!survival[i]){
            j <- sample.int(length(parents), size=1)
            pop[[i]] <- parents[[j]] # cloning
            pop[[i]] <- mutation(pop[[i]]) # mutation
        }
    }
    return(pop)
}

Give a proper mutation function
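
Continuing the same toy example, with a throw-away mutation function (the real one for the practice problem comes later):

toy_mutation <- function(v) {
    v + sample(-1:1, size=1)                   # toy mutation: shift the whole vector a little
}
survival <- who_survives(pop, ranking, min=TRUE)
pop <- next_gen(pop, survival, toy_mutation)   # dead members are replaced by mutated clones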

Generic crossover

crossover <- function(pop, survival, mutation) {
    parents <- pop[survival]
    for(i in 1:length(pop)) {
        if(!survival[i]) {
            mom_dad <- sample.int(length(parents), size=2)
            m <- length(pop[[i]])          # vector length
            co <- sample.int(m-1, size=1)  # cross-over place
            pop[[i]] <- c(parents[[mom_dad[1]]][1:co],
                          parents[[mom_dad[2]]][(co+1):m])
            if(sample(c(FALSE,TRUE), 1)) { # maybe mutation
                pop[[i]] <- mutation(pop[[i]]) # mutation
            }
        }
    }
    return(pop)
}
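
To see what the cut-and-paste inside the loop does, a hand-made example with two small parent vectors (the values are arbitrary):

mom <- c(1, 2, 3, 4)
dad <- c(5, 6, 7, 8)
co <- 2                                 # cross-over place
c(mom[1:co], dad[(co+1):4])             # the child: 1 2 7 8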

Practice

We know that \(\sqrt{2}\) is irrational. Nevertheless, we want to find two integer numbers x and y such that \[\frac{x}{y}\approx \sqrt{2}\] In that case we will have \[\frac{x^2}{y^2}-2\approx 0\] Since the result cannot be 0, we will look for an approximate solution

This can be solved with Genetic Algorithms

We want to minimize the absolute value \[\left\vert\frac{x^2}{y^2}-2\right\vert \qquad x,y\in\mathbb N\]

To make it interesting, we want \(x>100\) and \(y>100\)

What would be a good fitness function?

The fitness function can include logical conditions

fitness <- function(v) {
    x <- v[1]
    y <- v[2]
    too_small <- x<=100 | y<=100             # is the constraint violated?
    return(abs(x^2/y^2-2) + too_small*1e6)   # big penalty when it is
}
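
For instance, two illustrative calls (the first pair is the solution found at the end of this section, the second violates the constraint):

fitness(c(816, 577))    # about 6e-06: a very good pair
fitness(c(50, 200))     # about 1e6: penalized because x <= 100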

We need a mutation function

All mutation functions are similar.

This one can be adapted easily to other cases

mutation <- function(v) {
    k <- sample.int(length(v), size=1) # do not change
    v[k] <- v[k] + sample(-10:10, size=1) # change this
    return(v) # do not change
}
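
For example (the output is random, so your result will differ):

mutation(c(816, 577))   # one of the two coordinates is shifted by something in -10..10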

Choose m and values

The last thing we need to decide is the vectors’ size m and the values that each component can take.

In this case

m <- 2
values <- 101:1000

Now we are ready to proceed

Running the genetic algorithm

N <- 1000
num_generations <- 300
best_score <- rep(NA, num_generations)
pop <- initial_pop(N, m, values)
for(i in 1:num_generations) {
    score <- score_all(pop, fitness)
    best_score[i] <- min(score)    # we are minimizing, so the best score is the smallest
    ranking <- rank(score, ties.method = "random")
    survival <- who_survives(pop, ranking, min=TRUE)
    pop <- next_gen(pop, survival, mutation)
}

Results

plot(best_score)

min(score)
[1] 6.007287e-06

The best solution has ranking==1

pop[ranking==1]
[[1]]
[1] 816 577

This is a list with one vector. Let’s see if it works

x <- pop[ranking==1][[1]][1]
y <- pop[ranking==1][[1]][2]
x/y
[1] 1.414211

Not bad