May 16, 2018

**Physiological needs:** air, food, water, sleep

**Safety needs:** security, order, and stability

**Love and belonging:** family and friends

**Esteem:** comfortable about success, be recognized

**Cognitive:** intellectual stimulation, exploration

**Aesthetic:** harmony, order, and beauty

**Self-realization:** morality, creativity, problem solving, acceptance, spontaneity

**Transcendence:** connection to “oneself, to significant others, to human beings in general, to other species, to nature, and to the cosmos”

1 Empirical

- Observation of isolated facts
- Description of relationships
- e.g. Botany, naming stars

2 Theoretical

- Abstract models and theories
- Usually expressed in mathematical formulas
- Correct predictions validate the models
- e.g. Mendel’s laws of inheritance, Newton’s law of Gravity

3 Simulation Based

- Models that cannot be expressed in formulas
- Formulas that cannot be solved
- e.g. Protein structure prediction, Genetic Algorithms

4 Data Based

- Discovering patterns hidden in data
- Huge volumes of data
- Complex interactions
- e.g. Bioinformatics, Data mining

The first part was about deterministic simulations

- Same input gives same output
- See the consequences of natural laws
- See the behavior of a system

This part **is not** in the exam. It will be on the makeup

This is what we did after the midterm

- Output can be different every time
- There are some rules
- We see *samples*
- We want to understand *populations*

We never see the *population*: it is too big

We only see *samples*: small size `n`

We study the relationship between two things:

*Probability*: proportion of each *outcome* in the *population*

*Frequency*: proportion of each *outcome* in the *sample*

We can *simulate experiments* with a given probability and see what can be the frequencies of the outcomes

This helps us to *design our experiment*

*Events* are functions returning TRUE or FALSE, like

- does the die show an even number?
- can we assemble the full genome?
- will you pass this course?

We can use simulations to *estimate the probability* of an event
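For instance, a minimal sketch in R (the die example and variable names are mine, not from the lecture): estimate the probability that a die shows an even number by simulating many rolls and computing the frequency of the event.

```r
# Estimate P(die shows an even number) by simulation
set.seed(42)                                # make the run reproducible
rolls <- sample(1:6, size=10000, replace=TRUE)
is_even <- rolls %% 2 == 0                  # the event: TRUE or FALSE
freq <- mean(is_even)                       # frequency of the event
freq                                        # should be close to 1/2
```

With `n = 10000` rolls, the frequency lands very close to the true probability 1/2; with small `n` it fluctuates much more.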

If we did an experiment or a simulation \(n\) times, and we saw an *event* \(m\) times, what did we learn about the population?

We cannot know the *exact probability* \(p\) but we can find a range for it

\[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]
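In R, this formula can be wrapped in a small helper (the function name `prob_interval` is my own, not part of the course code; the argument values are an illustration):

```r
# Interval for the unknown probability p, given m events in n trials,
# using the formula above: (m+2)/(n+4) +/- 2*sqrt(p_hat*(1-p_hat)/(n+4))
prob_interval <- function(m, n) {
  p_hat <- (m + 2) / (n + 4)
  sigma <- sqrt(p_hat * (1 - p_hat) / (n + 4))
  c(p_hat - 2 * sigma, p_hat + 2 * sigma)
}
prob_interval(12, 100)   # e.g. 12 events in 100 trials
```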

This part is very practical

This is a specific technique to solve complex optimization problems

The important part is to choose a good *fitness function*

We will see more later

Imagine you are the main researcher in the ecology of a lake

You need to know how many fish there are in the lake

You catch some fish, put a mark on them and return them to the lake

You catch some fish again, how many of them are marked?

Let’s call `L` the number of fish in the *lake*

We catch a group of `n` fish and we mark them

We catch another group of `n` and find that `m` of them are marked

What can the value of `m` be?

We catch `n` fish and we mark them 1 to `n`

The second time we catch `n`, how many have a number between 1 and `n`?

Assuming that fishing is random, our catch is

```r
sample.int(L, size=n)
```

We care how many of them are less than or equal to `n`

```r
sum(sample.int(L, size=n) <= n)
```
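If we pretend for a moment that we know `L`, we can repeat the second catch many times and look at the spread of `m` (the concrete values `L = 1000` and `n = 100` here are mine, chosen for illustration):

```r
set.seed(1)
L <- 1000   # assumed true number of fish, for this simulation only
n <- 100    # catch size
# repeat the second catch 1000 times, counting marked fish each time
ms <- replicate(1000, sum(sample.int(L, size=n) <= n))
mean(ms)    # close to the expected value n * n / L = 10
table(ms)   # the whole distribution of m
```

Even with the probability `n/L` fixed, `m` varies from catch to catch: this is exactly the sample-versus-population gap discussed above.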

`n` and `L`

The catch size `n` plays two roles here

First, it fixes the probability for a marked fish: `n/L`

Second, it is the sample size that gives `m` TRUE events

What is the relationship between frequency and probability?

Unless we catch **all** the fish in the lake, we cannot know `L` exactly

So we have to *estimate* `n/L` based on `m` and `n`, using the formula

\[\frac{m+2}{n + 4}\pm 2{\sqrt{\frac{m+2}{(n + 4)^2}\left(1-\frac{m+2}{n + 4}\right)}}\]

```r
m <- 25
n <- 500
sigma <- sqrt((m+2)/(n+4)^2 * (1-(m+2)/(n+4)))
p_est <- c((m+2)/(n+4) - 2*sigma, (m+2)/(n+4) + 2*sigma)
p_est
```

[1] 0.03351169 0.07363117

We are 95% sure that `n/L` is in this interval

We got an interval for `n/L`. It is easy to see that \[L = \frac{n}{p\_est}\] Thus we can estimate the number of fish as

```r
n/p_est
```

[1] 14920.167 6790.603
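Notice that the endpoints come out in reverse order: the *lower* end of the interval for `n/L` gives the *upper* end of the interval for `L`. Sorting makes that explicit (the numeric values are copied from the output above):

```r
# p_est is the interval found above for n/L
n <- 500
p_est <- c(0.03351169, 0.07363117)
L_est <- sort(n / p_est)   # interval for L, now in increasing order
L_est
```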

In all our genetic algorithms we have

- `initial_pop`
- `score_all`
- `ranking`
- `who_survives`
- `next_gen`
- `crossover`

`initial_pop`

```r
initial_pop <- function(N, m, values) {
  pop <- list()
  for (i in 1:N) {
    pop[[i]] <- sample(values, size=m, replace=TRUE)
  }
  return(pop)
}
```

you have to say which `values` are valid
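For instance (the concrete arguments are mine; `initial_pop` is restated so the snippet runs on its own): a population of 5 vectors of length 3, each component drawn from `1:10`:

```r
initial_pop <- function(N, m, values) {
  pop <- list()
  for (i in 1:N) pop[[i]] <- sample(values, size=m, replace=TRUE)
  return(pop)
}
set.seed(7)
pop <- initial_pop(N=5, m=3, values=1:10)
length(pop)    # 5 individuals
lengths(pop)   # each of length 3
```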

`score_all`

```r
score_all <- function(pop, fitness) {
  score <- rep(NA, length(pop))
  for (i in 1:length(pop)) {
    score[i] <- fitness(pop[[i]])
  }
  return(score)
}
```

you have to give a `fitness` function

`who_survives`

```r
who_survives <- function(pop, ranking, min) {
  survival <- rep(NA, length(pop))
  range <- max(ranking) - min(ranking)
  for (i in 1:length(pop)) {
    p <- (ranking[i] - min(ranking)) / range
    survival[i] <- sample(c(!min, min), size=1, prob=c(p, 1-p))
  }
  return(survival)
}
```

`min` is TRUE if you look for a *minimum*, FALSE for a *maximum*

`next_gen`

```r
next_gen <- function(pop, survival, mutation) {
  parents <- pop[survival]
  for (i in 1:length(pop)) {   # length(pop) instead of the global N
    if (!survival[i]) {
      j <- sample.int(length(parents), size=1)
      pop[[i]] <- parents[[j]]        # cloning
      pop[[i]] <- mutation(pop[[i]])  # mutation
    }
  }
  return(pop)
}
```

Give a proper `mutation` function

`crossover`

```r
crossover <- function(pop, survival, mutation) {
  parents <- pop[survival]
  for (i in 1:length(pop)) {   # length(pop) instead of the global N
    if (!survival[i]) {
      mom_dad <- sample.int(length(parents), size=2)
      m <- length(pop[[i]])            # vector length
      co <- sample.int(m-1, size=1)    # cross-over place
      pop[[i]] <- c(parents[[mom_dad[1]]][1:co],
                    parents[[mom_dad[2]]][(co+1):m])
      if (sample(c(FALSE, TRUE), 1)) { # maybe mutation
        pop[[i]] <- mutation(pop[[i]]) # mutation
      }
    }
  }
  return(pop)
}
```

We know that \(\sqrt{2}\) is irrational. Nevertheless, we want to find two integer numbers `x` and `y` such that \[\frac{x}{y}\approx \sqrt{2}\] In that case we will have \[\frac{x^2}{y^2}-2\approx 0\] Since the result cannot be exactly 0, we will look for an approximate solution

We want to minimize the absolute value \[\left\vert\frac{x^2}{y^2}-2\right\vert \qquad x,y\in\mathbb N\]

To make it interesting, we want \(x>100\) and \(y>100\)

What would be a good fitness function?

`fitness` can include logic conditions

```r
fitness <- function(v) {
  x <- v[1]
  y <- v[2]
  too_small <- x <= 100 | y <= 100
  return(abs(x^2/y^2 - 2) + too_small * 1e6)
}
```
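To check that the penalty works, we can evaluate this fitness on a candidate that violates the constraint and on one that satisfies it (the test vectors are my own; `fitness` is restated so the snippet runs on its own):

```r
fitness <- function(v) {
  x <- v[1]
  y <- v[2]
  too_small <- x <= 100 | y <= 100
  return(abs(x^2/y^2 - 2) + too_small * 1e6)
}
fitness(c(50, 200))    # x <= 100: the 1e6 penalty dominates
fitness(c(577, 408))   # a good approximation of sqrt(2): tiny value
```

Candidates outside the valid region get such a huge score that they are quickly eliminated by selection.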

`mutation` function

All `mutation` functions are similar.

This one can be adapted easily to other cases

```r
mutation <- function(v) {
  k <- sample.int(length(v), size=1)     # do not change
  v[k] <- v[k] + sample(-10:10, size=1)  # change this
  return(v)                              # do not change
}
```
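A quick sanity check (the input vector is my own example; `mutation` is restated for a self-contained run): a mutated vector keeps its length and changes at most one component, by at most 10:

```r
mutation <- function(v) {
  k <- sample.int(length(v), size=1)
  v[k] <- v[k] + sample(-10:10, size=1)
  return(v)
}
v <- c(500, 500)
w <- mutation(v)
length(w)         # still 2
sum(abs(w - v))   # at most 10
```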

`m` and `values`

The last thing we need to decide is the vectors’ size `m` and the values that each component can take.

In this case

```r
m <- 2
values <- 101:1000
```

Now we are ready to proceed

```r
N <- 1000
num_generations <- 300
best_score <- rep(NA, num_generations)
pop <- initial_pop(N, m, values)
for (i in 1:num_generations) {
  score <- score_all(pop, fitness)
  best_score[i] <- min(score)   # we are minimizing, so the best is the minimum
  ranking <- rank(score, ties.method = "random")
  survival <- who_survives(pop, ranking, min=TRUE)
  pop <- next_gen(pop, survival, mutation)
}
```

```r
plot(best_score)
```

```r
min(score)
```

[1] 6.007287e-06

The best individual is the one with `ranking==1`

```r
pop[ranking==1]
```

[[1]]
[1] 816 577

This is a list with one vector. Let’s see if it works

```r
x <- pop[ranking==1][[1]][1]
y <- pop[ranking==1][[1]][2]
x/y
```

[1] 1.414211

Not bad