April 27, 2018

## To simulate we need to know

• What are all possible outcomes
• What is the probability of each outcome
• We usually do not know. We have to “guess”
• Is there any symmetry?
• Can we decompose in smaller parts?
• How big is our sample size?
• How many replicas?

## All simulations are similar

experiment <- function(size) {
sample(outcomes, prob, size, replace=TRUE)
}

If the outcomes are integers 1 to N, we can use

experiment <- function(size) {
sample.int(N, prob, size, replace=TRUE)
}

## Example: Simulate DNA

• What are the outcomes?
• What are the probabilities?
• What is the sample size?

## Example GC content in a window

Remember that we had

gc_content <- function(dna, start, window) {
freq <- table(dna[seq(from=start, length.out=window)])
ans <- (freq["g"]+freq["c"])/window
return(ans)
}

## We get GC content in windows

We choose a fixed value of window. Then we do

window <- 1000
position <- seq(from=1, to=length(dna), by=window)
gc_cont <- rep(NA, length(position))
for(i  in 1:length(position)) {
gc_cont[i] <- gc_content(dna, position[i], window)
}

## GC content in E.coli

Please read the data using this code, then plot it

library(seqinr)
dna <- sequences[[1]]

## What is the GC content of a random window

Some GC content are very high or very low. Is that normal?

How can we find the GC content of a random window?

## Probability of each GC content?

If we make an histogram of GC content we see the empirical frequency

h <- hist(gc_cont)

We can later use that frequency to simulate GC content

How can we look inside the histogram?

## Looking inside the histogram

h
$breaks [1] 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56 [16] 0.58 0.60 0.62 0.64 0.66 0.68$counts
[1]   3  11  23  26  50  57  85 169 218 398 591 915 970 711 299  89  14   7   4
[20]   2

$density [1] 0.03231366 0.11848341 0.24773804 0.28005170 0.53856097 0.61395950 [7] 0.91555364 1.82033606 2.34812581 4.28694528 6.36579061 9.85566566 [13] 10.44808272 7.65833692 3.22059457 0.95863852 0.15079707 0.07539854 [19] 0.04308488 0.02154244$mids
[1] 0.29 0.31 0.33 0.35 0.37 0.39 0.41 0.43 0.45 0.47 0.49 0.51 0.53 0.55 0.57
[16] 0.59 0.61 0.63 0.65 0.67

$xname [1] "gc_cont"$equidist
[1] TRUE

attr(,"class")
[1] "histogram"