April 27, 2018

Simulating experiments

To simulate we need to know

  • What are all possible outcomes
  • What is the probability of each outcome
    • We usually do not know. We have to “guess”
    • Is there any symmetry?
    • Can we decompose in smaller parts?
  • How big is our sample size?
  • How many replicas?

All simulations are similar

experiment <- function(size) {
    sample(outcomes, prob, size, replace=TRUE)
}

If the outcomes are integers 1 to N, we can use

experiment <- function(size) {
    sample.int(N, prob, size, replace=TRUE)
}

Example: Simulate DNA

  • What are the outcomes?
  • What are the probabilities?
  • What is the sample size?

Example GC content in a window

Remember that we had

gc_content <- function(dna, start, window) {
  freq <- table(dna[seq(from=start, length.out=window)])
  ans <- (freq["g"]+freq["c"])/window
  return(ans)
}

We get GC content in windows

We choose a fixed value of window. Then we do

window <- 1000
position <- seq(from=1, to=length(dna), by=window)
gc_cont <- rep(NA, length(position))
for(i  in 1:length(position)) {
    gc_cont[i] <- gc_content(dna, position[i], window)
}

GC content in E.coli

Please read the data using this code, then plot it

library(seqinr)
sequences <- read.fasta("NC_000913.fna")
dna <- sequences[[1]]

What is the GC content of a random window

Some GC content are very high or very low. Is that normal?

How can we find the GC content of a random window?

Probability of each GC content?

If we make an histogram of GC content we see the empirical frequency

h <- hist(gc_cont)

We can later use that frequency to simulate GC content

How can we look inside the histogram?

Looking inside the histogram

h
$breaks
 [1] 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56
[16] 0.58 0.60 0.62 0.64 0.66 0.68

$counts
 [1]   3  11  23  26  50  57  85 169 218 398 591 915 970 711 299  89  14   7   4
[20]   2

$density
 [1]  0.03231366  0.11848341  0.24773804  0.28005170  0.53856097  0.61395950
 [7]  0.91555364  1.82033606  2.34812581  4.28694528  6.36579061  9.85566566
[13] 10.44808272  7.65833692  3.22059457  0.95863852  0.15079707  0.07539854
[19]  0.04308488  0.02154244

$mids
 [1] 0.29 0.31 0.33 0.35 0.37 0.39 0.41 0.43 0.45 0.47 0.49 0.51 0.53 0.55 0.57
[16] 0.59 0.61 0.63 0.65 0.67

$xname
[1] "gc_cont"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"