April 27, 2018
experiment <- function(size) {
sample(outcomes, prob, size, replace=TRUE)
}
If the outcomes are integers 1 to N, we can use
experiment <- function(size) {
sample.int(N, prob, size, replace=TRUE)
}
Remember that we had
gc_content <- function(dna, start, window) {
freq <- table(dna[seq(from=start, length.out=window)])
ans <- (freq["g"]+freq["c"])/window
return(ans)
}
We choose a fixed value of window. Then we do
window <- 1000
position <- seq(from=1, to=length(dna), by=window)
gc_cont <- rep(NA, length(position))
for(i in 1:length(position)) {
gc_cont[i] <- gc_content(dna, position[i], window)
}
Please read the data using this code, then plot it
library(seqinr)
sequences <- read.fasta("NC_000913.fna")
dna <- sequences[[1]]
Some GC content are very high or very low. Is that normal?
How can we find the GC content of a random window?
If we make an histogram of GC content we see the empirical frequency
h <- hist(gc_cont)
We can later use that frequency to simulate GC content
How can we look inside the histogram?
h
$breaks [1] 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56 [16] 0.58 0.60 0.62 0.64 0.66 0.68 $counts [1] 3 11 23 26 50 57 85 169 218 398 591 915 970 711 299 89 14 7 4 [20] 2 $density [1] 0.03231366 0.11848341 0.24773804 0.28005170 0.53856097 0.61395950 [7] 0.91555364 1.82033606 2.34812581 4.28694528 6.36579061 9.85566566 [13] 10.44808272 7.65833692 3.22059457 0.95863852 0.15079707 0.07539854 [19] 0.04308488 0.02154244 $mids [1] 0.29 0.31 0.33 0.35 0.37 0.39 0.41 0.43 0.45 0.47 0.49 0.51 0.53 0.55 0.57 [16] 0.59 0.61 0.63 0.65 0.67 $xname [1] "gc_cont" $equidist [1] TRUE attr(,"class") [1] "histogram"