Class 17: Virtual experiments

Computing for Molecular Biology 2

Andrés Aravena, PhD

7 May 2021

sample() function

We will use the function

sample(x, size, replace, prob)
  • x is a vector of possible outcomes
  • size is the number of elements to take
  • replace is TRUE to allow repeated outcomes
  • prob is the proportion of each possible outcome in a very large population

Use cases

  • sample(x) just shuffles x
  • sample(x, size) first shuffles x, then takes size elements
  • sample(x, size, replace=TRUE) allows us to have size larger than length(x)
  • sample(x, size, replace=TRUE, prob) changes the probability of each possible outcome

Simulating random systems

Making experiments in the computer

Simulating DNA

First, let’s give a name to the possible outcomes

DNA_alphabet <- c("A","C","G","T")

The easiest way to get a simulated DNA sequence is

sample(DNA_alphabet, replace=TRUE, size=5)
[1] "G" "G" "G" "C" "G"

Replicating the experiment

We want to see what happens with our experiment if we could replicate it several times

The easiest way is to use replicate()

replicate(3, sample(DNA_alphabet, replace=TRUE, size=5))
     [,1] [,2] [,3]
[1,] "C"  "T"  "G" 
[2,] "C"  "C"  "T" 
[3,] "C"  "C"  "A" 
[4,] "G"  "A"  "G" 
[5,] "A"  "C"  "G" 

replicate, not repeat

This function can be confused with rep() With replicate, the function sample is used 3 independent times

replicate(3, sample(DNA_alphabet, 5, replace = TRUE))
     [,1] [,2] [,3]
[1,] "A"  "G"  "A" 
[2,] "T"  "T"  "C" 
[3,] "A"  "C"  "G" 
[4,] "A"  "G"  "T" 
[5,] "A"  "C"  "C" 

replicate, not repeat

With rep, the function sample is used one time, and the result is copied 3 times. It is not random

rep(sample(DNA_alphabet, 5, replace = TRUE), 3)
 [1] "A" "G" "G" "A" "T" "A" "G" "G" "A" "T" "A" "G" "G" "A" "T"

Also, notice that the number of replicas/repetitions is written in different places

replicate output can be a list

By default, the output is simplified into a matrix. To get a list instead, we use

replicate(3, sample(DNA_alphabet, 5, replace = TRUE), simplify=FALSE)
[[1]]
[1] "G" "T" "C" "A" "G"

[[2]]
[1] "A" "A" "C" "G" "G"

[[3]]
[1] "T" "A" "G" "A" "G"

Changing GC content.

The GC content of C.rudii is 17%. To simulate its DNA, we need to indicate the proportion of each nucleotide.

The vector prob must have the same length as DNA_alphabet

gc <- 0.17
at <- 1-gc
sample(DNA_alphabet, 100, replace = TRUE, prob=c(at/2, gc/2, gc/2, at/2))
  [1] "A" "A" "T" "G" "A" "T" "A" "A" "A" "T" "T" "T" "T" "T" "T" "T"
 [17] "T" "T" "A" "T" "A" "A" "T" "A" "A" "A" "T" "T" "T" "A" "T" "C"
 [33] "G" "G" "A" "A" "T" "A" "T" "A" "A" "T" "A" "T" "T" "T" "A" "T"
 [49] "C" "T" "G" "G" "T" "A" "A" "C" "A" "A" "C" "T" "A" "T" "C" "T"
 [65] "A" "T" "A" "A" "A" "A" "C" "A" "A" "A" "T" "T" "G" "T" "T" "T"
 [81] "T" "T" "T" "C" "T" "A" "A" "A" "A" "G" "A" "A" "A" "A" "T" "G"
 [97] "T" "A" "A" "A"

We will use this DNA simulator later

Coins

We call “coin” any experiment that has two possible outcomes

  • Heads, “H”, success, alive, healthy, 1
  • Tails, “T”, failure, dead, sick, 0

A fair coin has 50% probability for each outcome

The interesting case is when the probabilities are different

Several Identical Coins

Let’s try a 10% coin

sample(c("H","T"), replace=TRUE, size=15, prob=c(0.1, 0.9))
 [1] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T"

and an 80% coin

sample(c("H","T"), replace=TRUE, size=15, prob=c(0.8, 0.2))
 [1] "H" "H" "T" "H" "H" "H" "H" "H" "T" "H" "T" "H" "H" "H" "H"

Remember to write all probabilities

Counting Number of successes

If each coin is a test, we want to know how many tests were successful.

coins <- sample(c("H","T"), replace=TRUE, size=20, prob=c(0.3, 0.7))
sum(coins=="H")
[1] 5

This simulates, for example, growing 20 plants and counting how many of them survive

Number of successes

We can wrap all in a function like this

num_successes <- function(size, p) {
    coins <- sample(c("H","T"), replace=TRUE, size=size, prob=c(p, 1-p))
    return(sum(coins=="H"))
}

or, better

num_successes <- function(size, p) {
    coins <- sample(c(1, 0), replace=TRUE, size=size, prob=c(p, 1-p))
    return(sum(coins))
}

(notice that you must say size=size)

Simulating several experiments

several <- replicate(1000, num_successes(20, 0.3))
print(several)
   [1]  6  3  9  5  8  6  5  5  7  7  4  5  6  6  5  5  9  6  9  4 10
  [22]  9  9  7  2  5  8  3  6  3  6  6  6  5  7  7  7  3  7  5  4  2
  [43]  7  7  9  5  7  4  5  7  8  4  5  7  8  6  3  3  4  3  9  7  9
  [64] 12  8  5  4  5  5  5  5  9  8  3  9  6  8  9  5  5  5  9  6  5
  [85]  6  4  3  5  5  3  5  7  8  6  4  9  5  5  8  9  3  7  7  5  6
 [106]  2  7  5  6  8  4  5  6  4  8  5 10  4  5  7  5  4  7  4  7  9
 [127]  8  4  7  6 10  7  7  6  8  8  4  6  5  6  8  7  8  7  6  7  5
 [148]  4  6  6  6  6  6  6  5  6  6  8  6  2  7  5  1  5  8  3  8  7
 [169] 10  3  4  3  8  7  6  7  6  8 10  6  7  5  8 10  2  6  8  6  7
 [190]  5  8  8  4  7  5  8  5  4  6  5  2  4  7  7  3  6  2  4  7  6
 [211]  8  9  4  8  1  5  7  2 10  4 10  7  9  2  5  7  7  8  6  6  4
 [232]  3  5  4  6  3  7  4  6  7  9  9  7  6  6  7  9  5  4 10  8  6
 [253]  3  3  6  4  1  6  9  4  3  8  6  7  4  6  6  7  8  1  6  6  5
 [274]  4  4  7  6  6  6  5  6  5  7  5 10  6  7  8  6  7  9  8  4  7
 [295]  3  4  8  8  6  9  8  4  2  3  2  4  4 11  2  5  4 10  8  5  9
 [316]  4  6  4  6  6  7  6  8  7  9  6  5  5  6  5  3  4  6  4  8  9
 [337]  5  3  5  6  8  6  5  6  8  1  7  9  5  8  7  4  5  7  8  3 10
 [358]  7  4  7  2  6  6  7  9  7  5  3  5  3  3  5  5  4  8  3  4  5
 [379]  7  6  9  4 10  7  7  6  8  5  6  6  8  7  3  4  7  6  4  8  7
 [400]  8  4  4  6  8  6  4  4  5  4  5  5  5  4  4  6  4  8  8  7  7
 [421]  6  7  5  7  4  3  5  7  5  5  6  3  6  5  5  5  6  7  5  7  7
 [442]  7  5  9  8  5  4  8  5  5  4  4 10  4  3  5  8  6  5  7  4  3
 [463]  9  8  6  4  6  4  7  5  3  5  8  4  5  7  7  7  7  6  5  6  6
 [484]  7  6  9  9  4  6  7  9  2  7  7  3  9  4  9  4  7  4  3  4  7
 [505]  6  6  9  4  5 10  4  6  4  3  5  8  6  4  4  7  9 10  3  6  7
 [526]  5  4  2  4  9  6  7 10  7  7  7  7  7  6  5  8  7  2  8  5  6
 [547]  5  9  4  9  6  3  8  6  6  6  9  7  7  4  8  6  9  7  8 10  5
 [568]  7  5  7  8  9  5  4  5  6  3  6  6  7  4  8  7  8  9  4  4  6
 [589]  4  9  6  6  7  4  4  6  5  7  8  5  5  2  3 10  7  7  7  6  8
 [610]  6  5  6  7  7  7  7  6  8  5  5  3  6  5  7  5  6  4  6  9  4
 [631]  5  6  6  8  7 11  4  5  7  4  9  5  6 11  2  6  8  6  7  3  6
 [652]  9  4  5  5  7  5  6  4  6  8  2  5  4  5  5  5  4  5  3  7  7
 [673]  3  6  5  6  8 10  8  3  7  8  3  3  9  3  6  7  8  3  9  8 11
 [694]  8  8  7  4  2  5  5  5  4  6  5  5  5  9  6  5  7  4  5 10 11
 [715]  5  9  3  9  4  5  9  6  4  7  4  3  6  7  3  8  8  4  7  6  5
 [736]  6  4  6  7  6  5  2  8  4  4  7  5  5  9  7  5  4  6  5  7  8
 [757]  5  8  5  9  6  4  5  3  6  6  9  3  6  4  7  5  2  5  9  5  5
 [778]  6  7  2  8  2  5  4  8  4  6  7  7  8  5  7  4  9  3  5  4  9
 [799]  3  8  7  7  7  8  6  5  5 10  5  3  4  6  7  6 10  5  7  7  7
 [820]  4  9  8  9 10  5  3  6  5  9  8  3 10  7  7  4  7  6  8  8  7
 [841]  5  7  6  2 10  4  6  8 10  5  5  6  6  9  8  8  2  3  4  4 10
 [862]  9  4  4  4  7  3  7  6  6  6  7  7 10  1  7  9  8  7  6  5  8
 [883]  8  3  6  5  3  5  4  5  6  6  6  5  9  9  3  5  7  6  5  3  8
 [904]  4  8  5  6  4  5  4  6  6 10  4  3  4  3  3 10  3  5  4  2  4
 [925]  3  4  3  6  5  5  7 11  7  5  7  6  9  1  9  5  3 10  4  5  8
 [946]  8  6  3  4  5  5  6  4  8  8  9  6  7  5  3  6  3  8  5  3  5
 [967]  3  8  4  7  8  6  2  5  7  6  5  6  9  7  4  6  4  6  6  7  7
 [988]  3  6  6  5  8  2  5  7  4  4  6  6  3

Graphically

barplot(table(several))

A smarter way

Counting the number of successful tests is one of the most common cases in experimental sciences

Thus, there is a better way to simulate this case

replicate(100, rbinom(size=20, prob=0.3, n=1))

or better

rbinom(size=20, prob=0.3, n=1000)
   [1]  4  9  8  8  6  4  5  7  4  6  7  7 13  5  8  5  4  8  8  6  7
  [22]  7  4  5  5  1 10  5  8  6  4 10  8  6  6  2  9  5  5  4  8  7
  [43]  4  5  8  7  6  8  6  4  7  8  6  4  3  6  5  5  5  2  7  5  7
  [64]  2  5  3  4  6 11  7  7  7  4  5  3  7  3  8  5  5  6  7  8  3
  [85]  6  5  3  6  8  7  6  4  5  5  6  6  6  8  6  8  4  6  5  7  5
 [106]  7  4  4 10 11  2  6  4  7  3  3  6  5  4  6  8  8  7  7  7  6
 [127]  6  7  4  7  3  8  5  8  5  7  5  4  6  5  7  5  3  9  8  5  5
 [148]  5  8 10  2  9  7  6  5  7  4  4  2  7  6  5  3  6  6  4  4  2
 [169]  4  3  6  8  5  3  5  5  6  4  5  6  8  7  9  5  4  4  6  7  6
 [190]  6  5  9  4  6  9  8  5  5  3  8  7  6  5  5  6  6  6  6  4 10
 [211]  9  8  5  5  3  6  6  8  7  3  6  6  6  6  4  8  4  7  5  8  4
 [232]  7  6  8  7 11  5  6 10  8  7  8  8  3  5  6  8  5  2  4  5  4
 [253]  3  5  2  5  5  8  5  7  5  6  4  6  5 10  7  3  4  6  3  7  7
 [274]  6  7  5 10  7  9  6  5  5  4  6  5  6  7  6  7  8  7  8  7  8
 [295]  4  5  5  8  4  6  7  8  8  7  4  5  6  6  7  5  4  5  2 10  2
 [316]  9  2  5  4  5  9  9  7  6  5  7  5  6  5  8  6  7  6  5  6  8
 [337]  4  8  4  7  4  9  7  5  5  6  6  5  9  8  6  6  4  5  9  1  8
 [358]  6  4  6  5 12  6  4  8  8  3  7 10  6  4  7  5  8  8  6  4  5
 [379]  6  5  4  2  7  5  8  6  6  9  6  7  7  8  7  4  6  7  7  4  5
 [400]  7  6 10  7  3  8  6  6  6  7  7  7 10  7  2  5  8  9  5  4  6
 [421]  5  8  8  9  7  9  6 10  5  2  8  5  6  5  5  7  3  4 10  5  4
 [442]  6  1  2  2  7  9  6  6  8  8  4  2  6  5  8  4  2  7  8  4  6
 [463]  8  3  4  4  6  4  7  6  4  9  4  6  6  5  8  5  5  7  5  6  2
 [484]  7  8  6  4  6  8  4  7  4  7  4  1  6  3  6  6 11  8  8  7  8
 [505]  6  7  7  5  7  7  6  6  5  4  4  7  3  4  7 11  4  5  8  6  6
 [526]  4  6  3  5  7  6  7  3  9 10  4  7  5  6 11  8  8  5  6  7  5
 [547]  4  5  9  4  8  8  4  6  8  5  8  4  8 10  8  7  5  5  7  6 10
 [568]  4  9  5  6  4  4  8  4  8  9  8 10  5  8  5  7  0  9  7  6  4
 [589]  7 10  8  5  7  6  5  8  0  5  7  7  4  8  5  6  9 10 12  5  6
 [610]  8  3  7  7  6  4  9  7  6  4  6  5  6  5  7  8  4  6  3  5  4
 [631]  7  5  8  8  6  7  9  6  8  8  7  8  6  3  8  6  5  6  5  9  5
 [652]  5  4  6  9  8  6  9 10  8  4  6  7  7  4  2  4  3  7  3  7  8
 [673]  5  8  6  8  5  7  4  6  6  7  4  7  9  9  6  4  4  7 10  3  7
 [694]  2  7  3  7  2  5  4  6  3  8 10  8  8  5  5  3  5  1  9  7  6
 [715]  7  4  2  6 10  6  9  6  3 11  4  8  2  7  9  5  6  9  7  8  6
 [736]  3  6  5  4  9  1  2  8  5  7  6  3  4  6  9  6  5  9  5  9  3
 [757]  7  6  3  5  4  5  4  5  7  8  4  5  5  5  5  6  4  7  6  2  3
 [778]  3  4  6  8  8  6  6  6  5  6  6  4  8  7  8  5  4  6  8  4  5
 [799]  6  7  8  6  7  6  8  7  3 11  7  6  8  4  4  8  5  5  8  6  6
 [820]  4  7  7  6  3  8  4  8  7  8  9  9  8  6  8  9  3  7  7  7  4
 [841]  8  6  5  9  7  3  7  8  8  2  5  3  3  6  4  9  5  5  9  4  5
 [862]  6  8  8  8  8 10  6  4  8  5  6  7  6  4  6  6  8  5  4  4  4
 [883]  7  5  7  6  7  7  5  7  5 10  8  5  2 10  4  9  8  8 10  7  9
 [904]  6  9  2  4  3  8  4  6  6  8  9  6  4  4  7  7  2  6 11  5  6
 [925]  4  2  8  8  8  9  5  8  9  8  6  5  8  9  6  5  5  2  4 10  5
 [946]  8  7  7  3  3  9  8  5  6  5  8  6 10  8  5  8  7  5  9  5  8
 [967]  3  6  1  5  6  7  4  7  4  4  3  3  6  7  8  6  8  4  7  6  5
 [988]  8  6  7  3  6  1  5  8  6  6  5  8  6

This is a Binomial distribution

library(magrittr)
rbinom(size=20, prob=0.3, n=1000) %>% table %>% barplot()

Random Variables

When the outcomes are numbers

DNA and coins are random systems that produce symbols

Other cases, like num_sucesses, produce random numbers

When the outcomes are numbers, we say that the result is a random variable

There are several functions to generate random variables with different distributions

Random real numbers

So far we used sample to mix integer numbers, and rbinom to simulate number of successes

We can also generate real random numbers, with different distributions

In this course we will use

  • Uniform distribution
  • Normal distribution

Uniform Distribution

We use this method when we only know the range where the random variable can be

To generate 12 random real numbers between 0 and 1, we use

runif(n=12)
 [1] 0.43632997 0.85091620 0.65410589 0.26664259 0.09386964 0.83004632
 [7] 0.23017404 0.81131515 0.82213878 0.76379385 0.02908118 0.74287759

We can choose any range for the numbers

runif(n=12, min=15, max=23)
 [1] 17.65299 17.58306 20.47430 21.57560 15.89224 20.56096 17.84828
 [8] 15.04770 22.23678 20.03353 15.25201 20.63179

Graphically

runif(n=10000, min=15, max=23) %>% hist(nclass=30)

Normal distribution

We use this method when we know that the random variables are located around a mean and have a dispersion sd

To generate 14 random numbers with a Normal distribution centered on 40 with a dispersion of 5, we use

rnorm(n=14, mean=40, sd=5)
 [1] 38.51761 41.32713 32.43647 40.26195 45.84640 40.65946 37.46025
 [8] 47.69482 40.99083 34.53998 42.94042 36.80723 36.84783 42.61253

We will study this case in detail, but later.

Graphically

rnorm(n=1000, mean=40, sd=5) %>% hist(nclass=30)

Summary

  • sample(x,…) can mix any set x of possible outcomes
  • rbinom(n, p, size) gives the number of successes in size trials, n times
  • runif(n, min, max) gives n real numbers between min and max
  • rnorm(n, mean, sd) gives n real numbers centered around mean, with dispersion sd
    • more details later