May 2, 2018

Other ways to get GC content

Remember that we had

gc_content <- function(dna, start, window) {
  freq <- table(dna[seq(from=start, length.out=window)])
  ans <- (freq["g"]+freq["c"])/window
  return(ans)
}

What if there is no "g"? Then freq["g"] is NA and our function does not work

dna=="g" is an Event

We can use sum(dna=="g") to count the number of "g"

gc_content <- function(dna, start, window) {
  fragment <- dna[seq(from=start, length.out=window)]
  ans <- (sum(fragment=="g")+sum(fragment=="c"))/window
  return(ans)
}

Events can be combined

The event we really want to see is when either we have "g" or we have "c"

In R we use | to say “or”

gc_content <- function(dna, start, window) {
  fragment <- dna[seq(from=start, length.out=window)]
  ans <- (sum(fragment=="g" | fragment=="c"))/window
  return(ans)
}

We calculate GC content in windows

GC content is not random

Clearly both plots are different

If gc_cont was random it should look like sim_gc_cont

We can measure “how random” by comparing each gc_cont to all sim_gc_content

Probability of being too small

pval <- rep(NA, length(gc_cont))
for(i in 1:length(gc_cont)) {
    pval[i] <- sum(sim_gc_cont<gc_cont[i])
}
pval <- pval/length(gc_cont)

How many gc_cont are too small?

Now we can see how many elements gc_cont are too small

If all is random, 1% of them should have pval under 1%

sum(pval<0.01)
[1] 871
sum(pval<0.01)/length(pval)
[1] 0.1876751

Too many! We expect 1%, we get 18%

Homework

  • Which are the windows where pval<0.01
  • How do you find the windows where gc_cont is too big
  • What are the genes there?

Atypical GC comes in groups

Finding atypical groups

First, we have to find atypical windows

In this case an atypical window has pval<0.01

max(gc_cont[pval<0.01])
[1] 0.472

So an atypical window has gc_cont<0.471

This is an event. We want to find runs of events

What are the run lengths?

How many consecutive windows are atypical?

What is the length of each run of events?

Think how would you count than

Run length encoding

This question is so common that R has a function for it

Run Length Encoding
Compute the lengths and values of runs of equal values in a vector – or the reverse operation.

Usage: rle(x)

In this case we have…

rle(gc_cont<0.471)
Run Length Encoding
  lengths: int [1:880] 5 1 5 1 5 1 11 1 4 1 ...
  values : logi [1:880] FALSE TRUE FALSE TRUE FALSE TRUE ...

Five FALSE, one TRUE, Five FALSE, …

We only care about TRUE ones

How many of each length?

Let’s save the result so we can handle it

runs <- rle(gc_cont<0.471)
runs$lengths[runs$values]
  [1]  1  1  1  1  1  1  1  1  1  1  1  1  8  1  1  1  2  1  1  1  2  5  1  4  5
 [26]  2  2  2  3  1  1  2  1  1  2  1  1  1  1  1  1  3  1  2  1  1  3  1  1  5
 [51]  1  5  4  1  4  1  2  2  4  1  1  1  1  2  1  1  4  1  1  1  1  2  1  2  1
 [76]  2  1  1  1  3  1  1  1  1  1  1  1  1  3  1  1  1  3  1  1  2  1  3  2  4
[101]  2  2  4  2  1  1  1  1  1  1  2  7 13  1  2  2  1  2  1  1  1  3  1  1  1
[126]  2  1  2  1  1  1  2  2  1  5  1  1  1  2  4  4  4  1  2  1  3  4  3  1  1
[151]  1  1  2  3  3  1  2  1 10  5  3  1  1  4  3  2  3  4  2  3  1  1  1  1  2
[176]  1  1  1  1  2  2  1  1  4  1  1  2  1  3  2  2  1  1  1  1  2  2  2  1  2
[201]  3  1  1  2  1  1  1  1  1  1  3  2  1  2  2  2  3  2  4  2  2  1  2  1  1
[226]  1  2  1 10  1  2  1  1  3  1  1  6  1  1  1  1  1  1  1  2  1  1  2  2  2
[251]  2  3  5  3  1  1  1  2  2  6  4 10  4  1  1  1  1  1  1  1  1  2  1  1  1
[276]  1  1  5  3  3  8  3  1  8  2  1  2  1  1  2  1  1  1 10  1  1  1  2  1  1
[301]  1  1  1  1  1  1  1  1  1  1  1  1  6  1  1  1  1  5  1  1  1  3  2  1  4
[326]  1  1  2  1  2  1  1  1  1  1  1  1  1  1  4  1  2  5  1  1  6  2  1  1  1
[351]  1  3  2  2  1  1  1  1  1  2  1  1  2 12  1  2  1  1  1  1  1  1  1  2  2
[376]  2  1  1  1  1  2  3  4  1  2  1  1  1  1  1  2  1  3  1  2  2  2  2  1  1
[401]  1  1  1  1  1  1  1  1  1  3  1  3  1  1  1  1  1  1  1  1  3  1  4  2  1
[426]  1  6  1  3  5  1  2  1  2  5  1  2  2  1  1

In summary

table(runs$lengths[runs$values])
  1   2   3   4   5   6   7   8  10  12  13 
267  91  34  21  12   5   1   3   4   1   1 

What are the chances of finding groups?

To answer this question we need a different simulation fo E.coli

The histogram of GC content shows the empirical frequency

h <- hist(gc_cont, col="grey")

We can use the frequency to simulate GC content

Looking inside the histogram

h
$breaks
 [1] 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56
[16] 0.58 0.60 0.62 0.64 0.66 0.68

$counts
 [1]   3  10  23  26  50  57  85 169 218 398 591 915 970 711 299  89  14   7   4
[20]   2

$density
 [1]  0.03232062  0.10773540  0.24779142  0.28011204  0.53867701  0.61409179
 [7]  0.91575092  1.82072829  2.34863176  4.28786899  6.36716225  9.85778927
[13] 10.45033398  7.65998707  3.22128852  0.95884508  0.15082956  0.07541478
[19]  0.04309416  0.02154708

$mids
 [1] 0.29 0.31 0.33 0.35 0.37 0.39 0.41 0.43 0.45 0.47 0.49 0.51 0.53 0.55 0.57
[16] 0.59 0.61 0.63 0.65 0.67

$xname
[1] "gc_cont"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

h is a list

names(h)
[1] "breaks"   "counts"   "density"  "mids"     "xname"    "equidist"
h$counts
 [1]   3  10  23  26  50  57  85 169 218 398 591 915 970 711 299  89  14   7   4
[20]   2
h$mids
 [1] 0.29 0.31 0.33 0.35 0.37 0.39 0.41 0.43 0.45 0.47 0.49 0.51 0.53 0.55 0.57
[16] 0.59 0.61 0.63 0.65 0.67

Using h to simulate E.coli

x <- sample(h$mids, size=4641, prob=h$counts, replace = TRUE)

We can do it more realistic

Increasing the number of classes

h <- hist(gc_cont, nclass=100, col="grey")

New simulation

x <- sample(h$mids, size=4641, prob=h$counts, replace = TRUE)

Length of runs in the simulation

sim_runs <- rle(x<0.472)
table(sim_runs$lengths[sim_runs$values])
  1   2   3   4 
543 102  19   5 

So, runs longer than 4 are unexpected

Homework

Where are the long runs in the genome?

What are the genes there?