May 2, 2018

## Other ways to get GC content

Remember that we had

gc_content <- function(dna, start, window) {
freq <- table(dna[seq(from=start, length.out=window)])
ans <- (freq["g"]+freq["c"])/window
return(ans)
}

What if there is no "g"? Then freq["g"] is NA and our function does not work

## dna=="g" is an Event

We can use sum(dna=="g") to count the number of "g"

gc_content <- function(dna, start, window) {
fragment <- dna[seq(from=start, length.out=window)]
ans <- (sum(fragment=="g")+sum(fragment=="c"))/window
return(ans)
}

## Events can be combined

The event we really want to see is when either we have "g" or we have "c"

In R we use | to say “or”

gc_content <- function(dna, start, window) {
fragment <- dna[seq(from=start, length.out=window)]
ans <- (sum(fragment=="g" | fragment=="c"))/window
return(ans)
}

## GC content is not random

Clearly both plots are different

If gc_cont was random it should look like sim_gc_cont

We can measure “how random” by comparing each gc_cont to all sim_gc_content

## Probability of being too small

pval <- rep(NA, length(gc_cont))
for(i in 1:length(gc_cont)) {
pval[i] <- sum(sim_gc_cont<gc_cont[i])
}
pval <- pval/length(gc_cont)

## How many gc_cont are too small?

Now we can see how many elements gc_cont are too small

If all is random, 1% of them should have pval under 1%

sum(pval<0.01)
[1] 871
sum(pval<0.01)/length(pval)
[1] 0.1876751

Too many! We expect 1%, we get 18%

## Homework

• Which are the windows where pval<0.01
• How do you find the windows where gc_cont is too big
• What are the genes there?

## Finding atypical groups

First, we have to find atypical windows

In this case an atypical window has pval<0.01

max(gc_cont[pval<0.01])
[1] 0.472

So an atypical window has gc_cont<0.471

This is an event. We want to find runs of events

## What are the run lengths?

How many consecutive windows are atypical?

What is the length of each run of events?

Think how would you count than

## Run length encoding

This question is so common that R has a function for it

Run Length Encoding
Compute the lengths and values of runs of equal values in a vector – or the reverse operation.

Usage: rle(x)

## In this case we have…

rle(gc_cont<0.471)
Run Length Encoding
lengths: int [1:880] 5 1 5 1 5 1 11 1 4 1 ...
values : logi [1:880] FALSE TRUE FALSE TRUE FALSE TRUE ...

Five FALSE, one TRUE, Five FALSE, …

We only care about TRUE ones

## How many of each length?

Let’s save the result so we can handle it

runs <- rle(gc_cont<0.471)
runs$lengths[runs$values]
  [1]  1  1  1  1  1  1  1  1  1  1  1  1  8  1  1  1  2  1  1  1  2  5  1  4  5
[26]  2  2  2  3  1  1  2  1  1  2  1  1  1  1  1  1  3  1  2  1  1  3  1  1  5
[51]  1  5  4  1  4  1  2  2  4  1  1  1  1  2  1  1  4  1  1  1  1  2  1  2  1
[76]  2  1  1  1  3  1  1  1  1  1  1  1  1  3  1  1  1  3  1  1  2  1  3  2  4
[101]  2  2  4  2  1  1  1  1  1  1  2  7 13  1  2  2  1  2  1  1  1  3  1  1  1
[126]  2  1  2  1  1  1  2  2  1  5  1  1  1  2  4  4  4  1  2  1  3  4  3  1  1
[151]  1  1  2  3  3  1  2  1 10  5  3  1  1  4  3  2  3  4  2  3  1  1  1  1  2
[176]  1  1  1  1  2  2  1  1  4  1  1  2  1  3  2  2  1  1  1  1  2  2  2  1  2
[201]  3  1  1  2  1  1  1  1  1  1  3  2  1  2  2  2  3  2  4  2  2  1  2  1  1
[226]  1  2  1 10  1  2  1  1  3  1  1  6  1  1  1  1  1  1  1  2  1  1  2  2  2
[251]  2  3  5  3  1  1  1  2  2  6  4 10  4  1  1  1  1  1  1  1  1  2  1  1  1
[276]  1  1  5  3  3  8  3  1  8  2  1  2  1  1  2  1  1  1 10  1  1  1  2  1  1
[301]  1  1  1  1  1  1  1  1  1  1  1  1  6  1  1  1  1  5  1  1  1  3  2  1  4
[326]  1  1  2  1  2  1  1  1  1  1  1  1  1  1  4  1  2  5  1  1  6  2  1  1  1
[351]  1  3  2  2  1  1  1  1  1  2  1  1  2 12  1  2  1  1  1  1  1  1  1  2  2
[376]  2  1  1  1  1  2  3  4  1  2  1  1  1  1  1  2  1  3  1  2  2  2  2  1  1
[401]  1  1  1  1  1  1  1  1  1  3  1  3  1  1  1  1  1  1  1  1  3  1  4  2  1
[426]  1  6  1  3  5  1  2  1  2  5  1  2  2  1  1

## In summary

table(runs$lengths[runs$values])
  1   2   3   4   5   6   7   8  10  12  13
267  91  34  21  12   5   1   3   4   1   1 

## What are the chances of finding groups?

To answer this question we need a different simulation fo E.coli

The histogram of GC content shows the empirical frequency

h <- hist(gc_cont, col="grey")

## We can use the frequency to simulate GC content

Looking inside the histogram

h
$breaks [1] 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56 [16] 0.58 0.60 0.62 0.64 0.66 0.68$counts
[1]   3  10  23  26  50  57  85 169 218 398 591 915 970 711 299  89  14   7   4
[20]   2

$density [1] 0.03232062 0.10773540 0.24779142 0.28011204 0.53867701 0.61409179 [7] 0.91575092 1.82072829 2.34863176 4.28786899 6.36716225 9.85778927 [13] 10.45033398 7.65998707 3.22128852 0.95884508 0.15082956 0.07541478 [19] 0.04309416 0.02154708$mids
[1] 0.29 0.31 0.33 0.35 0.37 0.39 0.41 0.43 0.45 0.47 0.49 0.51 0.53 0.55 0.57
[16] 0.59 0.61 0.63 0.65 0.67

$xname [1] "gc_cont"$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

## h is a list

names(h)
[1] "breaks"   "counts"   "density"  "mids"     "xname"    "equidist"
h$counts  [1] 3 10 23 26 50 57 85 169 218 398 591 915 970 711 299 89 14 7 4 [20] 2 h$mids
 [1] 0.29 0.31 0.33 0.35 0.37 0.39 0.41 0.43 0.45 0.47 0.49 0.51 0.53 0.55 0.57
[16] 0.59 0.61 0.63 0.65 0.67

## Using h to simulate E.coli

x <- sample(h$mids, size=4641, prob=h$counts, replace = TRUE)

## We can do it more realistic

Increasing the number of classes

h <- hist(gc_cont, nclass=100, col="grey")

## New simulation

x <- sample(h$mids, size=4641, prob=h$counts, replace = TRUE)

## Length of runs in the simulation

sim_runs <- rle(x<0.472)
table(sim_runs$lengths[sim_runs$values])
  1   2   3   4
543 102  19   5 

So, runs longer than 4 are unexpected

## Homework

Where are the long runs in the genome?

What are the genes there?