September 28th, 2018

Depth

The average number of times that a particular nucleotide is represented in a collection of reads

Average Depth is sometimes called Coverage

Breadth of Coverage

Percentage of bases that are sequenced a given number of times

Example
genome sequencing 30× average depth can achieve a 95% breadth of coverage of the reference genome at a minimum depth of ten reads

Let’s simulate it in R

Assembly Statistics

The number of contigs depends on

  • Genome length: G
  • Length of reads: L
  • Number of reads: N

In general L can be different for each read. In this simulation we will assume all reads have the same length

In Pictures

Example values for G, N, and L

  • Genome: Only one chromosome, length G, let’s say a virus
G <- 1000
  • reads: N of them, let’s say 10, each of length L
N <- 10
L <- 100

We already know the coverage

The total number of nucleotides we got is

N*L
[1] 1000

Thus, the average depth (coverage) is

N*L/G
[1] 1

Reads are randomly taken from the genome

  • Each read has a start and end position on the genome
  • We take start randomly, sort it, then we calculate end
start <- sample.int(G, size=N)
end <- start + L

Each read has start and end

Exercise: calculate depth

Initially the genome is not covered

depth <- rep(0, G)
par(mar=c(7,4,2,2)+0.1)
plot(depth, type = "l", ylim=c(0,5))

Then we read the first fragment

read_pos <- start[1]:min(end[1], G)
depth[read_pos] <- depth[read_pos] + 1
plot(depth, type = "l")

We do the same with the rest of the reads

for(i in 2:N) {
    # we assume Linear Chromosome
    read_pos <- start[i]:min(end[i], G)
    depth[read_pos] <- depth[read_pos] + 1
}

Sometimes end[i] can be greater than G. Then part of the read is outside the chromosome. We only see the inside part.

How would you handle a circular genome?

Sequencing depth with all reads

plot(depth, type = "l")

Different parts have different depth

barplot(table(depth), xlab="depth",ylab="Num bases")

If depth is 0, then we did not see that part of the genome

Breadth of Coverage: how much did we see?

What percentage of the genome did we see?

sum(depth > 0) / G
[1] 0.659

In this case we use the theoretical value of G, since we do not know the real genome (yet)

Breadth of Coverage for other depths

What percentage of the genome has depth 2 or more?

sum(depth >= 2) / G
[1] 0.22