# Assembly statistics

## Assembly Statistics

When we report an assembly result, we describe

• Number of contigs
• Depth of coverage
• N50

## Assembly input parameters

The assembly result depends on

• Length of the Genome: $$G$$
• Length of reads: $$L$$
• Number of reads: $$N$$

In general $$L$$ can be different for each read

For this class we will assume all reads have the same length

The genome length $$G$$ is usually an estimation, based on wet-lab experiments (flow cytometry)

## Depth is a property of each genome position

In practice, some genome parts have coverage 0

We do not see these regions

## Average Depth

The average number of times that a particular nucleotide is represented in a collection of reads

Average Depth is sometimes called Coverage

Depth average is also known as coverage

Coverage can be calculated before sequencing $\text{Coverage} = \frac{NL}{G}$

Percentage of bases that are sequenced a given number of times

Example
genome sequencing 30× average depth can achieve a 95% breadth of coverage of the reference genome at a minimum depth of ten reads

This is the percentage of the genome that we can see with our reads

More precisely, it is the percentage of the genome that has coverage over a minimum

We can only know coverage breadth after we assembly

## Simulate to get a confidence interval

The question we want to answer is

How many reads we need to see all the genome with a minimum depth?

How would you answer that question?

## Two reads connect when they overlap

We only see overlaps over a threshold $$T$$

The best $$T$$ has to be big enough to guarantee that the reads do not overlap by chance

Bigger values of $$T$$ reduce the probability of “overlap by chance”

…they are in the same Contig

A group of contiguous sequences is called Contig

The goal of the assembly process is to find one contig.

The sequence of this contig will be the genome

Most of times we do get several contigs

## Num. contigs depends on 𝐺, 𝐿, 𝑁 and 𝑇

• 𝐿 =100
• 𝐺 =106
• 𝑇 =20

How many reads shall you pay?

• 𝐿 =300
• 𝐺 =106
• 𝑇 =20

## This is described in the paper

Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988 Apr; 2(3):231-9. doi: 10.1016/0888-7543(88)90007-9. PMID: 3294162.

## Contig length

With this simulation we can also calculate the length of each contig.

## N50

Given a set of contigs, the N50 is defined as the sequence length of the shortest contig at 50% of the total genome length

We sort the contigs from largest to smallest

## When do we get 50%?

We identify which contig crosses the 50% line

N50 is the length of that contig

## Which contig reach 50% of bases?

sorted_len pct_in_contig cumulative
3201 42.95 42.95
1324 17.77 60.72
1188 15.94 76.66
849 11.39 88.06
466 6.25 94.31
229 3.07 97.38
195 2.62 100.00

# Assembly quality

## Each fragment has two reads

• Both reads should point to each other (AF, AR)
• If they point in the wrong direction, it is bad
• The distance between both reads should not be too large or too small

## Scaffolding

If a fragment has one read in Contig 1 and the other in Contig 2, then we know that the contigs are close

## Scaffolding

More shared fragments gives more confidence to the scaffold

## Scaffolding

Shared fragments allow us to find the relative orientation of contigs, and make a scaffold of contigs and gaps

## The assembly can be wrong

Given the information we have, there are at least two solutions

We cannot do better with the available information

For example, longer reads containing the repeat and its context

Or read pairs from larger fragments, so each read is outside the repeat

# Another way to assemble

## The problem

Assemblers based on Overlay–Layout–Consensus cannot handle repeats

Moreover, they can handle only a few thousand reads

NGS produces millions of reads, so a different approach was developed

Since a read has hundreds of bp, it can contain part of a repeat

It is better to use shorter sequences (𝑘-mers), that are either inside or outside the repeat

𝑘 is typically between 20 and 130, and can be chosen automatically

## Each read is translated into a list of 𝑘-mers

ATGCATATATAGCA
ATG
TGC
GCA
CAT
ATA
TAT
ATA
TAT
ATA
TAG
AGC
GCA

## We do not count repeated 𝑘-mers

ATGCATATATAGCA
ATG
TGC
GCA
CAT
ATA
TAT
TAG
AGC
GCA

## De Bruijn’s graphs

Each 𝑘-mer is a node in a directed graph

Two nodes are connected if the last (𝑘-1) bp of the first 𝑘-mer are the same as the first (𝑘-1) bp of the second one

ATGCATATATAGCA
ATG TGC GCA CAT ATA TAT TAG AGC GCA

## We get an honest assembly

This approach does not solve the repeats

Instead, it shows the repeats clearly

This way we know what are the issues, and we can design an experiment (PCR?) to solve them

## Output format: FASTG

Like FAST, but showing also the graph structure. For example

>EDGE_641517_length_474_cov_1.855908;
AACACTGATTGCCTCCCCCCCGTTGATGGGTAAAATAGCCGCAATTTTTCGTTTTCAACA
[…]
GCTGCCTGATGGTTATCGACGCTGCAAAAGGTGTTGAAGATCGTACCCGTAAGC
>EDGE_621787_length_514_cov_1.860465';
TGTCGATGCGGTGTACATTGTGGCAACGCCGGGTGAAATCGCTTTTATCAAACCGATGAT
[…]
TGGCTGGAAGGCAAAGGACTGCGGTTTATCGCCG
>EDGE_678376_length_822_cov_333.633094:EDGE_679076_length_4092_cov_132.576797',EDGE_679634_length_28752_cov_122.881432;
GGCACTGTTGCAAATAGTCGGTGGTGATAAACTTATCATCCCCTTTTGCTGATGGAGCTG
[…]
AGACAAAAGGCTGCCTCATCGCTAACTTTGCAACAGTGCCGG

## Bandage: Look and edit the graph

Bandage is a program to visualize a assembly graph

Example from The New York Times: “Team of Rival Scientists Comes Together to Fight Zika”. March 30, 2016