When we report an assembly result, we describe
The assembly result depends on
In general \(L\) can be different for each read
For this class we will assume all reads have the same length
The genome length \(G\) is usually an estimate, based on wet-lab experiments such as flow cytometry


In practice, some genome parts have coverage 0
We do not see these regions

The average number of times that a particular nucleotide is represented in a collection of reads
Average depth is also known as coverage
Coverage can be calculated before sequencing \[ \text{Coverage} = \frac{NL}{G} \]
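The formula can be checked with a few lines of Python. This is a minimal sketch: the function names and the example numbers (a 5 Mbp genome, 150 bp reads) are assumptions chosen for illustration, not values from the course.

```python
import math

def coverage(n_reads: int, read_length: int, genome_length: int) -> float:
    """Average depth: total sequenced bases (N * L) divided by genome length G."""
    return n_reads * read_length / genome_length

def reads_for_coverage(target: float, read_length: int, genome_length: int) -> int:
    """Smallest number of reads whose average depth reaches the target."""
    return math.ceil(target * genome_length / read_length)

# Hypothetical experiment: 1 million reads of 150 bp on a 5 Mbp genome.
example_cov = coverage(1_000_000, 150, 5_000_000)        # 30.0
example_reads = reads_for_coverage(30, 150, 5_000_000)   # 1000000
```

Inverting the formula this way gives only the average depth; it says nothing about how evenly the reads are spread over the genome.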
Percentage of bases that are sequenced a given number of times
This is the percentage of the genome that we can see with our reads
More precisely, it is the percentage of the genome that has coverage over a minimum
We can only know coverage breadth after we assemble the reads
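Once the position of each read is known, breadth can be computed from the per-base depth. The sketch below assumes a toy genome of 10 bp and three reads of length 3 with known start positions; all names and numbers are hypothetical.

```python
def depth_from_reads(genome_length, read_starts, read_length):
    """Per-base depth for reads of equal length placed at known start positions."""
    depths = [0] * genome_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_length)):
            depths[pos] += 1
    return depths

def coverage_breadth(depths, min_depth=1):
    """Fraction of positions whose depth is at least min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)

# Toy example: reads start at positions 0, 2 and 6.
toy_depths = depth_from_reads(10, [0, 2, 6], 3)
breadth_1x = coverage_breadth(toy_depths, min_depth=1)   # 0.8
breadth_2x = coverage_breadth(toy_depths, min_depth=2)   # 0.1
```

In this toy case the average depth is 0.9, yet two positions have depth 0, which is the situation described earlier: some parts of the genome are not seen at all.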


The question we want to answer is
How many reads do we need to see the whole genome with a minimum depth?
How would you answer that question?
We only see overlaps over a threshold \(T\)
The best \(T\) has to be big enough to guarantee that the reads do not overlap by chance
Bigger values of \(T\) reduce the probability of “overlap by chance”
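A rough way to see why bigger \(T\) helps: under the simplifying assumption that bases are independent and the four nucleotides are equally likely (real genomes are not like this), two unrelated stretches of length \(T\) match with probability \(4^{-T}\). The function names below are hypothetical, and the count of read pairs is only a crude upper-level estimate.

```python
def chance_match_probability(t: int) -> float:
    """Probability that two unrelated random sequences agree on t bases.

    Assumes independent, uniformly distributed nucleotides.
    """
    return 0.25 ** t

def expected_chance_overlaps(n_reads: int, t: int) -> float:
    """Expected number of spurious suffix-prefix matches of length t.

    Counts every ordered pair of distinct reads once (suffix of one read
    against prefix of the other); a crude estimate under the same assumption.
    """
    return n_reads * (n_reads - 1) * chance_match_probability(t)

# With one million reads, compare a short and a longer threshold.
spurious_t10 = expected_chance_overlaps(1_000_000, 10)
spurious_t30 = expected_chance_overlaps(1_000_000, 30)
```

Under this model the expected number of chance overlaps drops by a factor of 4 for every extra base required in the overlap.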
…they are in the same Contig
A group of contiguous sequences is called a contig
The goal of the assembly process is to find one contig.
The sequence of this contig will be the genome
Most of the time we get several contigs

How many reads should you pay for?

Longer reads are better
Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988 Apr; 2(3):231-9. doi: 10.1016/0888-7543(88)90007-9. PMID: 3294162.

With this simulation we can also calculate the length of each contig.
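A simulation of this kind can be sketched as follows. This is not the simulation used for the slides, only one possible version under stated assumptions: reads have equal length, start positions are uniform on the genome, and two neighbouring reads join the same contig when they overlap by at least \(T\) bases. All function and parameter names are hypothetical.

```python
import random

def contig_lengths_from_starts(starts, read_length, min_overlap):
    """Merge reads (given by sorted start positions) into contigs.

    A read extends the current contig when it overlaps the contig's end
    by at least min_overlap bases; otherwise it opens a new contig.
    """
    if not starts:
        return []
    lengths = []
    contig_start = starts[0]
    contig_end = starts[0] + read_length
    for s in starts[1:]:
        if contig_end - s >= min_overlap:
            contig_end = max(contig_end, s + read_length)
        else:
            lengths.append(contig_end - contig_start)
            contig_start, contig_end = s, s + read_length
    lengths.append(contig_end - contig_start)
    return lengths

def simulate_contig_lengths(genome_length, n_reads, read_length, min_overlap, seed=0):
    """Place reads uniformly at random and return the resulting contig lengths."""
    rng = random.Random(seed)
    last_start = genome_length - read_length
    starts = sorted(rng.randint(0, last_start) for _ in range(n_reads))
    return contig_lengths_from_starts(starts, read_length, min_overlap)

# Hypothetical run: 10 kbp genome, 500 reads of 100 bp, threshold T = 20.
sim_lengths = simulate_contig_lengths(10_000, 500, 100, 20)
```

Repeating the run with different numbers of reads shows how the number of contigs and their lengths change with coverage.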

Given a set of contigs, the N50 is defined as the sequence length of the shortest contig at 50% of the total assembly length (the sum of all contig lengths)
We sort the contigs from largest to smallest


We identify which contig crosses the 50% line
N50 is the length of that contig
| sorted_len | pct_in_contig | cumulative | 
|---|---|---|
| 3201 | 42.95 | 42.95 | 
| 1324 | 17.77 | 60.72 | 
| 1188 | 15.94 | 76.66 | 
| 849 | 11.39 | 88.06 | 
| 466 | 6.25 | 94.31 | 
| 229 | 3.07 | 97.38 | 
| 195 | 2.62 | 100.00 | 
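The procedure can be written in a few lines. This sketch assumes the 50% line is taken relative to the sum of the contig lengths, as the percentages in the table suggest; the function name `n50` is chosen here for illustration.

```python
def n50(contig_lengths):
    """Length of the contig that crosses 50% of the total contig length."""
    ordered = sorted(contig_lengths, reverse=True)   # largest to smallest
    half = sum(ordered) / 2
    cumulative = 0
    for length in ordered:
        cumulative += length
        if cumulative >= half:
            return length
    raise ValueError("contig_lengths must not be empty")

# Contig lengths from the table above.
table_lengths = [3201, 1324, 1188, 849, 466, 229, 195]
table_n50 = n50(table_lengths)   # 1324: the second contig crosses the 50% line
```

The largest contig alone reaches only 42.95%, so the second contig is the one that crosses the line.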

If a fragment has one read in Contig 1 and the other in Contig 2, then we know that the contigs are close

More shared fragments give more confidence in the scaffold

Shared fragments allow us to find the relative orientation of contigs, and make a scaffold of contigs and gaps 
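Counting shared fragments is straightforward once each read of a pair has been assigned to a contig. The sketch below assumes that assignment is already available as a list of contig-name pairs; it ignores read orientation and gap size, and every name in it is hypothetical.

```python
from collections import Counter

def count_links(pair_placements):
    """Count fragments whose two reads fall in different contigs.

    pair_placements: iterable of (contig_of_read_1, contig_of_read_2).
    Pairs inside a single contig carry no scaffolding information.
    """
    links = Counter()
    for a, b in pair_placements:
        if a != b:
            links[tuple(sorted((a, b)))] += 1
    return links

def supported_links(links, min_pairs=2):
    """Keep only contig pairs supported by at least min_pairs fragments."""
    return {pair: n for pair, n in links.items() if n >= min_pairs}

# Toy placements for four fragments.
toy_pairs = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3")]
toy_links = count_links(toy_pairs)
toy_supported = supported_links(toy_links)
```

Requiring a minimum number of shared fragments is one simple way to express that more shared fragments mean more confidence.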

Given the information we have, there are at least two solutions

We cannot do better with the available information
To decide the correct one, we need more information
For example, longer reads containing the repeat and its context

Or read pairs from larger fragments, so each read is outside the repeat
Assemblers based on Overlap–Layout–Consensus cannot handle repeats
Moreover, they can handle only a few thousand reads
NGS produces millions of reads, so a different approach was developed
Since a read has hundreds of bp, it can contain part of a repeat
It is better to use shorter sequences (𝑘-mers), that are either inside or outside the repeat
𝑘 is typically between 20 and 130, and can be chosen automatically
ATGCATATATAGCA
ATG       
 TGC      
  GCA     
   CAT    
    ATA   
     TAT  
      ATA 
       TAT
        ATA
         TAG
          AGC
           GCA

ATGCATATATAGCA
ATG       
 TGC      
  GCA     
   CAT    
    ATA   
     TAT  
         TAG
          AGC
           GCA

Each 𝑘-mer is a node in a directed graph
Two nodes are connected if the last (𝑘-1) bp of the first 𝑘-mer are the same as the first (𝑘-1) bp of the second one
ATGCATATATAGCA
ATG TGC GCA CAT ATA TAT TAG AGC GCA

This approach does not solve the repeats
Instead, it shows the repeats clearly
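The construction described above can be sketched for the example sequence with 𝑘 = 3. This is an illustration of the definition given here (nodes are distinct 𝑘-mers, edges follow the (𝑘-1) overlap rule), not the data structure of any particular assembler; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def kmers(sequence, k):
    """All k-mers of the sequence, in order, including repeated ones."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def kmer_graph(sequence, k):
    """Directed graph: node -> nodes whose first k-1 bases equal its last k-1."""
    nodes = sorted(set(kmers(sequence, k)))
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)
    return {node: list(by_prefix.get(node[1:], [])) for node in nodes}

example_seq = "ATGCATATATAGCA"
example_kmers = kmers(example_seq, 3)     # 12 k-mers
example_graph = kmer_graph(example_seq, 3)  # 8 distinct nodes
# k-mers seen more than once point at the repeated regions.
repeated = {km: n for km, n in Counter(example_kmers).items() if n > 1}
```

In this graph ATA points to TAT and TAT points back to ATA. That cycle is how the repeat becomes visible: the graph shows that a loop exists, but not how many times the genome goes around it.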

This way we know what the issues are, and we can design an experiment (PCR?) to solve them

Like FASTA, but also showing the graph structure. For example:
>EDGE_641517_length_474_cov_1.855908;
AACACTGATTGCCTCCCCCCCGTTGATGGGTAAAATAGCCGCAATTTTTCGTTTTCAACA
[…]
GCTGCCTGATGGTTATCGACGCTGCAAAAGGTGTTGAAGATCGTACCCGTAAGC
>EDGE_621787_length_514_cov_1.860465';
TGTCGATGCGGTGTACATTGTGGCAACGCCGGGTGAAATCGCTTTTATCAAACCGATGAT
[…]
TGGCTGGAAGGCAAAGGACTGCGGTTTATCGCCG
>EDGE_678376_length_822_cov_333.633094:EDGE_679076_length_4092_cov_132.576797',EDGE_679634_length_28752_cov_122.881432;
GGCACTGTTGCAAATAGTCGGTGGTGATAAACTTATCATCCCCTTTTGCTGATGGAGCTG
[…]
AGACAAAAGGCTGCCTCATCGCTAACTTTGCAACAGTGCCGG

Bandage is a program to visualize an assembly graph
Example from The New York Times: “Team of Rival Scientists Comes Together to Fight Zika”. March 30, 2016
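The graph structure in the header lines of the sequence example above can be read with a small parser. This sketch assumes the layout seen in that example: an edge name, an optional `'` suffix, an optional `:` followed by a comma-separated list of neighbouring edges, and a closing `;`. Reading `'` as "reverse complement" is an assumption based on how this kind of format is commonly described. The function names are hypothetical.

```python
import re

# Edge name as it appears in the example headers, with an optional ' suffix.
EDGE_RE = re.compile(r"EDGE_(\d+)_length_(\d+)_cov_([0-9.]+)('?)")

def parse_edge(name):
    """Split one edge name into id, length, coverage and the ' flag."""
    match = EDGE_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected edge name: {name!r}")
    edge_id, length, cov, flag = match.groups()
    return {"id": edge_id, "length": int(length),
            "cov": float(cov), "reverse": flag == "'"}

def parse_header(line):
    """Return (edge, neighbours) for one header line of the example."""
    body = line.strip().lstrip(">").rstrip(";")
    source, _, targets = body.partition(":")
    neighbours = [parse_edge(t) for t in targets.split(",")] if targets else []
    return parse_edge(source), neighbours

header = (">EDGE_678376_length_822_cov_333.633094:"
          "EDGE_679076_length_4092_cov_132.576797',"
          "EDGE_679634_length_28752_cov_122.881432;")
edge, neighbours = parse_header(header)
```

For the third header of the example this yields one edge of length 822 with two outgoing neighbours, which is the information a viewer needs to draw the graph.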
