When we report an assembly result, we describe
The assembly result depends on
In general \(L\) can be different for each read
For this class we will assume all reads have the same length
The genome length \(G\) is usually an estimate, based on wet-lab experiments (flow cytometry)
In practice, some genome parts have coverage 0
We do not see these regions
The average number of times that a particular nucleotide is represented in a collection of reads
Average Depth is sometimes called Coverage
Coverage can be calculated before sequencing, from the number of reads \(N\), the read length \(L\) and the genome length \(G\): \[ \text{Coverage} = \frac{NL}{G} \]
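The formula can be turned around to plan an experiment: given a target coverage, how many reads are required? The sketch below is only an illustration; the function names and the numbers (a 5 Mbp genome, 150 bp reads) are assumptions made for the example, not values from this class.

```python
import math

def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average depth: total sequenced bases divided by genome length (N*L/G)."""
    return n_reads * read_len / genome_len

def reads_needed(target_cov: float, read_len: int, genome_len: int) -> int:
    """Smallest number of reads whose average depth reaches target_cov."""
    return math.ceil(target_cov * genome_len / read_len)

# Hypothetical experiment: 5 Mbp genome, 150 bp reads
print(coverage(1_000_000, 150, 5_000_000))  # 30.0
print(reads_needed(30, 150, 5_000_000))     # 1000000
```

Since \(G\) is itself an estimate, the result is a planning figure, not a guarantee.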
Percentage of bases that are sequenced a given number of times
This is the percentage of the genome that we can see with our reads
More precisely, it is the percentage of the genome that has coverage over a minimum
We can only know coverage breadth after we assemble
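The difference between average depth and breadth is easier to see in a small simulation. The sketch below places reads at uniformly random positions on a made-up genome, which is an assumption; real libraries are not perfectly uniform. In a simulation we know where each read came from, which is exactly what we do not know with real data before assembly. All names and numbers are hypothetical.

```python
import random

def simulate_depth(genome_len, n_reads, read_len, seed=0):
    """Return the depth at every position after placing reads at random."""
    rng = random.Random(seed)
    depth = [0] * genome_len
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        for pos in range(start, start + read_len):
            depth[pos] += 1
    return depth

def breadth(depth, min_depth=1):
    """Percentage of positions whose depth is at least min_depth."""
    covered = sum(1 for d in depth if d >= min_depth)
    return 100 * covered / len(depth)

depth = simulate_depth(genome_len=10_000, n_reads=500, read_len=100)
print(sum(depth) / len(depth))  # average depth: N*L/G = 5.0
print(breadth(depth, 1))        # % of genome seen at least once
print(breadth(depth, 5))        # % of genome seen at least 5 times
```

Even when the average depth is 5, some positions have depth 0, so the breadth stays below 100%.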
The question we want to answer is
How many reads do we need to see the whole genome with a minimum depth?
How would you answer that question?
We only see overlaps over a threshold \(T\)
The best \(T\) has to be big enough to guarantee that the reads do not overlap by chance
Bigger values of \(T\) reduce the probability of “overlap by chance”
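A rough way to see why bigger \(T\) helps: if we assume the four bases are equally likely and independent (a simplification, real genomes are biased and repetitive), two unrelated stretches of \(T\) bases are identical with probability \((1/4)^T\). The sketch below also counts ordered read pairs, ignoring reverse complements; the function names are hypothetical.

```python
def chance_overlap_probability(t: int) -> float:
    """Probability that two unrelated T-base stretches match exactly,
    assuming uniform, independent bases."""
    return 0.25 ** t

def expected_chance_overlaps(n_reads: int, t: int) -> float:
    """Expected number of ordered read pairs (end of one, start of another)
    that match over exactly T bases by chance alone."""
    return n_reads * (n_reads - 1) * chance_overlap_probability(t)

# Hypothetical data set of one million reads
for t in (10, 15, 20, 25):
    print(t, expected_chance_overlaps(1_000_000, t))
```

With many reads the number of pairs is huge, so \(T\) has to grow with the size of the data set.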
…they are in the same Contig
A group of contiguous sequences is called a contig
The goal of the assembly process is to find one contig.
The sequence of this contig will be the genome
Most of the time we get several contigs
How many reads should you pay for?
Longer reads are better
Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988 Apr; 2(3):231-9. doi: 10.1016/0888-7543(88)90007-9. PMID: 3294162.
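The Lander–Waterman model predicts that \(N\) reads with coverage \(c = NL/G\) and minimum overlap \(T\) form about \(N e^{-c(1 - T/L)}\) contigs. The sketch below compares that prediction with a small simulation. It assumes reads of equal length placed uniformly at random, and all function names and numbers are made up for illustration.

```python
import math
import random

def merge_reads(starts, read_len, min_overlap):
    """Group equal-length reads into contigs, returned as (start, end) pairs.
    A read joins the current contig when it overlaps it by >= min_overlap."""
    starts = sorted(starts)
    contigs = []
    cur_start, cur_end = starts[0], starts[0] + read_len
    for s in starts[1:]:
        if cur_end - s >= min_overlap:
            cur_end = s + read_len          # extend the current contig
        else:
            contigs.append((cur_start, cur_end))
            cur_start, cur_end = s, s + read_len
    contigs.append((cur_start, cur_end))
    return contigs

def expected_contigs(n_reads, read_len, genome_len, min_overlap):
    """Lander-Waterman estimate: N * exp(-c * (1 - T/L))."""
    c = n_reads * read_len / genome_len
    return n_reads * math.exp(-c * (1 - min_overlap / read_len))

G, N, L, T = 100_000, 2_000, 100, 20        # hypothetical experiment
rng = random.Random(0)
starts = [rng.randrange(G - L + 1) for _ in range(N)]
contigs = merge_reads(starts, L, T)
print(len(contigs), expected_contigs(N, L, G, T))
print(sorted((end - start for start, end in contigs), reverse=True)[:5])
```

The last line prints the five longest contigs, which shows how far a typical result is from a single contig at low coverage.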
With this simulation we can also calculate the length of each contig.
Given a set of contigs, the N50 is the length of the shortest contig in the smallest set of largest contigs that together cover at least 50% of the total assembly length
We sort the contigs from largest to smallest
We identify which contig crosses the 50% line
N50 is the length of that contig
Contig length | % of total length | Cumulative % |
---|---|---|
3201 | 42.95 | 42.95 |
1324 | 17.77 | 60.72 |
1188 | 15.94 | 76.66 |
849 | 11.39 | 88.06 |
466 | 6.25 | 94.31 |
229 | 3.07 | 97.38 |
195 | 2.62 | 100.00 |
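The three steps can be written in a few lines. This is a minimal sketch using the contig lengths from the table; the function name `n50` is just a label chosen for the example.

```python
def n50(lengths):
    """Length of the contig that crosses 50% of the total assembly length."""
    ordered = sorted(lengths, reverse=True)   # largest to smallest
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:                   # this contig crosses the 50% line
            return length
    raise ValueError("no contigs given")

contigs = [3201, 1324, 1188, 849, 466, 229, 195]
print(n50(contigs))  # 1324
```

The first contig holds 42.95% of the total, so the second one (1324) is the one that crosses 50%.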
If a fragment has one read in Contig 1 and the other in Contig 2, then we know that the contigs are close
More shared fragments give more confidence to the scaffold
Shared fragments allow us to find the relative orientation of contigs, and make a scaffold of contigs and gaps
Given the information we have, there are at least two solutions
We cannot do better with the available information
To decide the correct one, we need more information
For example, longer reads containing the repeat and its context
Or read pairs from larger fragments, so each read is outside the repeat
Assemblers based on Overlap–Layout–Consensus cannot handle repeats
Moreover, they can handle only a few thousand reads
NGS produces millions of reads, so a different approach was developed
Since a read has hundreds of bp, it can contain part of a repeat
It is better to use shorter sequences (𝑘-mers), that are either inside or outside the repeat
𝑘 is typically between 20 and 130, and can be chosen automatically
ATGCATATATAGCA
ATG
TGC
GCA
CAT
ATA
TAT
ATA
TAT
ATA
TAG
AGC
GCA
ATGCATATATAGCA
ATG
TGC
GCA
CAT
ATA
TAT
TAG
AGC
GCA
Each 𝑘-mer is a node in a directed graph
Two nodes are connected if the last (𝑘-1) bp of the first 𝑘-mer are the same as the first (𝑘-1) bp of the second one
ATGCATATATAGCA
ATG TGC GCA CAT ATA TAT TAG AGC GCA
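The 𝑘-mer lists and the connection rule above can be reproduced with a short script. This is a sketch of the rule as stated here (nodes are 𝑘-mers, linked by a (𝑘-1) overlap), not the code of any particular assembler; the function names are hypothetical.

```python
from collections import defaultdict

def kmers(seq, k):
    """All k-mers of seq, in order, repeats included."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def overlap_graph(nodes):
    """Connect a -> b when the last k-1 bases of a equal the first k-1 of b."""
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)
    return {node: sorted(by_prefix.get(node[1:], [])) for node in nodes}

seq = "ATGCATATATAGCA"
all_kmers = kmers(seq, 3)
unique = sorted(set(all_kmers))
graph = overlap_graph(unique)
print(len(all_kmers), len(unique))  # 12 8
print(graph["ATA"])                 # ['TAG', 'TAT']
print(graph["TAT"])                 # ['ATA', 'ATG']
```

The repeat shows up as the cycle between ATA and TAT. Note that the overlap rule alone also links TAT to ATG, a step that never occurs in the sequence; assemblers usually keep only the connections actually observed in reads.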
This approach does not solve the repeats
Instead, it shows the repeats clearly
This way we know what the issues are, and we can design an experiment (PCR?) to solve them
Like FASTA, but also showing the graph structure. For example:
>EDGE_641517_length_474_cov_1.855908;
AACACTGATTGCCTCCCCCCCGTTGATGGGTAAAATAGCCGCAATTTTTCGTTTTCAACA
[…]
GCTGCCTGATGGTTATCGACGCTGCAAAAGGTGTTGAAGATCGTACCCGTAAGC
>EDGE_621787_length_514_cov_1.860465';
TGTCGATGCGGTGTACATTGTGGCAACGCCGGGTGAAATCGCTTTTATCAAACCGATGAT
[…]
TGGCTGGAAGGCAAAGGACTGCGGTTTATCGCCG
>EDGE_678376_length_822_cov_333.633094:EDGE_679076_length_4092_cov_132.576797',EDGE_679634_length_28752_cov_122.881432;
GGCACTGTTGCAAATAGTCGGTGGTGATAAACTTATCATCCCCTTTTGCTGATGGAGCTG
[…]
AGACAAAAGGCTGCCTCATCGCTAACTTTGCAACAGTGCCGG
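The header lines carry the graph. The sketch below reads them assuming the layout seen in the examples: an edge name with its length and coverage, an optional `'` marking the reverse strand, then a colon and a comma-separated list of the edges that follow, closed by a semicolon. This layout is inferred from the examples, and the function names are hypothetical.

```python
import re

EDGE = re.compile(
    r"EDGE_(?P<id>\d+)_length_(?P<length>\d+)_cov_(?P<cov>\d+(?:\.\d+)?)(?P<rc>'?)"
)

def parse_edge(text):
    """Split one edge name into id, length, coverage and strand."""
    m = EDGE.fullmatch(text)
    if m is None:
        raise ValueError(f"unrecognized edge name: {text!r}")
    return {
        "id": m["id"],
        "length": int(m["length"]),
        "coverage": float(m["cov"]),
        "reverse": m["rc"] == "'",   # trailing ' = reverse strand (assumed)
    }

def parse_header(line):
    """Return (edge, list of following edges) for one header line."""
    body = line.strip().lstrip(">").rstrip(";")
    edge, _, neighbours = body.partition(":")
    following = [parse_edge(n) for n in neighbours.split(",")] if neighbours else []
    return parse_edge(edge), following

header = (">EDGE_678376_length_822_cov_333.633094:"
          "EDGE_679076_length_4092_cov_132.576797',"
          "EDGE_679634_length_28752_cov_122.881432;")
edge, following = parse_header(header)
print(edge["id"], edge["length"], [f["id"] for f in following])
```

The high coverage of this edge (about 333, against about 130 for the edges that follow) is the kind of signal that suggests a repeat.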
Bandage is a program to visualize an assembly graph
Example from The New York Times: “Team of Rival Scientists Comes Together to Fight Zika”. March 30, 2016