October 23, 2018

Sequencing

  • Fragments
    • Fragment size
  • Libraries
  • Adaptors, vectors
  • Reads
    • Read length
    • Single end
    • Paired end
    • Mate pairs

Sequencing

  • Base calling
  • Error rate
  • Phred Quality
  • Trimming by quality
  • Clipping adaptors/vectors/tags
  • Run
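
As a minimal sketch of how Phred quality relates to error rate and trimming: a Phred score \(Q\) encodes the base-calling error probability \(p\) as \(Q = -10\log_{10} p\). The function names and the default threshold `min_q=20` below are illustrative assumptions, not part of any particular tool, and the trimming rule shown (cut low-quality bases from the 3' end only) is just one simple strategy.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability p to a Phred score Q = -10 * log10(p)."""
    return -10 * math.log10(p)

def trim_by_quality(seq, quals, min_q=20):
    """Drop bases from the 3' end while their quality is below min_q.

    seq   -- read sequence as a string
    quals -- list of integer Phred scores, one per base
    min_q -- hypothetical threshold chosen for illustration
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]
```

For example, `trim_by_quality("ACGTAC", [30, 30, 30, 30, 10, 5])` keeps only the first four bases, since the last two fall below the threshold.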

Sequencing methods

  • Sanger
  • Illumina
  • 454
  • SOLID
  • PacBio
  • nanoball
  • Ion Torrent

Important facts

  • Run size
  • Run price
  • Read length
  • Error rate

Assembly

  • De novo vs. reference-based
  • Overlap-layout-consensus
  • Overlap
  • Overlap Threshold
  • Contigs
  • Scaffolds
  • Repeats

Lander-Waterman formula

  • Genome Length \(G\)
  • Read Length \(L\)
  • Number of reads \(N\)
  • Overlap Threshold \(T\)
  • Expected number of contigs

\[N\exp\left(-\frac{(L-T)N}{G}\right)\]
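
The formula can be evaluated directly. The sketch below is only an illustration: the function name `expected_contigs` and the example values are assumptions chosen for the demonstration, not figures from a real project.

```python
import math

def expected_contigs(G, L, N, T):
    """Lander-Waterman expected number of contigs: N * exp(-(L - T) * N / G).

    G -- genome length
    L -- read length
    N -- number of reads
    T -- overlap threshold (minimum overlap needed to join two reads)
    """
    return N * math.exp(-(L - T) * N / G)

# Illustrative values: a 1 Mb genome, 100 bp reads, 20 bp overlap threshold.
for n_reads in (10_000, 50_000, 100_000):
    print(n_reads, expected_contigs(G=1_000_000, L=100, N=n_reads, T=20))
```

With these values the expected number of contigs first grows with \(N\) and then falls as the reads start to overlap, which is the behaviour the formula is meant to capture.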

Minimum probability of misassembly

If \(T\) is too small, we risk confusing two parts of the genome

Before any biological considerations, the expected number of times we see a repeat of size \(T\) just by chance is \(G/4^T\)

Thus, the probability that two given reads match wrongly is \(4^{-T}\), and there are almost \(N^2\) pairs of reads

Expected number of chimeras: \(N^2 4^{-T}\)
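
To see how this estimate constrains \(T\), the sketch below searches for the smallest threshold that keeps the expected number of chimeras under a chosen bound. The function names and the default bound `max_chimeras=1.0` are assumptions made for illustration, and the model is the simple uniform random-sequence one used above.

```python
def expected_chimeras(N, T):
    """Expected number of chance matches among reads: N^2 * 4^(-T)."""
    return N ** 2 * 4.0 ** (-T)

def min_overlap_threshold(N, max_chimeras=1.0):
    """Smallest T such that the expected number of chimeras is <= max_chimeras.

    max_chimeras is a hypothetical tolerance picked for this example.
    """
    T = 1
    while expected_chimeras(N, T) > max_chimeras:
        T += 1
    return T

# One million reads need an overlap of about 20 bases under this model.
print(min_overlap_threshold(10 ** 6))
```

The required \(T\) grows only logarithmically with \(N\): doubling the number of reads adds one base to the threshold.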

Contig Statistics

  • Depth
  • Breadth of Coverage
  • N50
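
A short sketch of two of these statistics, under the usual definitions: N50 is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size; mean depth is the total number of sequenced bases divided by the genome length. The function names are illustrative assumptions.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("contig_lengths must not be empty")

def mean_depth(N, L, G):
    """Average depth of coverage: N reads of length L over a genome of length G."""
    return N * L / G

print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # total 54, half 27, reached at contig 8
print(mean_depth(N=100_000, L=100, G=1_000_000))
```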

File Formats

  • FASTA
  • FASTQ
  • SCF
  • SAM
  • BAM

Computational cost

  • Cost of search without index
  • Cost of search using an index
  • Cost of making an index
  • Space required to make an index
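
The trade-off listed above can be sketched with a simple k-mer index. This is only one possible kind of index, chosen here for brevity; the function names and the choice of a hash table of k-mer positions are assumptions for illustration. Without an index, every query scans the whole text; with an index, the text is scanned once to build the table (costing time and memory proportional to the text length), and each query then checks only the candidate positions.

```python
from collections import defaultdict

def naive_search(text, pattern):
    """Search without an index: compare the pattern at every position."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]

def build_kmer_index(text, k):
    """Build the index once: map each k-mer to the list of its positions."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index

def indexed_search(text, index, k, pattern):
    """Search with the index: look up the first k-mer, then verify candidates."""
    if len(pattern) < k:
        raise ValueError("pattern must be at least k characters long")
    candidates = index.get(pattern[:k], [])
    return [i for i in candidates if text[i:i + len(pattern)] == pattern]

text = "ACGTACGTTACG"
index = build_kmer_index(text, k=3)
print(naive_search(text, "ACGT"))
print(indexed_search(text, index, 3, "ACGT"))
```

Both searches return the same positions; the index pays off when many patterns are searched against the same text.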