November 26, 2019

What is the question?

We have reads that are close related to a reference genome

For example, RNA messenger from the same individual

Or DNA from a new individual from the same species

Or DNA from a species on the same genus

Using bwa

As we discussed previously, bwa uses an index to speedup the search

So the first step is

bwa index yeast.fasta

List the files in your folder and verify that there are index files

Doing the real alignment

Let’s first align only one set of reads

$ bwa mem yeast.fasta y1.fastq -o out1.sam

The output is in Sequence Alignment/Map file format

We use the extension .sam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 25000 sequences (900000 bp)...
[M::mem_process_seqs] Processed 25000 reads in 0.867 CPU sec, 0.867 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -o out1.sam yeast.fasta y1.fastq
[main] Real time: 0.961 sec; CPU: 0.939 sec

What is inside the .sam file?

@SQ     SN:XI   LN:666816
@SQ     SN:IX   LN:439888
@SQ     SN:IV   LN:1531933
@SQ     SN:III  LN:316620
@SQ     SN:X    LN:745751
@SQ     SN:XII  LN:1078177
@SQ     SN:VIII LN:562643
@SQ     SN:VI   LN:270161
@SQ     SN:V    LN:576874
@SQ     SN:II   LN:813184
@SQ     SN:XIV  LN:784333
@SQ     SN:XIII LN:924431
@SQ     SN:XVI  LN:948066
@SQ     SN:XV   LN:1091291
@SQ     SN:I    LN:230218
@SQ     SN:Mito LN:85779
@SQ     SN:VII  LN:1090940
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem yeast.fasta y1.fastq -o out1.sam
SRR507778.1     0       Mito    20158   60      36M     *       0       0       ANTATAATATTATCCCCACGAGGGCCACACATGTGT    ?#?;<BBB?BBGGEGGGGDDE8EEDB:?<=?BB?B7    NM:i:1  MD:Z:1A34       AS:i:34 XS:i:0
SRR507778.2     16      XV      767197  60      36M     *       0       0       ACCCATTTCATAATTAATAAGTGGTATATCAGGANG    <?DGEDBHHHHHHHDHFGFHFGFGHG@@A=@;;7#:    NM:i:1  MD:Z:34C1       AS:i:34 XS:i:0
SRR507778.3     0       II      569438  60      36M     *       0       0       GNAAGAACCCTTATACTAACAGCAGCATAGGAAAGT    ;#>45<4454GEDGGGGGGECEACBBBB=FGADAG8    NM:i:1  MD:Z:1C34       AS:i:34 XS:i:0

What does this mean?

Lines starting with @ are metadata

They describe the conditions of the alignment

Line like this

@SQ     SN:XI   LN:666816
@SQ     SN:IX   LN:439888
@SQ     SN:VII  LN:1090940

describe the sequence

What does this mean?

Line like this

@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem yeast.fasta
 y1.fastq -o out1.sam

describe how the program was executed. It includes

  • ID: id of the command
  • PN: program name
  • VN: version number
  • CL: command line

What is the alignment data?

Many lines like this:

SRR507778.1     0       Mito    20158   60      36M     *       0       0
        ANTATAATATTATCCCCACGAGGGCCACACATGTGT
        ?#?;<BBB?BBGGEGGGGDDE8EEDB:?<=?BB?B7
        NM:i:1  MD:Z:1A34       AS:i:34 XS:i:0
SRR507778.2     16      XV      767197  60      36M     *       0       0
        ACCCATTTCATAATTAATAAGTGGTATATCAGGANG
        <?DGEDBHHHHHHHDHFGFHFGFGHG@@A=@;;7#:
        NM:i:1  MD:Z:34C1       AS:i:34 XS:i:0
SRR507778.3     0       II      569438  60      36M     *       0       0
        GNAAGAACCCTTATACTAACAGCAGCATAGGAAAGT
        ;#>45<4454GEDGGGGGGECEACBBBB=FGADAG8
        NM:i:1  MD:Z:1C34       AS:i:34 XS:i:0

What is the alignment data?

Sequence Alignment/Map (SAM) format is TAB-delimited.

1    QNAME   Query template/pair NAME
2    FLAG    bitwise FLAG
3    RNAME   Reference sequence NAME
4    POS     1-based leftmost POSition/coordinate of clipped sequence
5    MAPQ    MAPping Quality (Phred-scaled)
6    CIGAR   extended CIGAR string
7    MRNM    Mate Reference sequence NaMe (`=' if same as RNAME)
8    MPOS    1-based Mate POSistion
9    TLEN    inferred Template LENgth (insert size)
10   SEQ     query SEQuence on the same strand as the reference
11   QUAL    query QUALity (ASCII-33 gives the Phred base quality)
12+  OPT     variable OPTional fields in the format TAG:VTYPE:VALUE

What is the flag?

Each bit in the FLAG field is defined as:

1    p   the read is paired in sequencing
2    P   the read is mapped in a proper pair
4    u   the query sequence itself is unmapped
8    U   the mate is unmapped
16   r   strand of the query (1 for reverse)
32   R   strand of the mate
64   1   the read is the first read in a pair
128  2   the read is the second read in a pair
256  s   the alignment is not primary
512  f   the read fails platform/vendor quality checks
1024 d   the read is either a PCR or an optical duplicate
2048 S   the alignment is supplementary

What is a bit flag?

The FLAG column is a number

It is the sum of numbers on the previous table

For example

SRR507778.1     0       ...
SRR507778.2     16      ....

The first read is not paired, and aligned in the Forward strand

The second read is not paired, and aligned in the Reverse strand

Why there are only 0 and 16 ?

Most of the flags only make sense when we combine both read files

Let’s do that

bwa mem yeast.fasta y1.fastq y2.fastq > out2.sam

Screen messages

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 50000 sequences (1800000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (103, 4007, 13219, 91)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (240, 2467, 3114)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 8862)
[M::mem_pestat] mean and std.dev: (1893.02, 1414.42)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 11736)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (243, 275, 314)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (101, 456)
[M::mem_pestat] mean and std.dev: (278.34, 50.30)
[M::mem_pestat] low and high boundaries for proper pairs: (30, 527)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (3111, 3320, 3565)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (2203, 4473)
[M::mem_pestat] mean and std.dev: (3352.46, 321.09)
[M::mem_pestat] low and high boundaries for proper pairs: (1749, 4927)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (291, 2611, 3188)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 8982)
[M::mem_pestat] mean and std.dev: (2036.04, 1501.65)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 11879)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RR
[M::mem_process_seqs] Processed 50000 reads in 6.397 CPU sec, 6.396 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem yeast.fasta y1.fastq y2.fastq
[main] Real time: 6.557 sec; CPU: 6.503 sec

New SAM file (first 4 lines)

SRR507778.1     99      Mito    20158   60   36M  =  16941   -3183
   ANTATAATATTATCCCCACGAGGGCCACACATGTGT
   ?#?;<BBB?BBGGEGGGGDDE8EEDB:?<=?BB?B7
   NM:i:1 MD:Z:1A34 MC:Z:36M AS:i:34 XS:i:0
SRR507778.1     147     Mito    16941   60   36M  =  20158   3183
    TTCATAGTACCCAAATTTAATTTAAATAAAGTGAGA
    GFFCFDDG<DIIGHIIEEE>EBAGG@GBDDDBG??B
    NM:i:0 MD:Z:36 MC:Z:36M  AS:i:36 XS:i:0
SRR507778.2     83      XV      767197  60   36M  =  767020  -213
    ACCCATTTCATAATTAATAAGTGGTATATCAGGANG
    <?DGEDBHHHHHHHDHFGFHFGFGHG@@A=@;;7#:
    NM:i:1 MD:Z:34C1 MC:Z:36M AS:i:34 XS:i:0
SRR507778.2     163     XV      767020  60   36M  =  767197  213
    CTAGTCATTTTTCTAACACCGAAATAACGGCAGAGC
    GIIIIG:IIIIIIIIIHIIIGG?GDEGDGGHIHBII
    NM:i:0  MD:Z:36 MC:Z:36M AS:i:36 XS:i:0

Now FLAGS is more interesting

99 = 64+32+2+1
  • 1 : the read is paired in sequencing
  • 2 : the read is mapped in a proper pair
  • 32 : strand of the mate
  • 64 : the read is the first read in a pair
147 = 128+16+2+1
  • 1 : the read is paired in sequencing
  • 2 : the read is mapped in a proper pair
  • 16 : reverse strand
  • 128 : the read is the second read in a pair

What do we mean by “Paired”?

Each read has its “pair”

fragment      ========================================
SE read       --------->
PE reads    R1--------->                    <---------R2
unknown gap             ....................
SE
Single end read
PE
Paired end read

How does BWA know which reads are part of a pair?

The two FASTQ files need to have the same number of reads

The first read of one file matches the first read of the other file

The n-th read of one file matches the n-th read of the other file

What do we mean by “proper paired”?

Basically

  • Both pairs are in the same chromosome
  • both pairs are in opposite strands
  • both pairs are facing each other
Good    ---->      <----
bad     <----      ---->
bad     ---->      ---->

Now FLAGS is more interesting

83 = 64+16+2+1
  • 1 : the read is paired in sequencing
  • 2 : the read is mapped in a proper pair
  • 16 : reverse strand
  • 64 : the read is the first read in a pair
163 = 128+32+2+1
  • 1 : the read is paired in sequencing
  • 2 : the read is mapped in a proper pair
  • 32 : strand of the mate
  • 128 : the read is the second read in a pair