We have reads that are close related to a reference genome
For example, RNA messengers
Or DNA from a new individual from the same species
Or DNA from a species on the same genus
Or reads that we want to map to an assembled genome
The answer depends on the case
There are several tool for the same goal
Two tools that are popular today are:
They have a similar philosophy
Can you find others?
bwa
First, it makes an index of the genome
bwa index ref.fa
Then it aligns the reads to the genome
bwa mem ref.fa reads.fq > aln-se.sam
We use the extension .sam
. These are large text files
Sometimes .sam
files are encoded in smaller .bam
binary format files
.sam
file?@SQ SN:XI LN:666816
@SQ SN:IX LN:439888
@SQ SN:IV LN:1531933
@SQ SN:III LN:316620
@SQ SN:X LN:745751
@SQ SN:XII LN:1078177
@SQ SN:VIII LN:562643
@SQ SN:VI LN:270161
@SQ SN:V LN:576874
@SQ SN:II LN:813184
@SQ SN:XIV LN:784333
@SQ SN:XIII LN:924431
@SQ SN:XVI LN:948066
@SQ SN:XV LN:1091291
@SQ SN:I LN:230218
@SQ SN:Mito LN:85779
@SQ SN:VII LN:1090940
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem yeast.fasta y1.fastq -o out1.sam
SRR507778.1 0 Mito 20158 60 36M * 0 0 ANTATAATATTATCCCCACGAGGGCCACACATGTGT ?#?;<BBB?BBGGEGGGGDDE8EEDB:?<=?BB?B7 NM:i:1 MD:Z:1A34 AS:i:34 XS:i:0
SRR507778.2 16 XV 767197 60 36M * 0 0 ACCCATTTCATAATTAATAAGTGGTATATCAGGANG <?DGEDBHHHHHHHDHFGFHFGFGHG@@A=@;;7#: NM:i:1 MD:Z:34C1 AS:i:34 XS:i:0
SRR507778.3 0 II 569438 60 36M * 0 0 GNAAGAACCCTTATACTAACAGCAGCATAGGAAAGT ;#>45<4454GEDGGGGGGECEACBBBB=FGADAG8 NM:i:1 MD:Z:1C34 AS:i:34 XS:i:0
Lines starting with @
are metadata
They describe the conditions of the alignment
Line like this
@SQ SN:XI LN:666816
@SQ SN:IX LN:439888
@SQ SN:VII LN:1090940
describe the sequence
Line like this
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem yeast.fasta
y1.fastq -o out1.sam
describe how the program was executed. It includes
Many lines like this:
SRR507778.1 0 Mito 20158 60 36M * 0 0
ANTATAATATTATCCCCACGAGGGCCACACATGTGT
?#?;<BBB?BBGGEGGGGDDE8EEDB:?<=?BB?B7
NM:i:1 MD:Z:1A34 AS:i:34 XS:i:0
SRR507778.2 16 XV 767197 60 36M * 0 0
ACCCATTTCATAATTAATAAGTGGTATATCAGGANG
<?DGEDBHHHHHHHDHFGFHFGFGHG@@A=@;;7#:
NM:i:1 MD:Z:34C1 AS:i:34 XS:i:0
SRR507778.3 0 II 569438 60 36M * 0 0
GNAAGAACCCTTATACTAACAGCAGCATAGGAAAGT
;#>45<4454GEDGGGGGGECEACBBBB=FGADAG8
NM:i:1 MD:Z:1C34 AS:i:34 XS:i:0
Sequence Alignment/Map (SAM) format is TAB-delimited.
1 QNAME Query template/pair NAME
2 FLAG bitwise FLAG
3 RNAME Reference sequence NAME
4 POS 1-based leftmost POSition/coordinate of clipped sequence
5 MAPQ MAPping Quality (Phred-scaled)
6 CIGAR extended CIGAR string
7 MRNM Mate Reference sequence NaMe (`=' if same as RNAME)
8 MPOS 1-based Mate POSistion
9 TLEN inferred Template LENgth (insert size)
10 SEQ query SEQuence on the same strand as the reference
11 QUAL query QUALity (ASCII-33 gives the Phred base quality)
12+ OPT variable OPTional fields in the format TAG:VTYPE:VALUE
Each bit in the FLAG field is defined as:
1 p the read is paired in sequencing
2 P the read is mapped in a proper pair
4 u the query sequence itself is unmapped
8 U the mate is unmapped
16 r strand of the query (1 for reverse)
32 R strand of the mate
64 1 the read is the first read in a pair
128 2 the read is the second read in a pair
256 s the alignment is not primary
512 f the read fails platform/vendor quality checks
1024 d the read is either a PCR or an optical duplicate
2048 S the alignment is supplementary
The FLAG column is a number
It is the sum of numbers on the previous table
For example
SRR507778.1 0 ...
SRR507778.2 16 ....
The first read is not paired, and aligned in the Forward strand
The second read is not paired, and aligned in the Reverse strand
Most of the flags only make sense when we combine forward and reverse read files
bwa mem yeast.fasta y1.fastq y2.fastq > out2.sam
Each read has its “pair”
fragment ========================================
SE read --------->
PE reads R1---------> <---------R2
unknown gap ....................
The two FASTQ files need to have the same number of reads
The first read of one file matches the first read of the other file
The n-th read of one file matches the n-th read of the other file
Basically
Good ----> <----
bad <---- ---->
bad ----> ---->