Class 15: Mapping reads to a reference

Bioinformatics

Andrés Aravena

January 15, 2021

What is the question?

We have reads that are close related to a reference genome

For example, RNA messengers

Or DNA from a new individual from the same species

Or DNA from a species on the same genus

Or reads that we want to map to an assembled genome

Why we want to do this?

The answer depends on the case

  • With RNA messengers we measure gene expression
  • With DNA frin a new individual, we identify polymorphisms
    • They may explain some genetic traits
  • For a new species of the same genus, this can be used instead of de novo assembly
  • For a genome assembled with De Bruijn graphs, we can find the contigs of each read

What tool can we use align reads to a genome?

There are several tool for the same goal

Two tools that are popular today are:

  • bwa: Burrows-Wheeler Alignment Tool
  • bowtie and bowtie2

They have a similar philosophy

Can you find others?

Using bwa

First, it makes an index of the genome

bwa index ref.fa

Then it aligns the reads to the genome

bwa mem ref.fa reads.fq > aln-se.sam

We use the extension .sam. These are large text files

Sometimes .sam files are encoded in smaller .bam binary format files

What is inside the .sam file?

@SQ     SN:XI   LN:666816
@SQ     SN:IX   LN:439888
@SQ     SN:IV   LN:1531933
@SQ     SN:III  LN:316620
@SQ     SN:X    LN:745751
@SQ     SN:XII  LN:1078177
@SQ     SN:VIII LN:562643
@SQ     SN:VI   LN:270161
@SQ     SN:V    LN:576874
@SQ     SN:II   LN:813184
@SQ     SN:XIV  LN:784333
@SQ     SN:XIII LN:924431
@SQ     SN:XVI  LN:948066
@SQ     SN:XV   LN:1091291
@SQ     SN:I    LN:230218
@SQ     SN:Mito LN:85779
@SQ     SN:VII  LN:1090940
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem yeast.fasta y1.fastq -o out1.sam
SRR507778.1     0       Mito    20158   60      36M     *       0       0       ANTATAATATTATCCCCACGAGGGCCACACATGTGT    ?#?;<BBB?BBGGEGGGGDDE8EEDB:?<=?BB?B7    NM:i:1  MD:Z:1A34       AS:i:34 XS:i:0
SRR507778.2     16      XV      767197  60      36M     *       0       0       ACCCATTTCATAATTAATAAGTGGTATATCAGGANG    <?DGEDBHHHHHHHDHFGFHFGFGHG@@A=@;;7#:    NM:i:1  MD:Z:34C1       AS:i:34 XS:i:0
SRR507778.3     0       II      569438  60      36M     *       0       0       GNAAGAACCCTTATACTAACAGCAGCATAGGAAAGT    ;#>45<4454GEDGGGGGGECEACBBBB=FGADAG8    NM:i:1  MD:Z:1C34       AS:i:34 XS:i:0

What does this mean?

Lines starting with @ are metadata

They describe the conditions of the alignment

Line like this

@SQ     SN:XI   LN:666816
@SQ     SN:IX   LN:439888
@SQ     SN:VII  LN:1090940

describe the sequence

What does this mean?

Line like this

@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem yeast.fasta
 y1.fastq -o out1.sam

describe how the program was executed. It includes

  • ID: id of the command
  • PN: program name
  • VN: version number
  • CL: command line

What is the alignment data?

Many lines like this:

SRR507778.1     0       Mito    20158   60      36M     *       0       0
        ANTATAATATTATCCCCACGAGGGCCACACATGTGT
        ?#?;<BBB?BBGGEGGGGDDE8EEDB:?<=?BB?B7
        NM:i:1  MD:Z:1A34       AS:i:34 XS:i:0
SRR507778.2     16      XV      767197  60      36M     *       0       0
        ACCCATTTCATAATTAATAAGTGGTATATCAGGANG
        <?DGEDBHHHHHHHDHFGFHFGFGHG@@A=@;;7#:
        NM:i:1  MD:Z:34C1       AS:i:34 XS:i:0
SRR507778.3     0       II      569438  60      36M     *       0       0
        GNAAGAACCCTTATACTAACAGCAGCATAGGAAAGT
        ;#>45<4454GEDGGGGGGECEACBBBB=FGADAG8
        NM:i:1  MD:Z:1C34       AS:i:34 XS:i:0

What is the alignment data?

Sequence Alignment/Map (SAM) format is TAB-delimited.

1    QNAME   Query template/pair NAME
2    FLAG    bitwise FLAG
3    RNAME   Reference sequence NAME
4    POS     1-based leftmost POSition/coordinate of clipped sequence
5    MAPQ    MAPping Quality (Phred-scaled)
6    CIGAR   extended CIGAR string
7    MRNM    Mate Reference sequence NaMe (`=' if same as RNAME)
8    MPOS    1-based Mate POSistion
9    TLEN    inferred Template LENgth (insert size)
10   SEQ     query SEQuence on the same strand as the reference
11   QUAL    query QUALity (ASCII-33 gives the Phred base quality)
12+  OPT     variable OPTional fields in the format TAG:VTYPE:VALUE

What is the flag?

Each bit in the FLAG field is defined as:

1    p   the read is paired in sequencing
2    P   the read is mapped in a proper pair
4    u   the query sequence itself is unmapped
8    U   the mate is unmapped
16   r   strand of the query (1 for reverse)
32   R   strand of the mate
64   1   the read is the first read in a pair
128  2   the read is the second read in a pair
256  s   the alignment is not primary
512  f   the read fails platform/vendor quality checks
1024 d   the read is either a PCR or an optical duplicate
2048 S   the alignment is supplementary

What is a bit flag?

The FLAG column is a number

It is the sum of numbers on the previous table

For example

SRR507778.1     0       ...
SRR507778.2     16      ....

The first read is not paired, and aligned in the Forward strand

The second read is not paired, and aligned in the Reverse strand

Why there are only 0 and 16 ?

Most of the flags only make sense when we combine forward and reverse read files

bwa mem yeast.fasta y1.fastq y2.fastq > out2.sam

Now FLAGS is more interesting

99 = 64+32+2+1
  • 1 : the read is paired in sequencing
  • 2 : the read is mapped in a proper pair
  • 32 : strand of the mate
  • 64 : the read is the first read in a pair
147 = 128+16+2+1
  • 1 : the read is paired in sequencing
  • 2 : the read is mapped in a proper pair
  • 16 : reverse strand
  • 128 : the read is the second read in a pair

What do we mean by “Paired”?

Each read has its “pair”

fragment      ========================================
SE read       --------->
PE reads    R1--------->                    <---------R2
unknown gap             ....................
SE
Single end read
PE
Paired end read

How does BWA know which reads are part of a pair?

The two FASTQ files need to have the same number of reads

The first read of one file matches the first read of the other file

The n-th read of one file matches the n-th read of the other file

What do we mean by “proper paired”?

Basically

  • Both pairs are in the same chromosome
  • both pairs are in opposite strands
  • both pairs are facing each other
Good    ---->      <----
bad     <----      ---->
bad     ---->      ---->

Now FLAGS is more interesting

83 = 64+16+2+1
  • 1 : the read is paired in sequencing
  • 2 : the read is mapped in a proper pair
  • 16 : reverse strand
  • 64 : the read is the first read in a pair
163 = 128+32+2+1
  • 1 : the read is paired in sequencing
  • 2 : the read is mapped in a proper pair
  • 32 : strand of the mate
  • 128 : the read is the second read in a pair