Class 23: Understanding SAM files

Bioinformatics

Andrés Aravena

December 30, 2021

We can align reads to a genome

There are several tools used to map reads to a genome

bwa
bowtie
bowtie2
hisat
hisat2

Each one has several options, that you can read in their manual

Why we want to do this?

With RNA messengers we measure gene expression
With DNA from a new individual, we identify polymorphisms
- They may explain some genetic traits
For a new species of the same genus, this can be used instead of de novo assembly
For a genome assembled with De Bruijn graphs, we can find the contigs of each read

Output format

All these tools produce the same kind of files

Sequence Alignment/Mapping (SAM) files are text files

They are big. Sometimes they are encoded in binary

BAM files are Binary SAM files

Humans and all programs can read text files, only some programs can read binary files

What is the alignment data?

Many lines like this:

SRR507778.1     0       Mito    20158   60      36M     *       0       0
        ANTATAATATTATCCCCACGAGGGCCACACATGTGT
        ?#?;<BBB?BBGGEGGGGDDE8EEDB:?<=?BB?B7
        NM:i:1  MD:Z:1A34       AS:i:34 XS:i:0
SRR507778.2     16      XV      767197  60      36M     *       0       0
        ACCCATTTCATAATTAATAAGTGGTATATCAGGANG
        <?DGEDBHHHHHHHDHFGFHFGFGHG@@A=@;;7#:
        NM:i:1  MD:Z:34C1       AS:i:34 XS:i:0
SRR507778.3     0       II      569438  60      36M     *       0       0
        GNAAGAACCCTTATACTAACAGCAGCATAGGAAAGT
        ;#>45<4454GEDGGGGGGECEACBBBB=FGADAG8
        NM:i:1  MD:Z:1C34       AS:i:34 XS:i:0

(these are three long lines, written in several lines)

What is the alignment data?

Each column in Sequence Alignment/Map (SAM) has a different meaning

Column	Name	Description
1	QNAME	Query template/pair NAME
2	FLAG	bitwise FLAG
3	RNAME	Reference sequence NAME
4	POS	1-based leftmost POSition/coordinate of clipped sequence
5	MAPQ	MAPping Quality (Phred-scaled)
6	CIGAR	extended CIGAR string
7	MRNM	Mate Reference sequence NaMe (`=` if same as RNAME)
8	MPOS	1-based Mate POSition
9	TLEN	inferred Template LENgth (insert size)
10	SEQ	query SEQuence on the same strand as the reference
11	QUAL	query QUALity (ASCII-33 gives the Phred base quality)
12+	OPT	variable OPTional fields in the format `TAG:VTYPE:VALUE`

What is the flag?

Each bit in the FLAG field is defined as:

Value	Code	Meaning
1	p	the read is paired in sequencing
2	P	the read is mapped in a proper pair
4	u	the query sequence itself is unmapped
8	U	the mate is unmapped
16	r	strand of the query (1 for reverse)
32	R	strand of the mate
64	1	the read is the first read in a pair
128	2	the read is the second read in a pair
256	s	the alignment is not primary
512	f	the read fails platform/vendor quality checks
1024	d	the read is either a PCR or an optical duplicate
2048	S	the alignment is supplementary

What is a bit flag?

The FLAG column is a number

It is the sum of numbers on the previous table

For example

SRR507778.1     0       ...
SRR507778.2     16      ...

The first read is not paired, and aligned in the Forward strand

The second read is not paired, and aligned in the Reverse strand

Why there are only 0 and 16 ?

Most of the flags only make sense when we combine forward and reverse read files

bwa mem yeast.fasta y1.fastq y2.fastq > out2.sam

Now FLAGS is more interesting

99 = 64+32+2+1

1 : the read is paired in sequencing
2 : the read is mapped in a proper pair
32 : strand of the mate
64 : the read is the first read in a pair

147 = 128+16+2+1

1 : the read is paired in sequencing
2 : the read is mapped in a proper pair
16 : reverse strand
128 : the read is the second read in a pair

What do we mean by “Paired”?

Each read has its “pair”

fragment      ========================================
SE read       --------->
PE reads    R1--------->                    <---------R2
unknown gap             ....................

SE: Single end read
PE: Paired end read

How to know which reads are paired?

The two FASTQ files need to have the same number of reads

The first read of one file matches the first read of the other file

The n-th read of one file matches the n-th read of the other file

What do we mean by “proper paired”?

Basically

Both pairs are in the same chromosome
both pairs are in opposite strands
both pairs are facing each other

Good    ---->      <----
bad     <----      ---->
bad     ---->      ---->

Now FLAGS is more interesting

83 = 64+16+2+1

1 : the read is paired in sequencing
2 : the read is mapped in a proper pair
16 : reverse strand
64 : the read is the first read in a pair

163 = 128+32+2+1

1 : the read is paired in sequencing
2 : the read is mapped in a proper pair
32 : strand of the mate
128 : the read is the second read in a pair

CIGAR

The column called CIGAR (Concise Idiosyncratic Gapped Alignment Report) describes how the read mapped to the genome

In other words, it shows which operations are necessary to map the read to the reference

For full details see

Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, 1000 Genome Project Data Processing Subgroup, “The Sequence Alignment/Map format and SAMtools”, Bioinformatics, Volume 25, Issue 16, 15 August 2009, Pages 2078–2079, https://doi.org/10.1093/bioinformatics/btp352

Operations encoded in CIGAR format

M	Alignment (can be a sequence match or mismatch!)
I	Insertion in the read compared to the reference
D	Deletion in the read compared to the reference
N	Skipped region from the reference. An intron (in mRNA)
S	Soft clipping (clipped sequences are present in read)
H	Hard clipping (clipped sequences are NOT present in the alignment record);
P	Padding (silent deletion from padded reference) = Sequence match
X	Sequence mismatch (not widely used)

Example

Interpretation

r001 is the name of a read pair. It has FLAG 163 (=1 + 2 + 32 + 128)

It mapped to position 7 is the second read in the pair (128) and regarded as properly paired (1 + 2);
its mate is mapped to 37 on the reverse strand (32).

Read r002 has three soft-clipped (unaligned) bases

The coordinate shown in SAM is the position of the first aligned base.
The CIGAR string for this alignment contains a P (padding) operation which correctly aligns the inserted sequences.

Interpretation (cont)

The last six bases of read r003 map to position 9, and the first five to position 29 on the reverse strand.

The hard clipping operation H indicates that the clipped sequence is not present in the sequence field. The NM tag gives the number of mismatches.

Read r004 is aligned across an intron, indicated by the N operation.

Another example

What do you see here?

How to do your own alignments?

Use Galaxy

If you have a good super-computer, and you know how to use it, then you can do it easily

For small projects it is better to use galaxy

https://usegalaxy.org/