October 19, 2018

What is the question?

We have reads that are close related to a reference genome

For example, RNA messenger from the same individual

Or DNA from a new individual from the same species

Or DNA from a species on the same genus

Why we want to do this?

The answer depends on the case

  • For RNA messengers, this allows us to measure gene expression
  • For a new individual, this allows us to identify polymorphisms
    • They may explain some genetic traits, such as mendelian diseases
  • For a new species of the same genus, this can be used instead of de novo assembly

What tool can we use align reads to a genome?

There are several tool for the same goal

Two tools that are popular today are:

  • bowtie
  • bwa

They have a similar philosophy

can you find others?

Let’s see an example

We will use bwa today. You can try with bowtie and the same data

Let’s connect to the server and change to our working directory using cd bioinfo

We start with the genome of S. cerevisiae, which is on the server at /home/anaraven/2018/bioinfo/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa.gz

This is a compressed file, as we can realize by the .gz extension

Shall we decompress the file?

Since NGS produce huge amounts of data, the files are usually compressed

Many programs, such as bwa can read files either compressed or not

But other tools cannot, so we will decompress the file

That is bad because we use more time and memory

Check every case if you can avoid decompressing

How do we decompress the file?

The original file is in read-only mode, so we cannot replace it

Instead each one will have an uncompressed copy

Type this in one line (using [TAB] to save time)

zcat /home/anaraven/2018/bioinfo/Saccharomyces_
cerevisiae.R64-1-1.dna_sm.toplevel.fa.gz > yeast.fasta

Verifying the decompression

The first way to verify is to take a look using

less yeast.fasta

Since the file is big, we cannot read it all. Instead we use the computer

grep '>' yeast.fasta

and

wc yeast.fasta

What about the reads?

For today’s exercise we will use two files

/home/anaraven/2018/bioinfo/y1.fastq
/home/anaraven/2018/bioinfo/y2.fastq

You can copy them to your folder

cp /home/anaraven/2018/bioinfo/y?.fastq .

Better: link them to your folder, so we use less memory

ln -s /home/anaraven/2018/bioinfo/y?.fastq .

Why we have two files?

Take a look at the first lines of each fastq file

You can see that the sequences are different, but they have the same ID. Why?

These are paired ends. There are two reads for each fragment.

So, can we align them now?

As we discussed previously, these programs use an index to speedup the search

So the first step is

bwa index yeast.fasta

List the files in your folder and verify that there are new files. You can take a look at them

Output

$ bwa index yeast.fasta
[bwa_index] Pack FASTA... 0.18 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 7.65 seconds elapse.
[bwa_index] Update BWT... 0.08 sec
[bwa_index] Pack forward-only FASTA... 0.06 sec
[bwa_index] Construct SA from BWT and Occ... 2.26 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index yeast.fasta
[main] Real time: 10.400 sec; CPU: 10.244 sec