October 12, 2018

Talk given by Kaan

An application of bioinformatics in a clinical context.

The goal is to determine polymorphisms in genes associated with known diseases.

A specific gene is amplified using PCR. The product is sequenced using NGS.

Bioinformatic analysis starts with the FASTQ files and the human genome as the reference.

The first step determines the position of each read in the genome. That is, it takes the read sequence and looks for an identical (or nearly identical) subsequence in the genome.
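A toy sketch of what "finding the position of a read" means, assuming exact matches only. The function name find_read and the example sequences are made up for illustration; real aligners tolerate mismatches and do not scan the genome like this.

```shell
# find_read REF QUERY prints the 1-based position of the first exact match
# of QUERY in REF, or 0 when QUERY does not occur in REF.
# (Hypothetical helper, exact matching only.)
find_read() {
    awk -v ref="$1" -v query="$2" 'BEGIN { print index(ref, query) }'
}

# A 4-base "read" against a 12-base "genome".
find_read "ACGTTAGCCATG" "TAGC"   # prints 5
```

A scan like this has to walk the reference for every read, which is why real tools rely on an index instead.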

The tool typically used for this is bwa (Burrows-Wheeler Aligner). It uses the Burrows-Wheeler transform to find the places in the genome where subsequences match each of the reads.
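To get a feeling for the transform itself, here is a minimal sketch that computes the Burrows-Wheeler transform of a short string with standard tools. The function name bwt is an assumption, "$" is used as the end-of-text marker, and the approach (sort all rotations, keep the last column) is the textbook definition, not how bwa computes it on a whole genome. It only shows the transform, not the search that bwa builds on top of it.

```shell
# bwt TEXT prints the Burrows-Wheeler transform of TEXT with a "$" end marker.
# (Hypothetical helper for short toy strings only.)
bwt() {
    printf '%s\n' "$1" |
    # 1. Append the end marker and print every rotation of the text.
    awk '{ t = $0 "$"; n = length(t)
           for (i = 1; i <= n; i++) print substr(t, i) substr(t, 1, i - 1) }' |
    # 2. Sort the rotations bytewise ("$" sorts before the letters).
    LC_ALL=C sort |
    # 3. The transform is the last character of each sorted rotation.
    awk '{ out = out substr($0, length($0), 1) } END { print out }'
}

bwt "GATTACA"   # prints ACTGA$TA
```

Equal letters tend to end up next to each other in the output, which is what makes the transformed text easy to compress and to index.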

To speed up the search, bwa uses an index of the reference genome. We will discuss later how to make these indices and what they mean.

Reads are compressed using gzip. This is not the same as zip: gzip compresses a single file, whereas zip compresses several files and archives them together.
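A small sketch of working with a gzip-compressed FASTQ file, assuming the usual four lines per read. The file toy.fq and the helper count_reads are invented for the example; everything happens in a scratch directory.

```shell
# count_reads FILE prints the number of records in a gzip-compressed FASTQ
# file, assuming four lines per record. (Hypothetical helper.)
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Make a toy FASTQ file with two reads in a scratch directory.
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$tmpdir/toy.fq"

# gzip works on one file: toy.fq is replaced by toy.fq.gz.
gzip "$tmpdir/toy.fq"

# The reads can be streamed without unpacking the file on disk.
count_reads "$tmpdir/toy.fq.gz"   # prints 2
```

Tools such as bwa read .fq.gz files directly, so there is usually no need to decompress them first.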

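# Align the paired-end reads. -M marks shorter split hits as secondary,
# -R sets the read group line written to the SAM header.
bwa mem -M \
    -R "@RG\tID:sample1\tSM:sample1\tPL:illumina\tLB:sample1\tPU:1" \
    reference/ninespine.fa \
    data/sample-1_1.fq.gz \
    data/sample-1_2.fq.gz > data/sample-1_bwa.sam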

# Convert SAM to BAM, keeping the header (-h) and dropping unmapped reads (-F4).
samtools view -F4 -h -Obam \
    -o data/sample-1_bwa.bam data/sample-1_bwa.sam

samtools view data/sample-1_bwa.bam | less -S
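The second column of each alignment line is the FLAG, a sum of bit values. The -F4 option of samtools view drops every read whose FLAG has bit 4 (read unmapped) set. A minimal sketch of that test, with a made-up helper name is_unmapped and two common example FLAG values:

```shell
# is_unmapped FLAG succeeds when bit 4 (read unmapped) is set in a SAM FLAG.
# (Hypothetical helper that mirrors the test behind "samtools view -F4".)
is_unmapped() {
    [ $(( $1 & 4 )) -ne 0 ]
}

# 77 = 1 + 4 + 8 + 64 (bit 4 set), 99 = 1 + 2 + 32 + 64 (bit 4 not set).
for flag in 77 99; do
    if is_unmapped "$flag"; then
        echo "$flag: unmapped, dropped by -F4"
    else
        echo "$flag: mapped, kept by -F4"
    fi
done
```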

samtools sort -T /tmp/sample-1 -O bam \
    -o data/sample-1_bwa_sorted.bam \
    data/sample-1_bwa.bam

samtools rmdup data/sample-1_bwa_sorted.bam \
    data/sample-1_bwa_rmdup.bam

samtools index data/sample-1_bwa_rmdup.bam \
    data/sample-1_bwa_rmdup.bai

GATK

GATK -T RealignerTargetCreator -R hg19_reference.fa \
    -I bwa_rmdup.bam -o 8246_target_intervals.list

GATK -T IndelRealigner -R hg19_reference.fa \
    -I data/sample-1_bwa_rmdup.bam \
    -targetIntervals data/sample-1_intervals.list \
    -o data/sample-1_realn.bam

GATK -T HaplotypeCaller -R hg19_reference.fa \
    -I realn_8246.bam \
    -variant_index_type LINEAR \
    -variant_index_parameter 128000 \
    -gt_mode DISCOVERY -ERC GVCF \
    -stand_call_conf 10 \
    -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 \
    -o calls.gvcf

# -o already writes the calls to a file, so there is nothing to pipe.
# tabix needs BGZF compression, so compress with bgzip rather than gzip.
bgzip calls.gvcf

tabix -p vcf calls.gvcf.gz

GATK -T GenotypeGVCFs -R hg19_reference.fa \
    --variant calls.gvcf.gz -o calls.vcf

# Compress and index the result once GATK has finished writing it.
bcftools view -Oz -o calls.vcf.gz calls.vcf
tabix -p vcf calls.vcf.gz

GATK -T SelectVariants -R hg19_reference.fa \
    --variant calls.vcf.gz -select "QUAL > 20.0" \
    -select "QD > 10.0" -select "MQ > 25.0" \
    -o filtered.vcf

java -Xmx4g -jar snpEff.jar hg19 \
    HC.qual500.filtered.vcf > filtered.ann.vcf

GATK -T HaplotypeCaller -R hg19_reference.fa \
    -I realn_8246.bam -o ra-variant_8246.vcf