Biologist's bioinformatics notes

GATK, provided by Broad institute, has become defacto standard for calling variants in WES, WGS and targeted/panel based sequencing data for many labs across the world. In addition, GATK team provided best practices workflows

To help novice analysts
To introduce workflows routinely used at ivory towers of research to rest of the world
To standardize workflows across the labs (thus countries) to get reproducible results

Till recent times, GATK team provided workflows only DNAseq data, but not for RNAseq. However, GATK team came up best practices workflow for calling variants using RNAseq as outlined below:

(Based on workflow outlined here)

In this note, following steps are executed and below the steps, a flow diagram is shown.

Quality check by fastqc
Trimming by cutadapt
Alignment by tophat
Alignment statistics collection by samtools
Sorting aligned file by samtools
Index the aligned file by samtools
Deduplicating aligned file by picard
Add Read Groups (RG) tag by picard
Splitting N cigar reads by GATK
Indel Recalibration (Creating intervals and Indel realignment with GATK)
Recalibration files (First pass and Second pass with GATK)
Quality check for recalibration (before and after plots by GATK)
Base quality score recalibration (BQSR)
Variant calling
Variant filtering

Add caption

RNA seq data for this note is downloaded from here and below is the bash script for running the pipeline: ( Please note that following additional steps are also included in the pipeline, but they are commented out. Do not mix variant calling (VCF) with genomic variant calling (gVCF))

Discordant and concordant variant filtering
gVCF calling
merging gVCFs

User should have raw_reads directory in the current directory with all the fastq files and knows the reference path. Replace reference paths with appropriate

######################

####bash script starts######

#####################

for i in $(ls ./raw_reads | grep ^[^d]| rev | cut -c 10- | rev | uniq)
do

# Make direcotry for fastqc
mkdir -p ${i%}_fastqc

# Run fastqc
fastqc -o ./${i%}_fastqc -f fastq /home/user/rnaseq/raw_reads/${i%}_R1.fastq /home/user/rnaseq/raw_reads/${i%}_R2.fastq

# Make directory to store cutadapt results
mkdir -p cutadapt

#Run cutadapt
cutadapt --quality-cutoff=20 --format=fastq -o ./cutadapt/${i%}_R1.cutadapt.fastq -p ./cutadapt/${i%}_R2.cutadapt.fastq /home/user/rnaseq/raw_reads/${i%}_R1.fastq /home/user/rnaseq/raw_reads/${i%}_R2.fastq

# Run top hat and save the out with directory extension _tophatout
tophat2 --no-coverage-search -o ./${i%}_tophat_out -G ./reference/genes_chr12.gtf -p 2 ./reference/chr12 ./cutadapt/${i%}_R1.cutadapt.fastq ./cutadapt/${i%}_R2.cutadapt.fastq

# Collect the stats
samtools flagstat ./${i%}_tophat_out/accepted_hits.bam > ./${i%}_tophat_out/accepted_hits.flagstats

#Sort bam
samtools sort -T /tmp/align.bam ./${i%}_tophat_out/accepted_hits.bam -o ./${i%}_tophat_out/accepted_hits.pos.sorted.bam

# Index the bam
samtools index ./${i%}_tophat_out/accepted_hits.pos.sorted.bam

# picard mkdir
mkdir -p ${i%}_picard

# Dedup bam
java -jar /opt/picard-tools-1.119/MarkDuplicates.jar \
   METRICS_FILE=${i%}_picard/${i%}.dedup.metrics \
   REMOVE_DUPLICATES=true ASSUME_SORTED=true \
   VALIDATION_STRINGENCY=LENIENT \
   CREATE_INDEX=true \
   I=./${i%}_tophat_out/accepted_hits.pos.sorted.bam \
   O=./${i%}_picard/q20.cutadapt.sorted.dedup.bam

# Add read groups
java -jar /opt/picard-tools-1.119/AddOrReplaceReadGroups.jar \
   I=./${i%}_picard/q20.cutadapt.sorted.dedup.bam \
   O=./${i%}_picard/q20.cutadapt.sorted.dedup.rg.bam \
   CREATE_INDEX=true SO=coordinate RGID=${i%%_*} RGLB=${i%} RGPL=ILLUMINA RGSM=${i%} RGPU=GCCAAT

# mkdir gatk
mkdir -p ./${i%}_gatk

# Split N cigar reads
java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
   -T SplitNCigarReads \
   -R /home/user/rnaseq/reference/chr12.fa \
   -I ./${i%}_picard/q20.cutadapt.sorted.dedup.rg.bam \
   -o ./${i%}_gatk/q20.cutadapt.sorted.dedup.snc.bam \
   -U ALLOW_N_CIGAR_READS

#Sort bam
samtools sort -T /tmp/align.bam ./${i%}_gatk/q20.cutadapt.sorted.dedup.snc.bam -o ./${i%}_gatk/q20.cutadapt.sorted.dedup.snc.sorted.bam

# Index the bam
samtools index ./${i%}_gatk/q20.cutadapt.sorted.dedup.snc.sorted.bam

# Create intervals
java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
   -T RealignerTargetCreator \
   -R /home/user/rnaseq/reference/chr12.fa \
   -I ./${i%}_gatk/q20.cutadapt.sorted.dedup.snc.sorted.bam \
   --known /home/user/ngs_exercise/data/hg19/Mills_and_1000G_gold_standard.indels.hg19.chr12.vcf \
   -L chr12 -o ./${i%}_gatk/forIndelRealigner.intervals

# Realign bam files
java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
   -T IndelRealigner \
   -R /home/user/rnaseq/reference/chr12.fa \
   -I ./${i%}_gatk/q20.cutadapt.sorted.dedup.snc.sorted.bam \
   -known /home/user/ngs_exercise/data/hg19/Mills_and_1000G_gold_standard.indels.hg19.chr12.vcf \
   -L chr12 \
   -targetIntervals ./${i%}_gatk/forIndelRealigner.intervals \
   -o ./${i%}_gatk/q20.cutadapt.sorted.dedup.snc.sorted.realigned.bam

# Base recalibration First pass
java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R /home/user/rnaseq/reference/chr12.fa \
   -I ./${i%}_gatk/q20.cutadapt.sorted.dedup.snc.sorted.realigned.bam \
   -knownSites /home/user/rnaseq/reference/chr12.dbsnp.b141.b37.hg19.tidy.vcf \
   -o ./${i%}_gatk/recal_data.table

# Base recalibration second pass
java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R /home/user/rnaseq/reference/chr12.fa \
   -I ./${i%}_gatk/q20.cutadapt.sorted.dedup.snc.sorted.realigned.bam \
   -knownSites /home/user/rnaseq/reference/chr12.dbsnp.b141.b37.hg19.tidy.vcf \
   -L chr12 \
   -knownSites /home/user/ngs_exercise/data/hg19/Mills_and_1000G_gold_standard.indels.hg19.chr12.vcf \
   -BQSR ./${i%}_gatk/recal_data.table \
   -o ./${i%}_gatk/post_recal_data.table

# Create plots
java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
   -T AnalyzeCovariates \
   -R /home/user/rnaseq/reference/chr12.fa \
   -L chr12 \
   -before ./${i%}_gatk/recal_data.table \
   -after ./${i%}_gatk/post_recal_data.table \
   -plots ./${i%}_gatk/recalibration_plots.pdf

# Apply BQSR to sequence
java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
   -T PrintReads \
   -R /home/user/rnaseq/reference/chr12.fa \
   -I ./${i%}_gatk/q20.cutadapt.sorted.dedup.snc.sorted.realigned.bam \
   -L chr12 \
   -BQSR ./${i%}_gatk/recal_data.table \
   -o ./${i%}_gatk/q20.cutadapt.sorted.dedup.snc.sorted.realigned.recal.bam

# Variant calling
java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
   -T HaplotypeCaller \
   -R /home/user/rnaseq/reference/chr12.fa \
   -I ./MeOH_REP1_gatk/q20.cutadapt.sorted.dedup.snc.sorted.realigned.recal.bam \
   -I ./MeOH_REP2_gatk/q20.cutadapt.sorted.dedup.snc.sorted.realigned.recal.bam \
   -I ./MeOH_REP3_gatk/q20.cutadapt.sorted.dedup.snc.sorted.realigned.recal.bam \
   --dbsnp /home/user/rnaseq/reference/chr12.dbsnp.b141.b37.hg19.tidy.vcf \
   -dontUseSoftClippedBases \
   -stand_call_conf 20 \
   -stand_emit_conf 20 \
   -o ./meoh_output.raw.snps.indels.vcf

java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
   -T HaplotypeCaller \
   -R /home/user/rnaseq/reference/chr12.fa \
   -I ./R3G_REP1_gatk/q20.cutadapt.sorted.dedup.snc.sorted.realigned.recal.bam \
   -I ./R3G_REP2_gatk/q20.cutadapt.sorted.dedup.snc.sorted.realigned.recal.bam \
   -I ./R3G_REP3_gatk/q20.cutadapt.sorted.dedup.snc.sorted.realigned.recal.bam \
   --dbsnp /home/user/rnaseq/reference/chr12.dbsnp.b141.b37.hg19.tidy.vcf \
   -dontUseSoftClippedBases \
   -stand_call_conf 20 \
   -stand_emit_conf 20 \
   -o ./r3g_output.raw.snps.indels.vcf

# Discordant calls
#java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
#   -T SelectVariants \
#   -R /home/user/rnaseq/reference/chr12.fa \
# -V ./meoh_output.raw.snps.indels.vcf \
#   --discordance ./r3g_output.raw.snps.indels.vcf \
#   -o doutput.vcf \

# Concordant calls
#java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
#   -T SelectVariants \
#   -R /home/user/rnaseq/reference/chr12.fa \
#   -V ./meoh_output.raw.snps.indels.vcf \
#   --concordance ./r3g_output.raw.snps.indels.vcf \
#   -o coutput.vcf \

# Variant filtering
java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
   -T VariantFiltration \
   -R /home/user/rnaseq/reference/chr12.fa \
   -V ./${i%}_gatk/output.raw.snps.indels.vcf \
   -window 35 \
   -cluster 3 \
   -filterName FS \
   -filter "FS > 30.0" \
   -filterName QD \
   -filter "QD < 2.0" \
   -o ./${i%}_gatk/output.raw.snps.indels.filtered.vcf

#GenotypeGVCFs MeOH
#java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
#   -T GenotypeGVCFs \
#   -R /home/user/rnaseq/reference/chr12.fa \
#   -V ./MeOH_REP1_gatk/output.raw.snps.indels.g.vcf \
#   -V ./MeOH_REP2_gatk/output.raw.snps.indels.g.vcf \
#   -V ./MeOH_REP3_gatk/output.raw.snps.indels.g.vcf \
#   -o meoh_output.vcf

#GenotypeGVCFs R3G
#java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
#   -T GenotypeGVCFs \
#   -R /home/user/rnaseq/reference/chr12.fa \
#   -V ./R3G_REP1_gatk/output.raw.snps.indels.g.vcf \
#   -V ./R3G_REP2_gatk/output.raw.snps.indels.g.vcf \
#   -V ./R3G_REP3_gatk/output.raw.snps.indels.g.vcf \
#   -o r3g_output.vcf

#GenotypeGVCFs combine R3G Meoh
#java -jar /opt/gatk-3.4-46/GenomeAnalysisTK.jar \
#   -T CombineGVCFs \
#   -R /home/user/rnaseq/reference/chr12.fa \
#   -V r3g_output.vcf \
#   -V meoh_output.vcf \
#   -o meoh_r3g_output.vcf

done

##############################

###bash script ends##############

####################

Recent Posts

Links

Oct 10, 2015 - RNA seq variant calling best practice pipeline