7 GATK pre-processing

This pipeline contains function to call variants with GATK3 MuTect2, Samtools + VarScan2, VarDict and Freebayes. In order to run MuTect2 some GATK pre-processing are needed. The function ./functions/gatk_process_pipe.R will perform the following steps:

Which are suggested in the GATK best practices for RNA-Seq variant calling.

Below is an example call which wraps the steps above and check if files have already been created.

Rscript ./functions/gatk_process_pipe.R \
--reference_fasta path_to_genome.fa \
--bamfile ./results/star_2pass/SRR1608907Aligned.reorderedDupl.rg.bam \
--sampleName SRX381851 \
-knownSites path_to_GATK_Bundle_files/dbsnp_138.hg19.excluding_sites_after_129.vcf \
-knownSites path_to_GATK_Bundle_files/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-knownSites path_to_GATK_Bundle_files/1000G_phase1.indels.hg19.sites.vcf 

The function above is a wrapper for the following GATK3 calls.

7.1 SplitNCigarReads

gatk -T SplitNCigarReads -R path_to_genome.fa \
-I ./results/star_2pass/SRR1608907Aligned.reorderedDupl.rg.bam \
-o ./results/star_2pass/SRR1608907Aligned.reorderedDupl.rg.split.bam \
--filter_mismatching_base_and_quals -U ALLOW_N_CIGAR_READS -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 \
--log_to_file ./results/star_2pass/SRR1608907_RG_DUPL_SPLIT_log

7.2 Base recalibration

Base recalibration using known sites downloaded from the GATK Bundle. More information about base recalibration can be found on GATK website.

module load gatk/3.7.0

gatk -T BaseRecalibrator -R path_to_genome.fa \
-I ./results/star_2pass/SRR1608907Aligned.reorderedDupl.rg.split.bam -nct 8 \
-knownSites path_to_GATK_Bundle_files/dbsnp_138.hg19.excluding_sites_after_129.vcf \
-knownSites path_to_GATK_Bundle_files/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-knownSites path_to_GATK_Bundle_files/1000G_phase1.indels.hg19.sites.vcf \
-o ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_recal_data.table \
--log_to_file ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_recal_step1_log 

gatk -T BaseRecalibrator -R path_hg19_reference/genome.fa \
-I ./results/star_2pass/SRR1608907Aligned.reorderedDupl.rg.split.bam -nct 8 \
-knownSites path_to_GATK_Bundle_files/dbsnp_138.hg19.excluding_sites_after_129.vcf \
-knownSites path_to_GATK_Bundle_files/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-knownSites path_to_GATK_Bundle_files/1000G_phase1.indels.hg19.sites.vcf \
-BQSR ./results/star_2pass/BaseQRecal/SRR1608907/SampleName_recal_data.table \
-o ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_post_recal_data.table \
--log_to_file ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_recal_step2_log 

gatk -T AnalyzeCovariates -R path_hg19_reference/genome.fa \
-before ./results/star_2pass/BaseQRecal/SRR1608907/SSRR1608907_recal_data.table \
-after ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_post_recal_data.table \
-csv ./results/star_2pass/BaseQRecal/SRR1608907/SSRR1608907_recalibration_plots.csv \
-plots ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_recalibration_plots.pdf \
--log_to_file ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_recal_analyseCov_log 

gatk -T PrintReads -R path_hg19_reference/genome.fa \
-I ./results/star_2pass/SRR1608907Aligned.reorderedDupl.rg.split.bam \
-o ./results/star_2pass/SRR1608907Recal.reorderedDupl.rg.split.bam \
-nct 8 -BQSR ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_post_recal_data.table \
--log_to_file ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_Log_recalibrated_bases

7.3 Summary with MultiQC

MultiQC also support output from GATK https://multiqc.info/docs/#gatk and you can easily create a report for the summary/log files created above.