7 GATK pre-processing
This pipeline contains function to call variants with GATK3 MuTect2
, Samtools + VarScan2
, VarDict
and Freebayes
.
In order to run MuTect2
some GATK pre-processing are needed. The function ./functions/gatk_process_pipe.R
will perform the following steps:
- SplitNCigarReads see GATK documentation
- Base recalibration see GATK documentation.
Which are suggested in the GATK best practices for RNA-Seq variant calling
.
Below is an example call which wraps the steps above and check if files have already been created.
Rscript ./functions/gatk_process_pipe.R \
--reference_fasta path_to_genome.fa \
--bamfile ./results/star_2pass/SRR1608907Aligned.reorderedDupl.rg.bam \
--sampleName SRX381851 \
-knownSites path_to_GATK_Bundle_files/dbsnp_138.hg19.excluding_sites_after_129.vcf \
-knownSites path_to_GATK_Bundle_files/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-knownSites path_to_GATK_Bundle_files/1000G_phase1.indels.hg19.sites.vcf
The function above is a wrapper for the following GATK3
calls.
7.1 SplitNCigarReads
gatk -T SplitNCigarReads -R path_to_genome.fa \
-I ./results/star_2pass/SRR1608907Aligned.reorderedDupl.rg.bam \
-o ./results/star_2pass/SRR1608907Aligned.reorderedDupl.rg.split.bam \
--filter_mismatching_base_and_quals -U ALLOW_N_CIGAR_READS -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 \
--log_to_file ./results/star_2pass/SRR1608907_RG_DUPL_SPLIT_log
7.2 Base recalibration
Base recalibration using known sites downloaded from the GATK Bundle. More information about base recalibration can be found on GATK website.
module load gatk/3.7.0
gatk -T BaseRecalibrator -R path_to_genome.fa \
-I ./results/star_2pass/SRR1608907Aligned.reorderedDupl.rg.split.bam -nct 8 \
-knownSites path_to_GATK_Bundle_files/dbsnp_138.hg19.excluding_sites_after_129.vcf \
-knownSites path_to_GATK_Bundle_files/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-knownSites path_to_GATK_Bundle_files/1000G_phase1.indels.hg19.sites.vcf \
-o ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_recal_data.table \
--log_to_file ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_recal_step1_log
gatk -T BaseRecalibrator -R path_hg19_reference/genome.fa \
-I ./results/star_2pass/SRR1608907Aligned.reorderedDupl.rg.split.bam -nct 8 \
-knownSites path_to_GATK_Bundle_files/dbsnp_138.hg19.excluding_sites_after_129.vcf \
-knownSites path_to_GATK_Bundle_files/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-knownSites path_to_GATK_Bundle_files/1000G_phase1.indels.hg19.sites.vcf \
-BQSR ./results/star_2pass/BaseQRecal/SRR1608907/SampleName_recal_data.table \
-o ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_post_recal_data.table \
--log_to_file ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_recal_step2_log
gatk -T AnalyzeCovariates -R path_hg19_reference/genome.fa \
-before ./results/star_2pass/BaseQRecal/SRR1608907/SSRR1608907_recal_data.table \
-after ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_post_recal_data.table \
-csv ./results/star_2pass/BaseQRecal/SRR1608907/SSRR1608907_recalibration_plots.csv \
-plots ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_recalibration_plots.pdf \
--log_to_file ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_recal_analyseCov_log
gatk -T PrintReads -R path_hg19_reference/genome.fa \
-I ./results/star_2pass/SRR1608907Aligned.reorderedDupl.rg.split.bam \
-o ./results/star_2pass/SRR1608907Recal.reorderedDupl.rg.split.bam \
-nct 8 -BQSR ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_post_recal_data.table \
--log_to_file ./results/star_2pass/BaseQRecal/SRR1608907/SRR1608907_Log_recalibrated_bases
7.3 Summary with MultiQC
MultiQC also support output from GATK https://multiqc.info/docs/#gatk and you can easily create a report for the summary/log files created above.