8 Variant calling
You can perform variant calling on the whole genome or on specific regions that you can specify with a .bed
file.
Here we show how to call variants on specific regions of interest. The target regions were created using the gene symbols of the mutations listed in Supplemental Table 3 of (Lavallée et al. 2016) and listed in ./data/mutations_Lavallee_2016.csv
. We used the hg19 inbuilt annotation of Rsubread
to obtain the gene ranges and added 500 bp at the end and at the beginning of each gene. The final bed
files is ./data/target_regions.bed
.
View all the options in the ./functions/call_variants.R
function.
Rscript ./functions/call_variants.R --help
Below is an example on how to call variants with VarDict
and VarScan
in tumour-only mode and annotate them with the Variant Effect Preditor (VEP). The directory needed for VarDict can be downloaded from the software GitHub page https://github.com/AstraZeneca-NGS/VarDict.
For other caller options:
-c CHARACTER, --variant_caller=CHARACTER
Variant caller to be used. Possible values are 'mutect', 'varscan', 'freebayes' and 'vardict'. Samtools mpileup is used to create VarScan input.
The function above is a wrapper for the calls done by each caller. Below are details of the actual caller settings that I normally use to call variants with these callers.
8.1 Call with MuTect2
module load gatk/3.7.0
gatk -T MuTect2 -R path_to_genome.fa \
-I:tumor ./results/star_2pass/SRR1608907_Aligned.reorderedDupl.rg.split.recalibrated.bam \
-L ./data/target_regions.bed \
-o ./results/mutect/regions/SampleName_germline_snvs_indels.vcf \
-log ./results/mutect/regions/SampleName_germline_snvs_indels_log
8.2 Call with Samtools + VarScan2
module load varscan/2.3.9
module load samtools/1.6
samtools mpileup --output-tags AD,ADF,ADR,DP,SP \
--fasta-ref path_to_genome.fa \
-l ./data/target_regions.bed ./results/star_2pass/SRR1608907_Aligned.reorderedDupl.rg.split.recalibrated.bam | varscan mpileup2cns --variants 1 --output-vcf 1 --min-var-freq 0.01 > ./results/varscan/regions/SampleName_germline_snvs_indels.vcf 2> ./results/varscan/regions/SampleName_germline_snvs_indels_log
8.3 Call with VarDict
The teststrandbias.R
and var2vcf_valid.pl
scripts were downloaded from https://github.com/AstraZeneca-NGS/VarDict. VarDict only calls variants on a target region.
module load vardict/1.5.1
vardict -f 0.05 -c 1 -S 2 -E 3 -g 4 -r 2 -t -th 10 -v -G path_to_genome.fa \
-b ./results/star_2pass/SRR1608907_Aligned.reorderedDupl.rg.bam ./data/target_regions.bed | vardict_dir/teststrandbias.R | vardict_dir/var2vcf_valid.pl -N -E -f 0.05 > ./results/vardict/regions/SampleName_germline_snvs_indels.vcf
8.4 Annotate variants
This step is performed within the call_variants.R
function when the path to the VEP cache library is provided.
--VEPcall 'vep --dir_cache path_to_vep_cache/.vep --offline' \
module load ensembl-vep/89.0
vep --dir_cache dir_to_VEP_cache/.vep --offline \
-i ./results/varscan/regions/SampleName_germline_snvs_indels.vcf \
-o ./results/varscan/regions/annotated_variants/SampleName_germline_annotated.vcf \
--cache --everything --force_overwrite --assembly GRCh37 --fork 12 --vcf --port 3337
8.5 Targeted INDEL calling with km
algorithm
The km
algorithm (Eric Olivier Audemard, Patrick Gendron, Vincent-Philippe Lavallée, Josée Hébert, Guy Sauvageau, Sébastien Lemieux 2018) is one of the sofware developed for targeted INDEL calls in RNA-Seq. Calling INDELs from RNA-Seq is challenging and several tools have been published between 2017-2019 to accomplish this task. The km
algorithm was benchmarked on the AML Leucegene and TCGA-LAML datasets.
As shown in Figure 1.1 I used it together with VarDict to create a set of INDELs calls from leukemia RNA-Seq. In my approach, I call variants with both callers and curate their ouput by checking INDELs that are called by both callers (only one calls is preserved) or by only one call. In my experience, their calls are pretty similar and I was able to call FLT3 Internal Tandem Duplications on my samples using both callers.