9 varikondo and superFreq: Standardise and combine variants over time

We are now at the stage where variants are called and we have loads of VCF files/ exel files. The scripts in the previous sections helps with calling variants using 4 callers (VarDict, Varscan, MuTect2, km and freebayes).

At this stage, depending on how one needs to use the variants, there are several options discussed below. To make it easier to import variant calls into R and store them in data frame I developed the R package varikondo which hopefully will help you to find the spark joy in your variants!

Below are some applications but for a more in depth description of its functionalities see the package website https://annaquaglieri16.github.io/varikondo/articles/vignette.html.

9.1 parse_vcf_output: read in your single sample VCF file in R

If you simply want to read a VCF file in R you can use parse_vcf_output which currently support tumour-only variants from GATK MuTect2, VarDict, VarScan and Freebayes.

library(varikondo)

annot_vcf_mutect <- system.file("extdata", "chr20_mutect.vcf.gz", package = "varikondo")
annot_vcf_mutect
## [1] "/Users/quaglieri.a/Library/R/3.5/library/varikondo/extdata/chr20_mutect.vcf.gz"
parsed_vcf_mutect <- varikondo::parse_vcf_output(vcf_path = annot_vcf_mutect, 
                                                 caller = "mutect",
                                                 sample_name = "Sample1",
                                                 vep = TRUE)
knitr::kable(parsed_vcf_mutect[1:10,],caption = "Parsed MuTect2 output, from VCF to data frame.")
Table 9.1: Parsed MuTect2 output, from VCF to data frame.
Location caller chrom pos end ref alt qual filter genotype tot_depth VAF ref_depth alt_depth ref_forw ref_rev alt_forw alt_rev start width ref_base_quality alt_base_quality SampleName Allele Consequence IMPACT SYMBOL Gene Feature_type Feature BIOTYPE EXON INTRON HGVSc HGVSp cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation DISTANCE STRAND FLAGS VARIANT_CLASS SYMBOL_SOURCE HGNC_ID CANONICAL TSL APPRIS CCDS ENSP SWISSPROT TREMBL UNIPARC GENE_PHENO SIFT PolyPhen DOMAINS AF AFR_AF AMR_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF ExAC_AF ExAC_Adj_AF ExAC_AFR_AF ExAC_AMR_AF ExAC_EAS_AF ExAC_FIN_AF ExAC_NFE_AF ExAC_OTH_AF ExAC_SAS_AF MAX_AF MAX_AF_POPS CLIN_SIG SOMATIC PHENO PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POS MOTIF_SCORE_CHANGE IMPACT_rank
chr20_30948281 mutect2 chr20 30948281 30948281 A G 11.000 PASS 0/1 3 1 0 3 0 0 0 3 30948281 1 0 66 Sample1 G intron_variant MODIFIER ASXL1 ENSG00000171456 Transcript ENST00000306058 protein_coding 1/11 rs2208131 1 SNV HGNC 18318 ENSP00000305119 A6NIZ6 UPI00015DF8B6 1 0.4093 0.1672 0.4337 0.744 0.3499 0.4356 0.744 EAS 1
chr20_30948281 mutect2 chr20 30948281 30948281 A G 11.000 PASS 0/1 3 1 0 3 0 0 0 3 30948281 1 0 66 Sample1 G intron_variant MODIFIER ASXL1 ENSG00000171456 Transcript ENST00000375687 protein_coding 1/12 rs2208131 1 SNV HGNC 18318 YES CCDS13201.1 ENSP00000364839 Q8IXJ9 UPI000036702C 1 0.4093 0.1672 0.4337 0.744 0.3499 0.4356 0.744 EAS 1
chr20_30948281 mutect2 chr20 30948281 30948281 A G 11.000 PASS 0/1 3 1 0 3 0 0 0 3 30948281 1 0 66 Sample1 G intron_variant MODIFIER ASXL1 ENSG00000171456 Transcript ENST00000375689 protein_coding 1/4 rs2208131 1 SNV HGNC 18318 ENSP00000364841 Q5JWS8 UPI00004A2D4D 1 0.4093 0.1672 0.4337 0.744 0.3499 0.4356 0.744 EAS 1
chr20_30948281 mutect2 chr20 30948281 30948281 A G 11.000 PASS 0/1 3 1 0 3 0 0 0 3 30948281 1 0 66 Sample1 G intron_variant MODIFIER ASXL1 ENSG00000171456 Transcript ENST00000497249 protein_coding 1/4 rs2208131 1 cds_start_NF SNV HGNC 18318 ENSP00000451216 UPI00021CEF59 1 0.4093 0.1672 0.4337 0.744 0.3499 0.4356 0.744 EAS 1
chr20_30948281 mutect2 chr20 30948281 30948281 A G 11.000 PASS 0/1 3 1 0 3 0 0 0 3 30948281 1 0 66 Sample1 G intron_variant MODIFIER ASXL1 ENSG00000171456 Transcript ENST00000542461 protein_coding 1/5 rs2208131 1 SNV HGNC 18318 ENSP00000438654 Q6P1M8 UPI0000232A47 1 0.4093 0.1672 0.4337 0.744 0.3499 0.4356 0.744 EAS 1
chr20_30948281 mutect2 chr20 30948281 30948281 A G 11.000 PASS 0/1 3 1 0 3 0 0 0 3 30948281 1 0 66 Sample1 G non_coding_transcript_exon_variant MODIFIER ASXL1 ENSG00000171456 Transcript ENST00000555343 processed_transcript 1/7 462 rs2208131 1 SNV HGNC 18318 1 0.4093 0.1672 0.4337 0.744 0.3499 0.4356 0.744 EAS 1
chr20_30953162 mutect2 chr20 30953162 30953162 T C 15.125 PASS 0/1 4 1 0 4 0 0 4 0 30953162 1 0 121 Sample1 C intron_variant MODIFIER ASXL1 ENSG00000171456 Transcript ENST00000306058 protein_coding 1/11 rs6141294 1 SNV HGNC 18318 ENSP00000305119 A6NIZ6 UPI00015DF8B6 1 0.4087 0.1672 0.4337 0.744 0.3499 0.4325 0.744 EAS 1
chr20_30953162 mutect2 chr20 30953162 30953162 T C 15.125 PASS 0/1 4 1 0 4 0 0 4 0 30953162 1 0 121 Sample1 C intron_variant MODIFIER ASXL1 ENSG00000171456 Transcript ENST00000375687 protein_coding 1/12 rs6141294 1 SNV HGNC 18318 YES CCDS13201.1 ENSP00000364839 Q8IXJ9 UPI000036702C 1 0.4087 0.1672 0.4337 0.744 0.3499 0.4325 0.744 EAS 1
chr20_30953162 mutect2 chr20 30953162 30953162 T C 15.125 PASS 0/1 4 1 0 4 0 0 4 0 30953162 1 0 121 Sample1 C intron_variant MODIFIER ASXL1 ENSG00000171456 Transcript ENST00000375689 protein_coding 1/4 rs6141294 1 SNV HGNC 18318 ENSP00000364841 Q5JWS8 UPI00004A2D4D 1 0.4087 0.1672 0.4337 0.744 0.3499 0.4325 0.744 EAS 1
chr20_30953162 mutect2 chr20 30953162 30953162 T C 15.125 PASS 0/1 4 1 0 4 0 0 4 0 30953162 1 0 121 Sample1 C upstream_gene_variant MODIFIER ASXL1 ENSG00000171456 Transcript ENST00000470145 processed_transcript rs6141294 1025 1 SNV HGNC 18318 1 0.4087 0.1672 0.4337 0.744 0.3499 0.4325 0.744 EAS 1

9.2 superFreq and varikodno::import_goi_superfreq to combine SNVs/short insertions over time

If you have data collected for the same patient over time you can use software like superFreq to track changing clones over the course of treatment for example. Figure 1.1 shows the input necessary for superFreq and full documentation/manuscript/examples are available on GitHub https://github.com/ChristofferFlensburg/superFreq. The chunk below is an example taken from one of the runs that I did for RNA samples, setting the mode = RNA option. There main input for superFreq are: VCF files (in my case normally from VarScan); bamfiles (including Panel Of Normal (PON) samples); and metadata information (see example in Table ??).

superFreq will create two main output folder specified when running the function: plots = './superFreq/plots' and rdir = './superFreq/R'.

You can then use varikondo::import_goi_superfreq to variants found on genes of interest into R!

import_sf <- varikondo::import_goi_superfreq(superFreq_R_path = superFreq_R_path,
                                 superFreq_meta_path = superFreq_meta_path,
                                 studyGenes,
                                 patientID = "D1",
                                 ref_genome = "hg38",
                                 VAFcut = 0.15)

9.3 import_any: combine variants over time for any caller

In the pipeline (Figure 1.1) that I applied to the RNA-Seq cohorts that I was involved with, I cann INDELs with VarDict and km for every single sample independently and then filter and curate the results. superFreq only supports point mutations and short INDELs which is why I won’t use for more complex INDELs reported in VCF files by VarDict. My strategy to combine and filter variants within the same patient is to filter them simultaneusly and to retain only those variants which passes specific threshold at one time points. The rational being here that if a variant is too low to be independently detected at one time point it might still be kept if it is present at a good level at a different time.

To accomplish this, I use the varikondo::import_any function which takes in input a data frame, as created by import_vcf_output and filter variants by patient. Here is an example on how to jointly use import_vcf_output and import_any to obtain a final filtered and combine version of VarDict calls for one patients with samples over time.