9 varikondo and superFreq: Standardise and combine variants over time
We are now at the stage where variants are called and we have loads of VCF files/ exel files. The scripts in the previous sections helps with calling variants using 4 callers (VarDict, Varscan, MuTect2, km and freebayes).
At this stage, depending on how one needs to use the variants, there are several options discussed below. To make it easier to import variant calls into R and store them in data frame I developed the R package varikondo which hopefully will help you to find the spark joy in your variants!
Below are some applications but for a more in depth description of its functionalities see the package website https://annaquaglieri16.github.io/varikondo/articles/vignette.html.
9.1 parse_vcf_output: read in your single sample VCF file in R
If you simply want to read a VCF file in R you can use parse_vcf_output which currently support tumour-only variants from GATK MuTect2, VarDict, VarScan and Freebayes.
library(varikondo)
annot_vcf_mutect <- system.file("extdata", "chr20_mutect.vcf.gz", package = "varikondo")
annot_vcf_mutect
## [1] "/Users/quaglieri.a/Library/R/3.5/library/varikondo/extdata/chr20_mutect.vcf.gz"
parsed_vcf_mutect <- varikondo::parse_vcf_output(vcf_path = annot_vcf_mutect,
caller = "mutect",
sample_name = "Sample1",
vep = TRUE)
knitr::kable(parsed_vcf_mutect[1:10,],caption = "Parsed MuTect2 output, from VCF to data frame.")
| Location | caller | chrom | pos | end | ref | alt | qual | filter | genotype | tot_depth | VAF | ref_depth | alt_depth | ref_forw | ref_rev | alt_forw | alt_rev | start | width | ref_base_quality | alt_base_quality | SampleName | Allele | Consequence | IMPACT | SYMBOL | Gene | Feature_type | Feature | BIOTYPE | EXON | INTRON | HGVSc | HGVSp | cDNA_position | CDS_position | Protein_position | Amino_acids | Codons | Existing_variation | DISTANCE | STRAND | FLAGS | VARIANT_CLASS | SYMBOL_SOURCE | HGNC_ID | CANONICAL | TSL | APPRIS | CCDS | ENSP | SWISSPROT | TREMBL | UNIPARC | GENE_PHENO | SIFT | PolyPhen | DOMAINS | AF | AFR_AF | AMR_AF | EAS_AF | EUR_AF | SAS_AF | AA_AF | EA_AF | ExAC_AF | ExAC_Adj_AF | ExAC_AFR_AF | ExAC_AMR_AF | ExAC_EAS_AF | ExAC_FIN_AF | ExAC_NFE_AF | ExAC_OTH_AF | ExAC_SAS_AF | MAX_AF | MAX_AF_POPS | CLIN_SIG | SOMATIC | PHENO | PUBMED | MOTIF_NAME | MOTIF_POS | HIGH_INF_POS | MOTIF_SCORE_CHANGE | IMPACT_rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chr20_30948281 | mutect2 | chr20 | 30948281 | 30948281 | A | G | 11.000 | PASS | 0/1 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 3 | 30948281 | 1 | 0 | 66 | Sample1 | G | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000306058 | protein_coding | 1/11 | rs2208131 | 1 | SNV | HGNC | 18318 | ENSP00000305119 | A6NIZ6 | UPI00015DF8B6 | 1 | 0.4093 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4356 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||
| chr20_30948281 | mutect2 | chr20 | 30948281 | 30948281 | A | G | 11.000 | PASS | 0/1 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 3 | 30948281 | 1 | 0 | 66 | Sample1 | G | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000375687 | protein_coding | 1/12 | rs2208131 | 1 | SNV | HGNC | 18318 | YES | CCDS13201.1 | ENSP00000364839 | Q8IXJ9 | UPI000036702C | 1 | 0.4093 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4356 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||
| chr20_30948281 | mutect2 | chr20 | 30948281 | 30948281 | A | G | 11.000 | PASS | 0/1 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 3 | 30948281 | 1 | 0 | 66 | Sample1 | G | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000375689 | protein_coding | 1/4 | rs2208131 | 1 | SNV | HGNC | 18318 | ENSP00000364841 | Q5JWS8 | UPI00004A2D4D | 1 | 0.4093 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4356 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||
| chr20_30948281 | mutect2 | chr20 | 30948281 | 30948281 | A | G | 11.000 | PASS | 0/1 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 3 | 30948281 | 1 | 0 | 66 | Sample1 | G | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000497249 | protein_coding | 1/4 | rs2208131 | 1 | cds_start_NF | SNV | HGNC | 18318 | ENSP00000451216 | UPI00021CEF59 | 1 | 0.4093 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4356 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||
| chr20_30948281 | mutect2 | chr20 | 30948281 | 30948281 | A | G | 11.000 | PASS | 0/1 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 3 | 30948281 | 1 | 0 | 66 | Sample1 | G | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000542461 | protein_coding | 1/5 | rs2208131 | 1 | SNV | HGNC | 18318 | ENSP00000438654 | Q6P1M8 | UPI0000232A47 | 1 | 0.4093 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4356 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||
| chr20_30948281 | mutect2 | chr20 | 30948281 | 30948281 | A | G | 11.000 | PASS | 0/1 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 3 | 30948281 | 1 | 0 | 66 | Sample1 | G | non_coding_transcript_exon_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000555343 | processed_transcript | 1/7 | 462 | rs2208131 | 1 | SNV | HGNC | 18318 | 1 | 0.4093 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4356 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||||
| chr20_30953162 | mutect2 | chr20 | 30953162 | 30953162 | T | C | 15.125 | PASS | 0/1 | 4 | 1 | 0 | 4 | 0 | 0 | 4 | 0 | 30953162 | 1 | 0 | 121 | Sample1 | C | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000306058 | protein_coding | 1/11 | rs6141294 | 1 | SNV | HGNC | 18318 | ENSP00000305119 | A6NIZ6 | UPI00015DF8B6 | 1 | 0.4087 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4325 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||
| chr20_30953162 | mutect2 | chr20 | 30953162 | 30953162 | T | C | 15.125 | PASS | 0/1 | 4 | 1 | 0 | 4 | 0 | 0 | 4 | 0 | 30953162 | 1 | 0 | 121 | Sample1 | C | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000375687 | protein_coding | 1/12 | rs6141294 | 1 | SNV | HGNC | 18318 | YES | CCDS13201.1 | ENSP00000364839 | Q8IXJ9 | UPI000036702C | 1 | 0.4087 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4325 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||
| chr20_30953162 | mutect2 | chr20 | 30953162 | 30953162 | T | C | 15.125 | PASS | 0/1 | 4 | 1 | 0 | 4 | 0 | 0 | 4 | 0 | 30953162 | 1 | 0 | 121 | Sample1 | C | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000375689 | protein_coding | 1/4 | rs6141294 | 1 | SNV | HGNC | 18318 | ENSP00000364841 | Q5JWS8 | UPI00004A2D4D | 1 | 0.4087 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4325 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||
| chr20_30953162 | mutect2 | chr20 | 30953162 | 30953162 | T | C | 15.125 | PASS | 0/1 | 4 | 1 | 0 | 4 | 0 | 0 | 4 | 0 | 30953162 | 1 | 0 | 121 | Sample1 | C | upstream_gene_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000470145 | processed_transcript | rs6141294 | 1025 | 1 | SNV | HGNC | 18318 | 1 | 0.4087 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4325 | 0.744 | EAS | 1 |
9.2 superFreq and varikodno::import_goi_superfreq to combine SNVs/short insertions over time
If you have data collected for the same patient over time you can use software like superFreq to track changing clones over the course of treatment for example. Figure 1.1 shows the input necessary for superFreq and full documentation/manuscript/examples are available on GitHub https://github.com/ChristofferFlensburg/superFreq. The chunk below is an example taken from one of the runs that I did for RNA samples, setting the mode = RNA option. There main input for superFreq are: VCF files (in my case normally from VarScan); bamfiles (including Panel Of Normal (PON) samples); and metadata information (see example in Table ??).
superFreq will create two main output folder specified when running the function: plots = './superFreq/plots' and rdir = './superFreq/R'.
You can then use varikondo::import_goi_superfreq to variants found on genes of interest into R!
import_sf <- varikondo::import_goi_superfreq(superFreq_R_path = superFreq_R_path,
superFreq_meta_path = superFreq_meta_path,
studyGenes,
patientID = "D1",
ref_genome = "hg38",
VAFcut = 0.15)
9.3 import_any: combine variants over time for any caller
In the pipeline (Figure 1.1) that I applied to the RNA-Seq cohorts that I was involved with, I cann INDELs with VarDict and km for every single sample independently and then filter and curate the results. superFreq only supports point mutations and short INDELs which is why I won’t use for more complex INDELs reported in VCF files by VarDict. My strategy to combine and filter variants within the same patient is to filter them simultaneusly and to retain only those variants which passes specific threshold at one time points. The rational being here that if a variant is too low to be independently detected at one time point it might still be kept if it is present at a good level at a different time.
To accomplish this, I use the varikondo::import_any function which takes in input a data frame, as created by import_vcf_output and filter variants by patient. Here is an example on how to jointly use import_vcf_output and import_any to obtain a final filtered and combine version of VarDict calls for one patients with samples over time.