9 varikondo
and superFreq
: Standardise and combine variants over time
We are now at the stage where variants are called and we have loads of VCF
files/ exel files. The scripts in the previous sections helps with calling variants using 4 callers (VarDict, Varscan, MuTect2, km and freebayes).
At this stage, depending on how one needs to use the variants, there are several options discussed below. To make it easier to import variant calls into R and store them in data frame I developed the R package varikondo
which hopefully will help you to find the spark joy in your variants!
Below are some applications but for a more in depth description of its functionalities see the package website https://annaquaglieri16.github.io/varikondo/articles/vignette.html.
9.1 parse_vcf_output
: read in your single sample VCF file in R
If you simply want to read a VCF file in R you can use parse_vcf_output
which currently support tumour-only variants from GATK MuTect2
, VarDict
, VarScan
and Freebayes
.
library(varikondo)
annot_vcf_mutect <- system.file("extdata", "chr20_mutect.vcf.gz", package = "varikondo")
annot_vcf_mutect
## [1] "/Users/quaglieri.a/Library/R/3.5/library/varikondo/extdata/chr20_mutect.vcf.gz"
parsed_vcf_mutect <- varikondo::parse_vcf_output(vcf_path = annot_vcf_mutect,
caller = "mutect",
sample_name = "Sample1",
vep = TRUE)
knitr::kable(parsed_vcf_mutect[1:10,],caption = "Parsed MuTect2 output, from VCF to data frame.")
Location | caller | chrom | pos | end | ref | alt | qual | filter | genotype | tot_depth | VAF | ref_depth | alt_depth | ref_forw | ref_rev | alt_forw | alt_rev | start | width | ref_base_quality | alt_base_quality | SampleName | Allele | Consequence | IMPACT | SYMBOL | Gene | Feature_type | Feature | BIOTYPE | EXON | INTRON | HGVSc | HGVSp | cDNA_position | CDS_position | Protein_position | Amino_acids | Codons | Existing_variation | DISTANCE | STRAND | FLAGS | VARIANT_CLASS | SYMBOL_SOURCE | HGNC_ID | CANONICAL | TSL | APPRIS | CCDS | ENSP | SWISSPROT | TREMBL | UNIPARC | GENE_PHENO | SIFT | PolyPhen | DOMAINS | AF | AFR_AF | AMR_AF | EAS_AF | EUR_AF | SAS_AF | AA_AF | EA_AF | ExAC_AF | ExAC_Adj_AF | ExAC_AFR_AF | ExAC_AMR_AF | ExAC_EAS_AF | ExAC_FIN_AF | ExAC_NFE_AF | ExAC_OTH_AF | ExAC_SAS_AF | MAX_AF | MAX_AF_POPS | CLIN_SIG | SOMATIC | PHENO | PUBMED | MOTIF_NAME | MOTIF_POS | HIGH_INF_POS | MOTIF_SCORE_CHANGE | IMPACT_rank |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
chr20_30948281 | mutect2 | chr20 | 30948281 | 30948281 | A | G | 11.000 | PASS | 0/1 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 3 | 30948281 | 1 | 0 | 66 | Sample1 | G | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000306058 | protein_coding | 1/11 | rs2208131 | 1 | SNV | HGNC | 18318 | ENSP00000305119 | A6NIZ6 | UPI00015DF8B6 | 1 | 0.4093 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4356 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||
chr20_30948281 | mutect2 | chr20 | 30948281 | 30948281 | A | G | 11.000 | PASS | 0/1 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 3 | 30948281 | 1 | 0 | 66 | Sample1 | G | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000375687 | protein_coding | 1/12 | rs2208131 | 1 | SNV | HGNC | 18318 | YES | CCDS13201.1 | ENSP00000364839 | Q8IXJ9 | UPI000036702C | 1 | 0.4093 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4356 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||
chr20_30948281 | mutect2 | chr20 | 30948281 | 30948281 | A | G | 11.000 | PASS | 0/1 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 3 | 30948281 | 1 | 0 | 66 | Sample1 | G | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000375689 | protein_coding | 1/4 | rs2208131 | 1 | SNV | HGNC | 18318 | ENSP00000364841 | Q5JWS8 | UPI00004A2D4D | 1 | 0.4093 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4356 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||
chr20_30948281 | mutect2 | chr20 | 30948281 | 30948281 | A | G | 11.000 | PASS | 0/1 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 3 | 30948281 | 1 | 0 | 66 | Sample1 | G | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000497249 | protein_coding | 1/4 | rs2208131 | 1 | cds_start_NF | SNV | HGNC | 18318 | ENSP00000451216 | UPI00021CEF59 | 1 | 0.4093 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4356 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||
chr20_30948281 | mutect2 | chr20 | 30948281 | 30948281 | A | G | 11.000 | PASS | 0/1 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 3 | 30948281 | 1 | 0 | 66 | Sample1 | G | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000542461 | protein_coding | 1/5 | rs2208131 | 1 | SNV | HGNC | 18318 | ENSP00000438654 | Q6P1M8 | UPI0000232A47 | 1 | 0.4093 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4356 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||
chr20_30948281 | mutect2 | chr20 | 30948281 | 30948281 | A | G | 11.000 | PASS | 0/1 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | 3 | 30948281 | 1 | 0 | 66 | Sample1 | G | non_coding_transcript_exon_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000555343 | processed_transcript | 1/7 | 462 | rs2208131 | 1 | SNV | HGNC | 18318 | 1 | 0.4093 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4356 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||||
chr20_30953162 | mutect2 | chr20 | 30953162 | 30953162 | T | C | 15.125 | PASS | 0/1 | 4 | 1 | 0 | 4 | 0 | 0 | 4 | 0 | 30953162 | 1 | 0 | 121 | Sample1 | C | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000306058 | protein_coding | 1/11 | rs6141294 | 1 | SNV | HGNC | 18318 | ENSP00000305119 | A6NIZ6 | UPI00015DF8B6 | 1 | 0.4087 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4325 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||
chr20_30953162 | mutect2 | chr20 | 30953162 | 30953162 | T | C | 15.125 | PASS | 0/1 | 4 | 1 | 0 | 4 | 0 | 0 | 4 | 0 | 30953162 | 1 | 0 | 121 | Sample1 | C | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000375687 | protein_coding | 1/12 | rs6141294 | 1 | SNV | HGNC | 18318 | YES | CCDS13201.1 | ENSP00000364839 | Q8IXJ9 | UPI000036702C | 1 | 0.4087 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4325 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||
chr20_30953162 | mutect2 | chr20 | 30953162 | 30953162 | T | C | 15.125 | PASS | 0/1 | 4 | 1 | 0 | 4 | 0 | 0 | 4 | 0 | 30953162 | 1 | 0 | 121 | Sample1 | C | intron_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000375689 | protein_coding | 1/4 | rs6141294 | 1 | SNV | HGNC | 18318 | ENSP00000364841 | Q5JWS8 | UPI00004A2D4D | 1 | 0.4087 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4325 | 0.744 | EAS | 1 | |||||||||||||||||||||||||||||||||||||
chr20_30953162 | mutect2 | chr20 | 30953162 | 30953162 | T | C | 15.125 | PASS | 0/1 | 4 | 1 | 0 | 4 | 0 | 0 | 4 | 0 | 30953162 | 1 | 0 | 121 | Sample1 | C | upstream_gene_variant | MODIFIER | ASXL1 | ENSG00000171456 | Transcript | ENST00000470145 | processed_transcript | rs6141294 | 1025 | 1 | SNV | HGNC | 18318 | 1 | 0.4087 | 0.1672 | 0.4337 | 0.744 | 0.3499 | 0.4325 | 0.744 | EAS | 1 |
9.2 superFreq
and varikodno::import_goi_superfreq
to combine SNVs/short insertions over time
If you have data collected for the same patient over time you can use software like superFreq
to track changing clones over the course of treatment for example. Figure 1.1 shows the input necessary for superFreq
and full documentation/manuscript/examples are available on GitHub https://github.com/ChristofferFlensburg/superFreq. The chunk below is an example taken from one of the runs that I did for RNA samples, setting the mode = RNA
option. There main input for superFreq
are: VCF files (in my case normally from VarScan); bamfiles (including Panel Of Normal (PON) samples); and metadata information (see example in Table ??).
superFreq
will create two main output folder specified when running the function: plots = './superFreq/plots'
and rdir = './superFreq/R'
.
You can then use varikondo::import_goi_superfreq
to variants found on genes of interest into R!
import_sf <- varikondo::import_goi_superfreq(superFreq_R_path = superFreq_R_path,
superFreq_meta_path = superFreq_meta_path,
studyGenes,
patientID = "D1",
ref_genome = "hg38",
VAFcut = 0.15)
9.3 import_any
: combine variants over time for any caller
In the pipeline (Figure 1.1) that I applied to the RNA-Seq cohorts that I was involved with, I cann INDELs with VarDict
and km
for every single sample independently and then filter and curate the results. superFreq
only supports point mutations and short INDELs which is why I won’t use for more complex INDELs reported in VCF files by VarDict
. My strategy to combine and filter variants within the same patient is to filter them simultaneusly and to retain only those variants which passes specific threshold at one time points. The rational being here that if a variant is too low to be independently detected at one time point it might still be kept if it is present at a good level at a different time.
To accomplish this, I use the varikondo::import_any
function which takes in input a data frame, as created by import_vcf_output
and filter variants by patient. Here is an example on how to jointly use import_vcf_output
and import_any
to obtain a final filtered and combine version of VarDict calls for one patients with samples over time.