Evaluating somatic variant callers remains challenging due to lack of good reference datasets. With that in mind, we have recently sequenced and analyzed high coverage whole genome data from tumor cell lines along with their matched normal cell lines (Arora et al., 2019). Another good strategy for benchmarking somatic single nucleotide variants (SNVs) and small insertions or deletions (INDELs) is “virtual tumors” (Cibulskis et al., 2013). Virtual tumors serve as very well-controlled datasets because the mutations are artificially introduced (“spiked-in”) at known sites and at known frequencies. We generated virtual tumors, and evaluated a number of somatic variant callers as well as NYGC’s somatic pipelines: current version (v6) (composed of MuTect2, Strelka2, Lancet and SvABA for SNV/INDEL calling) and the previous version (v5B) (composed of MuTect, Strelka LoFreq, Scalpel and Pindel for SNV/INDEL calling), using these virtual tumors.

Virtual tumor creation

We used the strategy described in the MuTect paper (Cibulskis et al., 2013), with a few modifications. Here we used whole genome data sequenced on HiSeqX from HapMap samples NA12891 and NA12892. We spiked-in reads from NA12891 into NA12892 at sites that were known germline variants for which NA12891 was homozygous alternate (genotype: 1/1) and NA12892 was homozygous reference (genotype: 0/0). At each of the selected sites, a select number of reads (N) from NA12892 were replaced with N reads from NA12891, where N was determined by a binomial distribution using a specified allelic fraction chosen randomly from values (0.05,0.1,0.2,0.3).

We then ran samtools mpileup (base and mapping quality cutoffs of 5) on the NA12891 reads that were spiked-in to identify variants in those reads. Any variant that was called this way and that also overlapped with the sites that were intentionally spiked-in were treated as the truth set. On the other hand, unintentionally spiked-in sites (which could be from germline mutations that were on the same reads as the mutation that was intentionally spiked-in) were ignored, and so were the sites that were spiked-in but not detected with mpileup.

Figure 1: Virtual tumors as benchmarking datasets for evaluation of somatic variant callers.
Strategies for (A) creation of virtual tumors and (B) evaluation of somatic variant callers run on them.

We created two separate virtual tumors for spiked-in SNVs and INDELs. There were 31,138 SNVs in the truth set of the SNV virtual tumor, and 4,930 INDELs in the INDEL virtual tumor.

Figure 2: Variant allele frequency (VAF) distribution of the spiked-in variants.
Histograms of VAF of spiked-in (A) SNVs (n=31,138) and (B) INDELs (n=4,930) in the respective virtual tumors.The colors represent the allelic fractions used as probabilities (p) for the binomial distribution used to spike the variants.


For any caller or pipeline,
True positives (TP) = the variants that were called and also in the truth set.
False positives (FP) = variants that were called but were neither in the truth set nor in the set of sites to be ignored.
False negatives (FN) = variants in the truth set that were not called.
Precision = TPTP+FP

Recall = TPTP+FN

F1 score = 2.Precision.RecallPrecision+Recall

Amongst the variant callers that we evaluated, Strelka2 (v2.9.3) and Lancet (v1.0.1) were the two most accurate (highest F1 score) callers for both SNV and INDEL virtual tumors. Overall, NYGC v6 pipeline’s AllSomatic callset was the most sensitive, and HighConfidence callset was the most accurate for both SNVs and INDELs.

Figure 3: Performance summary on SNV virtual tumor.
*Note: The NYGC pipelines were run without the common germline and panel of normal filtering steps.

Figure 4: Performance of callers and NYGC pipelines on somatic INDELs virtual tumor.
*Note: The NYGC pipelines were run without the common germline and panel of normal filtering steps.

Most of the false negative calls were low VAF variants (Figure 5). The newer versions of tools and our current pipeline are much more sensitive at calling lower frequency variants compared to their predecessors. The improvement in sensitivity in detecting low frequency INDELs is especially striking.

Figure 5: Number of true positive variants called at different variant allele frequency bins.

It’s important to note here that we excluded v6 pipeline’s common germline and panel of normal filtering steps (and likewise v5B pipeline’s known germline and blacklist filtering steps) for this evaluation, since the spiked-in variants were all common germline variants. For real tumor-normal data, we believe that the precision of the pipelines would be even better when these steps are run.

To exemplify the importance of NYGC v6 pipeline’s filtering steps, we ran our pipeline on 80X and 40X data from NA12892 from different sequencing lanes, treating them as “tumor” and “normal” respectively. Any variant called on this pairing would be a false positive. Figure 6 shows the number of variants called in v6 pipeline’s AllSomatic and HighConfidence datasets without and with the filtering steps. For both SNVs and INDELs, we see a remarkable reduction in the number of false positive calls after filtering, especially in the AllSomatic callset.

Figure 6: Reduction of false positive calls by the custom filtering steps in NYGC’s v6 pipeline.
(A) Virtual tumor for evaluating false positive calls was generated by using NA12892 data from different sequencing lanes. (B) Number of variants called in AllSomatic and HighConfidence callsets without and with the pipeline’s filtering steps.


Conclusion and Remarks

The virtual tumor approach, which was proposed several years ago, continues to be an effective way of benchmarking somatic SNV and INDEL callers and pipelines. This, along with the high coverage tumor-normal cell lines have proven to be very useful datasets for us for the purpose of benchmarking cancer tools and pipelines. We continue to generate new virtual tumors for different sequencing platforms/technologies used in-house, and to run our pipelines on them to check for their compatibility.
We find that our current pipeline, which combines different variant callers, outperforms any individual caller on the virtual tumor dataset. The performance results are very similar to what we found with the cancer cell lines data, which gives us the confidence that virtual tumors are good datasets for benchmarking.


This work is part of a larger effort in NYGC’s CompBio group of cancer pipeline development and improvement. The other CompBio members that were involved in this effort include Minita Shah, Rashesh Sanghvi, Molly Johnson, Jennifer Shelton, Giuseppe Narzisi and Nicolas Robine.


  • Arora, K., Shah, M., Johnson, M., Sanghvi, R., Shelton, J., Nagulapalli, K., … Robine, N. (2019). Deep whole-genome sequencing of 3 cancer cell lines on 2 sequencing platforms. Scientific reports, 9(1), 19123. doi:10.1038/s41598-019-55636-3
  • Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., … Getz, G. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature biotechnology, 31(3), 213–219. doi:10.1038/nbt.2514
  • Narzisi, G., Corvelo, A., Arora, K., Bergmann, E. A., Shah, M., Musunuri, R., … Zody, M. C. (2018). Genome-wide somatic variant calling using localized colored de Bruijn graphs. Communications biology, 1, 20. doi:10.1038/s42003-018-0023-9
  • Kim, S., Scheffler, K., Halpern, A. L., Bekritsky, M. A., Noh, E., Källberg, M., … Saunders, C. T. (2018). Strelka2: fast and accurate calling of germline and somatic variants. Nature Methods, 15(8), 591–594. doi: 10.1038/s41592-018-0051-x
  • Saunders, C. T., Wong, W. S. W., Swamy, S., Becq, J., Murray, L. J., & Cheetham, R. K. (2012). Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. Bioinformatics, 28(14), 1811–1817. doi: 10.1093/bioinformatics/bts271
  • Ye, K., Schulz, M. H., Long, Q., Apweiler, R., & Ning, Z. (2009). Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics (Oxford, England), 25(21), 2865–2871. doi:10.1093/bioinformatics/btp394
  • Documentation on NYGC’s cancer pipelines