Benchmarking challenging small variants with linked and long reads

Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, the new benchmark covers 92% of the autosomal GRCh38 assembly, up from 85% of GRCh38 in the previous version, while excluding regions that are problematic for benchmarking small variants, such as copy number variants, and that should not have been included previously. The new benchmark identifies eight times more false negatives in a short-read variant call set than our previous benchmark did. We demonstrate that this benchmark reliably identifies false positives and false negatives across technologies, enabling ongoing methods development.


Other Contributors

  • 1. Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA.
  • 2. Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA.
  • 3. Seven Bridges, Omladinskih brigada 90g, 11070 Belgrade, Republic of Serbia.
  • 4. Children’s Mercy Kansas City, Kansas City, MO, USA.
  • 5. Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, USA.
  • 6. Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
  • 7. New York Genome Center, 101 Avenue of the Americas, New York, NY, USA.
  • 8. University of California at Santa Cruz Genomics Institute, 1156 High Street, Santa Cruz, CA, USA.
  • 9. Department of Computer Science, Stanford University, Stanford, CA 94305, USA.
  • 10. Department of Pathology, Stanford University, Stanford, CA 94305, USA.
  • 11. Department of Genetics, Stanford University, Stanford, CA 94305, USA.
  • 12. Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA.
  • 13. Institute of Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany.
  • 14. Terry Fox Laboratory, BC Cancer Research Institute and Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada.
  • 15. 10X Genomics, Pleasanton, CA 94588, USA.
  • 16. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
  • 17. DNAnexus, Inc., Mountain View, CA 94040, USA.
  • 18. Pacific Biosciences, Menlo Park, CA 94025, USA.
  • 19. Google Inc., 1600 Amphitheatre Pkwy., Mountain View, CA 94040, USA.
  • 20. Joint Initiative for Metrology in Biology, SLAC National Laboratory, Stanford, CA, USA.