Curated variation benchmarks for challenging medically relevant autosomal genes

The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. For the HG002 sample, this curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations for each of the human reference genomes GRCh37 and GRCh38. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When these false duplications are masked, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.
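
To make the recall figures concrete, here is a minimal sketch. It assumes recall is measured in the usual way, as the fraction of benchmark variants that a call set recovers. The function name and the variant counts are hypothetical, chosen only to reproduce the 8% and 100% endpoints quoted above; they are not taken from the paper's data.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of benchmark variants recovered by a call set."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("benchmark contains no variants")
    return true_positives / total


# Hypothetical counts for a gene affected by a false duplication.
# Before masking, most benchmark variants are missed; after masking,
# all of them are recovered. These numbers are illustrative only.
before_masking = recall(true_positives=8, false_negatives=92)
after_masking = recall(true_positives=100, false_negatives=0)

print(f"{before_masking:.0%} -> {after_masking:.0%}")
```

Under this definition, a variant that the caller misses counts as a false negative, so reads that map to a falsely duplicated copy instead of the true gene locus lower recall directly.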

Nature Biotechnology volume 40, pages 672–680 (2022)


Other Contributors

Justin Wagner, Nathan D. Olson, Lindsay Harris, Jennifer McDaniel, Haoyu Cheng, Arkarachai Fungtammasan, Yih-Chii Hwang, Richa Gupta, Aaron M. Wenger, William J. Rowell, Ziad M. Khan, Jesse Farek, Yiming Zhu, Aishwarya Pisupati, Medhat Mahmoud, Chunlin Xiao, Byunggil Yoo, Sayed Mohammad Ebrahim Sahraeian, Danny E. Miller, David Jáspez, José M. Lorenzo-Salazar, Adrián Muñoz-Barrera, Luis A. Rubio-Rodríguez, Carlos Flores, Giuseppe Narzisi, Uday Shanker Evani, Wayne E. Clarke, Joyce Lee, Christopher E. Mason, Stephen E. Lincoln, Karen H. Miga, Mark T. W. Ebbert, Alaina Shumate, Heng Li, Chen-Shan Chin, Justin M. Zook & Fritz J. Sedlazeck

Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA

Justin Wagner, Nathan D. Olson, Lindsay Harris, Jennifer McDaniel & Justin M. Zook

Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA

Haoyu Cheng & Heng Li

DNAnexus, Inc., Mountain View, CA, USA

Arkarachai Fungtammasan, Yih-Chii Hwang, Richa Gupta & Chen-Shan Chin

Pacific Biosciences, Menlo Park, CA, USA

Aaron M. Wenger & William J. Rowell

Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA

Ziad M. Khan, Jesse Farek, Yiming Zhu, Aishwarya Pisupati, Medhat Mahmoud & Fritz J. Sedlazeck

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

Chunlin Xiao

Genomic Medicine Center, Children’s Mercy Kansas City, Kansas City, MO, USA

Byunggil Yoo

Roche Sequencing Solutions, Santa Clara, CA, USA

Sayed Mohammad Ebrahim Sahraeian

Department of Pediatrics, Division of Genetic Medicine, University of Washington and Seattle Children’s Hospital, Seattle, WA, USA

Danny E. Miller

Department of Genome Sciences, University of Washington, Seattle, WA, USA

Danny E. Miller

Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain

David Jáspez, José M. Lorenzo-Salazar, Adrián Muñoz-Barrera, Luis A. Rubio-Rodríguez & Carlos Flores

CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain

Carlos Flores

Research Unit, Hospital Universitario N.S. de Candelaria, Santa Cruz de Tenerife, Spain

Carlos Flores

New York Genome Center, New York, NY, USA

Giuseppe Narzisi, Uday Shanker Evani & Wayne E. Clarke

Bionano Genomics, San Diego, CA, USA

Joyce Lee

Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA

Christopher E. Mason

Invitae, San Francisco, CA, USA

Stephen E. Lincoln

UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA

Karen H. Miga

Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY, USA

Mark T. W. Ebbert

Department of Internal Medicine, Division of Biomedical Informatics, University of Kentucky, Lexington, KY, USA

Mark T. W. Ebbert

Department of Neuroscience, University of Kentucky, Lexington, KY, USA

Mark T. W. Ebbert

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA

Alaina Shumate

Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA

Alaina Shumate