Evaluation of Mapping and Germline Variant Calling Pipelines on Australian High-Performance Computing Facilities Report

Authors: Samaha, Georgina; Chew, Tracy; Willet, Cali; Beecroft, Sarah; Davis, Brian; Sadsad, Rosemarie
Language: English
Year of publication: 2022
Subject:
DOI: 10.5281/zenodo.6930813
Description: Life scientists are increasingly using whole genome sequencing (WGS) to ask and answer research questions across the tree of life. Whole genome sequencing is the largest and most commonly stored data type across 27 organisations surveyed in Australia. Of these organisations, 82% use command-line interface platforms, and as sequencing becomes more affordable, an increasing number of life scientists are using these technologies at scale. Processing WGS data is a computationally challenging, multi-step process used to create a map of an individual’s genome and identify genetic variant sites. To do this work, Australian life scientists seek best practice pipelines that are accurate, highly accessible, well documented, and accompanied by user support or training. Researchers experience challenges with deploying these pipelines on local HPC infrastructures at scale, largely because many of the best practice tools and workflows they use are not developed for large-scale use or for HPC usage paradigms. Recently, scalable pipelines to process and analyse WGS data have been developed and made publicly available. Scalability of these pipelines on HPC infrastructure is achieved by re-engineering best practice pipelines to efficiently utilise compute hardware and by replacing recommended tools with tools that are more computationally performant. Proper technical evaluation is required to determine whether their biological accuracy is maintained alongside improvements in computational efficiency. Here, we report the biological accuracy, technical performance, and user experience of two scalable WGS pipelines that perform short read mapping to a reference genome assembly and germline short variant discovery of single nucleotide variants and insertions and deletions. Both workflows are implementations of the Broad Institute’s best practices pipelines and are widely adopted in the community.
We used the gold standard Platinum Genomes datasets to report metrics evaluating (1) NVIDIA’s Clara Parabricks GPU-enabled pipelines (NVIDIA 2020) and (2) the Sydney Informatics Hub’s (SIH) scalable multi-node CPU pipelines, both deployed on Gadi, the National Computational Infrastructure’s HPC system.
This national project involves a range of partners and is sponsored by Australian BioCommons and Australian Research Data Commons (ARDC). The Australian BioCommons is supported by Bioplatforms Australia. Bioplatforms Australia and ARDC are enabled by NCRIS.
Database: OpenAIRE