Distributed de novo assembler for large-scale long-read datasets

Authors:	Sayan Goswami, Kisung Lee, Seung-Jong Park
Publication year:	2020
Subjects:	Health sciences; Computer science; Big data; Process (computing); Word error rate; Genomics; Parallel computing; Genome; DNA sequencing; Medical and health sciences; Clinical medicine; Oncology & carcinogenesis; Middleware (distributed applications); Nanopore sequencing; Developmental biology
Source:	IEEE BigData
DOI:	10.1109/bigdata50022.2020.9377979
Description:	Third-generation DNA sequencing technologies such as single-molecule real-time sequencing (SMRT) and nanopore sequencing have the potential to fill the gaps in the existing genome databases since the raw sequences produced by these machines are much longer than those of previous generations and therefore result in more contiguous assemblies. However, these long reads have a high error rate, which makes the assembly process computationally challenging. Moreover, since existing long-read assemblers are designed to run on a single machine, they either take days to complete or run out of memory on even moderate-sized datasets. In this paper, we present a distributed long-read assembler that can assemble large-scale noisy sequence datasets on thousands of cores, resulting in orders of magnitude faster assembly times. By effectively using the map-reduce computation model with a distributed hash-map, both built using a high-performance active messaging middleware, we can assemble a PacBio human genome dataset with 139 billion base pairs (about 130 GB) in about 33 minutes (using 2,560 cores) compared to more than 38 hours (using 28 cores) with the current state-of-the-art assembler.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::1a7e52844d1a8b1e90f85a581154dd25 https://doi.org/10.1109/bigdata50022.2020.9377979
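
The description mentions a map-reduce computation model combined with a distributed hash-map. The record does not give the authors' actual algorithm, so the following is only a minimal single-process sketch of that general pattern, under stated assumptions: reads are compared by exact shared k-mers, each k-mer is assigned to a partition by a CRC32 hash, and the partitions stand in for the nodes of a distributed hash-map. All function names and parameters (`owner`, `map_phase`, `reduce_phase`, `candidate_overlaps`, `k`, `n_parts`, `min_shared`) are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations
import zlib


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def owner(kmer, n_parts):
    """Assumed partitioning rule: a deterministic hash picks the owning partition."""
    return zlib.crc32(kmer.encode()) % n_parts


def map_phase(reads, k, n_parts):
    """Map: send each (k-mer, read id) pair to the partition that owns the k-mer.

    Each partition is a plain dict here; in a real distributed setting it
    would live on a separate node and the insert would be a remote message.
    """
    parts = [defaultdict(set) for _ in range(n_parts)]
    for read_id, seq in reads.items():
        for km in kmers(seq, k):
            parts[owner(km, n_parts)][km].add(read_id)
    return parts


def reduce_phase(parts):
    """Reduce: for every pair of reads, count the distinct k-mers they share."""
    shared = defaultdict(int)
    for table in parts:
        for read_ids in table.values():
            for a, b in combinations(sorted(read_ids), 2):
                shared[(a, b)] += 1
    return dict(shared)


def candidate_overlaps(reads, k=4, n_parts=3, min_shared=2):
    """Return read pairs sharing at least min_shared distinct k-mers."""
    counts = reduce_phase(map_phase(reads, k, n_parts))
    return {pair for pair, n in counts.items() if n >= min_shared}
```

For example, calling `candidate_overlaps({"r1": "ACGTACGGA", "r2": "TACGGATTC", "r3": "GGGGGGGGG"})` reports only the pair `("r1", "r2")`, and the result does not depend on `n_parts`, because every occurrence of a given k-mer is routed to the same partition. Real long reads are noisy, so a practical assembler would need error-tolerant seeds and an alignment step rather than exact k-mer matching alone.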