Bloomfish: A Highly Scalable Distributed K-mer Counting Framework

Authors:	Yanfei Guo, Michela Taufer, Pietro Cicotti, Bingqiang Wang, Pavan Balaji, Yutong Lu, Tao Gao, Yanjie Wei
Year of publication:	2017
Subject:	0301 basic medicine; Computer science; business.industry; Hash function; Sequence assembly; Genomics; Parallel computing; Supercomputer; Genome; 03 medical and health sciences; chemistry.chemical_compound; ComputingMethodologies_PATTERNRECOGNITION; 030104 developmental biology; chemistry; Analytics; k-mer; Scalability; Memory footprint; business; DNA; Reference genome
Source:	ICPADS
Description:	K-mer counting is a fundamental operation in DNA research and genome analytics; its application includes estimating genome assembly, understanding similarities in genomic samples, and merging a newly processed genome with a reference genome. As the genome dataset becomes larger and larger, designing a highly optimized distributed-memory implementation becomes more and more important. Current distributed-memory solutions have two limitations: they have a high memory footprint, and they do not provide advanced optimizations for loading enormous genome datasets into memory. Based on these observations, we present Bloomfish, a distributed, memory-efficient, scalable solution to the limits of current work. To keep a low memory footprint, Bloomfish leverages the compact hash array design of the single-node Jellyfish system and the optimized workflow of the high-performance MapReduce framework Mimir. We have also codesigned Mimir’s I/O to efficiently load enormous datasets. We ran Bloomfish on the Tianhe-2 supercomputer with large sequence datasets (up to 24 TB). Our results show that Bloomfish achieves unprecedented scalability in genome analytics.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::b60f7a2c00cc6932bafe651269bf4eaf https://doi.org/10.1109/icpads.2017.00033