Bloomfish: A Highly Scalable Distributed K-mer Counting Framework
Authors: | Yanfei Guo, Michela Taufer, Pietro Cicotti, Bingqiang Wang, Pavan Balaji, Yutong Lu, Tao Gao, Yanjie Wei |
---|---|
Year of publication: | 2017 |
Subject: | 0301 basic medicine; Computer science; business.industry; Hash function; Sequence assembly; Genomics; Parallel computing; Supercomputer; Genome; 03 medical and health sciences; chemistry.chemical_compound; ComputingMethodologies_PATTERNRECOGNITION; 030104 developmental biology; chemistry; Analytics; k-mer; Scalability; Memory footprint; business; DNA; Reference genome |
Source: | ICPADS |
Description: | K-mer counting is a fundamental operation in DNA research and genome analytics; its applications include estimating genome assembly, understanding similarities among genomic samples, and merging a newly processed genome with a reference genome. As genome datasets grow, a highly optimized distributed-memory implementation becomes increasingly important. Current distributed-memory solutions have two limitations: they have a high memory footprint, and they do not provide advanced optimizations for loading enormous genome datasets into memory. Based on these observations, we present Bloomfish, a distributed, memory-efficient, scalable solution that addresses these limitations. To keep the memory footprint low, Bloomfish leverages the compact hash array design of the single-node Jellyfish system and the optimized workflow of the high-performance MapReduce framework Mimir. We have also codesigned Mimir’s I/O to load enormous datasets efficiently. We ran Bloomfish on the Tianhe-2 supercomputer with large sequence datasets (up to 24 TB). Our results show that Bloomfish achieves unprecedented scalability in genome analytics. |
Database: | OpenAIRE |
External link: |
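
For readers unfamiliar with the operation the description refers to, the following is a minimal single-process sketch of k-mer counting arranged in a MapReduce-like shape: reads are split into k-mers, each k-mer is routed to a partition by a hash of its bases, and each partition keeps its own counts. It is not Bloomfish's implementation and uses neither Jellyfish's compact hash array nor Mimir; the function names `kmers`, `partition_of`, `count_kmers`, and `merge_counts`, the `num_partitions` parameter, and the choice to skip k-mers containing non-ACGT characters are all assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGT"


def kmers(read, k):
    """Yield every length-k substring of a read (assumption: skip k-mers with non-ACGT symbols)."""
    read = read.upper()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if all(ch in BASES for ch in kmer):
            yield kmer


def partition_of(kmer, num_partitions):
    """Map a k-mer to a partition by encoding it as a base-4 integer (deterministic across runs)."""
    code = 0
    for ch in kmer:
        code = code * 4 + BASES.index(ch)
    return code % num_partitions


def count_kmers(reads, k, num_partitions=4):
    """'Map' reads to k-mers, 'shuffle' them by partition, and 'reduce' by counting per partition."""
    partitions = [Counter() for _ in range(num_partitions)]
    for read in reads:
        for kmer in kmers(read, k):
            partitions[partition_of(kmer, num_partitions)][kmer] += 1
    return partitions


def merge_counts(partitions):
    """Combine per-partition counts; safe because each k-mer lives in exactly one partition."""
    total = Counter()
    for part in partitions:
        total.update(part)
    return total


if __name__ == "__main__":
    example_reads = ["ACGTAC", "CGTACG"]  # hypothetical toy input
    counts = merge_counts(count_kmers(example_reads, k=3))
    for kmer, n in sorted(counts.items()):
        print(kmer, n)
```

Because the partition is a pure function of the k-mer, every occurrence of a given k-mer lands in the same partition, so per-partition counts can be produced independently and never need to be reconciled. In a distributed-memory setting each partition would correspond to a separate process, which is the general pattern the description alludes to.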