Streamlining the Genomics Processing Pipeline via Named Pipes and Persistent Spark Datasets
Author: | Walter Blair, Larry Davis, Paul E. Anderson, Leonardo de Melo Joao |
---|---|
Year of publication: | 2017 |
Subject: | Unix; Computer science; Computational genomics; Modular design; Pipeline (software); Bottleneck; Pipeline transport; Spark (mathematics); Operating system; Named pipe; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020202 computer hardware & architecture; 020204 information systems |
Source: | BIBE |
DOI: | 10.1109/bibe.2017.00-82 |
Description: | In this paper we investigate the use of Unix named pipes and an in-memory datagrid to reduce the I/O requirements of conventional and exploratory genomics processing pipelines. Apache Spark provides an in-memory framework for distributed computational genomics that has realized significant improvements over conventional pipelines in speed and flexibility. Even in the Spark framework, however, pipeline components create I/O bottlenecks by reading and writing intermediate files that are later discarded. Apache Ignite provides a framework for persisting a Spark dataset in memory between modular pipeline applications, and Unix named pipes have long provided a mechanism by which data can be transferred in memory. We compared the runtime performance of a standard genomics pipeline that transmits Spark data using named pipes and/or Ignite's in-memory datagrid. Our results demonstrate that Ignite can improve the runtime performance of in-memory RDD actions and that keeping pipeline components in memory with Ignite and named pipes eliminates a major I/O bottleneck. |
Database: | OpenAIRE |
External link: |
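
The description above mentions Unix named pipes as a way to pass data between pipeline components without writing an intermediate file to disk. The following is a minimal sketch of that general mechanism only, not the authors' pipeline: the function names, the FIFO file name, and the sample records are hypothetical, and both stages run as threads in one process for brevity, whereas a real pipeline would typically connect separate programs. It assumes a Unix-like system, since `os.mkfifo` is not available on Windows.

```python
import os
import tempfile
import threading


def produce(fifo_path, records):
    """Hypothetical upstream stage: write one record per line into the FIFO."""
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with open(fifo_path, "w") as fifo:
        for record in records:
            fifo.write(record + "\n")


def consume(fifo_path):
    """Hypothetical downstream stage: read lines until the writer closes."""
    with open(fifo_path) as fifo:
        return [line.rstrip("\n") for line in fifo]


def run_two_stage_pipeline(records):
    """Connect the two stages through a named pipe instead of a regular file."""
    with tempfile.TemporaryDirectory() as workdir:
        fifo_path = os.path.join(workdir, "stage1_to_stage2.fifo")
        os.mkfifo(fifo_path)  # creates the named pipe entry in the filesystem
        writer = threading.Thread(target=produce, args=(fifo_path, records))
        writer.start()
        received = consume(fifo_path)  # data flows through a kernel buffer
        writer.join()
    return received


if __name__ == "__main__":
    # Illustrative records only; not data from the paper.
    sample = ["read1\tACGT", "read2\tTTGA"]
    print(run_two_stage_pipeline(sample))
```

The named pipe has a path like a file, so a downstream tool that expects a file name can read from it unchanged, while the bytes themselves pass through a kernel buffer rather than being stored on disk.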