Streamlining the Genomics Processing Pipeline via Named Pipes and Persistent Spark Datasets

Authors: Walter Blair, Larry Davis, Paul E. Anderson, Leonardo de Melo Joao
Year of publication: 2017
Subject:
Source: BIBE
DOI: 10.1109/bibe.2017.00-82
Description: In this paper we investigate the use of Unix named pipes and an in-memory datagrid to reduce the I/O requirements of conventional and exploratory genomics processing pipelines. Apache Spark provides an in-memory framework for distributed computational genomics that has realized significant improvements over conventional pipelines in speed and flexibility. Even in the Spark framework, however, pipeline components create I/O bottlenecks by reading and writing intermediate files that are later discarded. Apache Ignite provides a framework for persisting a Spark dataset in memory between modular pipeline applications, and Unix named pipes have long provided a mechanism by which data can be transferred in memory. We compared the runtime performance of a standard genomics pipeline that transmits Spark data using named pipes and/or Ignite's in-memory datagrid. Our results demonstrate that Ignite can improve the runtime performance of in-memory RDD actions and that keeping pipeline components in memory with Ignite and named pipes eliminates a major I/O bottleneck.
Database: OpenAIRE
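
Illustration: the description notes that Unix named pipes let one pipeline stage hand data to the next without writing an intermediate file. The minimal Python sketch below shows that general mechanism only. It is not the authors' implementation, it does not involve Spark or Ignite, and the function name `stream_through_fifo`, the file name `stage.fifo`, and the sample records are hypothetical. It assumes a POSIX system, since `os.mkfifo` is unavailable on Windows.

```python
import os
import tempfile
import threading


def stream_through_fifo(records):
    """Pass text records from a producer stage to a consumer stage via a named pipe.

    Hypothetical helper for illustration; requires a POSIX system.
    """
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "stage.fifo")  # hypothetical name
    os.mkfifo(fifo_path)  # creates the pipe entry; no data is stored in a regular file

    def producer():
        # Opening for writing blocks until the consumer opens the read end.
        with open(fifo_path, "w") as out:
            for rec in records:
                out.write(rec + "\n")

    writer = threading.Thread(target=producer)
    writer.start()

    received = []
    # The consumer reads lines as they arrive and sees EOF once the producer closes.
    with open(fifo_path) as inp:
        for line in inp:
            received.append(line.rstrip("\n"))

    writer.join()
    os.remove(fifo_path)
    os.rmdir(workdir)
    return received


if __name__ == "__main__":
    # Hypothetical tab-separated records standing in for intermediate pipeline output.
    print(stream_through_fifo(["chr1\t100\tA", "chr2\t200\tG"]))
```

In a real pipeline the producer and consumer would typically be separate processes, each opening the same pipe path, so the downstream tool starts consuming while the upstream tool is still producing.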