Streamlining the Genomics Processing Pipeline via Named Pipes and Persistent Spark Datasets

Author:	Walter Blair, Larry Davis, Paul E. Anderson, Leonardo de Melo Joao
Year of publication:	2017
Subject:	Unix; Computer science; business.industry; Computational genomics; 02 engineering and technology; Modular design; computer.software_genre; Pipeline (software); Bottleneck; 020202 computer hardware & architecture; Pipeline transport; 020204 information systems; Spark (mathematics); 0202 electrical engineering, electronic engineering, information engineering; Operating system; Named pipe; business; computer
Source:	BIBE
DOI:	10.1109/bibe.2017.00-82
Description:	In this paper we investigate the use of Unix named pipes and an in-memory datagrid to reduce the I/O requirements of conventional and exploratory genomics processing pipelines. Apache Spark provides an in-memory framework for distributed computational genomics that has realized significant improvements over conventional pipelines in speed and flexibility. Even in the Spark framework, however, pipeline components create I/O bottlenecks by reading and writing intermediate files that are later discarded. Apache Ignite provides a framework for persisting a Spark dataset in memory between modular pipeline applications, and Unix named pipes have long provided a mechanism by which data can be transferred in memory. We compared the runtime performance of a standard genomics pipeline that transmits Spark data using named pipes and/or Ignite's in-memory datagrid. Our results demonstrate that Ignite can improve the runtime performance of in-memory RDD actions and that keeping pipeline components in memory with Ignite and named pipes eliminates a major I/O bottleneck.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::8bf1f708adc422290a4a3b9f0af31933 https://doi.org/10.1109/bibe.2017.00-82
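
The description above notes that Unix named pipes let one pipeline stage hand data to the next without writing an intermediate file to disk. The following is a minimal sketch of that general mechanism only, not the authors' implementation: the function name `stream_through_fifo`, the FIFO file name, and the tab-separated sample records are all hypothetical, and the example uses a thread in place of a separate pipeline process. It assumes a Unix-like system, since `os.mkfifo` is not available on Windows.

```python
import os
import tempfile
import threading


def stream_through_fifo(records):
    """Pass text records from a producer stage to a consumer stage via a named pipe.

    Hypothetical helper for illustration; returns the records as the consumer saw them.
    """
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "stage1_to_stage2.fifo")  # hypothetical name
    os.mkfifo(fifo_path)  # creates the named pipe (Unix only)

    def producer():
        # Opening a FIFO for writing blocks until a reader opens the other end.
        with open(fifo_path, "w") as out:
            for rec in records:
                out.write(rec + "\n")

    writer = threading.Thread(target=producer)
    writer.start()
    try:
        # The consumer reads until the producer closes its end (end of file).
        with open(fifo_path) as src:
            received = [line.rstrip("\n") for line in src]
    finally:
        writer.join()
        os.remove(fifo_path)
        os.rmdir(workdir)
    return received


if __name__ == "__main__":
    sample = ["chr1\t100\tA", "chr1\t101\tC"]  # made-up records
    print(stream_through_fifo(sample))
```

In a real pipeline the producer and consumer would be separate programs that are both given the FIFO path in place of an ordinary file name; the data then flows through a kernel buffer rather than being stored as an intermediate file on disk.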