Bridging the gap between HPC and big data frameworks

Authors: Shaden Smith, Theodore L. Willke, Zheguang Zhao, Subramanya R. Dulloor, Narayanan Sundaram, Mihai Capota, Michael R. Anderson, Nadathur Satish
Year of publication: 2017
Subject:
Source: Proceedings of the VLDB Endowment. 10:901-912
ISSN: 2150-8097
Description: Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are slower by an order of magnitude or more than native implementations written with high-performance computing tools such as MPI. There is a need to bridge this performance gap while retaining the benefits of the Spark ecosystem, such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation from within Spark to an MPI environment provides 3.1–17.7× speedups on the four sparse applications, with all overheads included. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.
Database: OpenAIRE