Bridging the gap between HPC and big data frameworks
Authors: | Shaden Smith, Theodore L. Willke, Zheguang Zhao, Subramanya R. Dulloor, Narayanan Sundaram, Mihai Capota, Michael R. Anderson, Nadathur Satish |
---|---|
Year of publication: | 2017 |
Subject: | business.industry; Computer science; Distributed computing; Big data; General Engineering; 02 engineering and technology; Software_PROGRAMMINGTECHNIQUES; Bridging (programming); Analytics; 020204 information systems; Spark (mathematics); 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; business |
Source: | Proceedings of the VLDB Endowment. 10:901-912 |
ISSN: | 2150-8097 |
Description: | Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower compared to native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1–17.7× speedups on the four sparse applications, including all of the overheads. This opens up an avenue to reuse existing MPI libraries in Spark with little effort. |
Database: | OpenAIRE |
External link: |