Bridging the gap between HPC and big data frameworks

Authors:	Shaden Smith, Theodore L. Willke, Zheguang Zhao, Subramanya R. Dulloor, Narayanan Sundaram, Mihai Capota, Michael R. Anderson, Nadathur Satish
Publication year:	2017
Subject:	business.industry; Computer science; Distributed computing; Big data; General Engineering; 02 engineering and technology; Software_PROGRAMMINGTECHNIQUES; Bridging (programming); Analytics; 020204 information systems; Spark (mathematics); 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; business
Source:	Proceedings of the VLDB Endowment. 10:901-912
ISSN:	2150-8097
DOI:	10.14778/3090163.3090168
Description:	Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower than native implementations written with high-performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1–17.7× speedups on the four sparse applications, including all of the overheads. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::3109cf0a1de3d577a25509af94b04510 https://doi.org/10.14778/3090163.3090168