Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters
Authors: | Dhabaleswar K. Panda, Md. Wasi-ur-Rahman, Xiaoyi Lu, Dipti Shankar, Nusrat Sharmin Islam |
---|---|
Year of publication: | 2016 |
Subject: | 020203 distributed computing; Remote direct memory access; business.industry; Computer science; Big data; InfiniBand; 020206 networking & telecommunications; 02 engineering and technology; computer.software_genre; Theoretical Computer Science; Hardware and Architecture; 0202 electrical engineering, electronic engineering, information engineering; Benchmark (computing); Operating system; Lustre (file system); business; computer; Software; Information Systems |
Source: | The Journal of Supercomputing. 72:4573-4600 |
ISSN: | 1573-0484 0920-8542 |
DOI: | 10.1007/s11227-016-1760-5 |
Description: | With the emergence of high-performance data analytics, the Hadoop platform is being increasingly used to process data stored on high-performance computing clusters. While there is immense scope for improving the performance of Hadoop MapReduce (including the network-intensive shuffle phase) over these modern clusters, which are equipped with high-speed interconnects such as InfiniBand and 10/40 GigE, and storage systems such as SSDs and Lustre, it is essential to study the MapReduce component in an isolated manner. In this paper, we study popular MapReduce workloads, obtained from well-accepted, comprehensive benchmark suites, to identify common shuffle data distribution patterns. We determine different environmental and workload-specific factors that affect the performance of the MapReduce job. Based on these characterization studies, we propose a micro-benchmark suite that can be used to evaluate the performance of stand-alone Hadoop MapReduce, and demonstrate its ease of use with different networks/protocols, Hadoop distributions, and storage architectures. Performance evaluations with our proposed micro-benchmarks show that stand-alone Hadoop MapReduce over IPoIB performs better than 10 GigE by about 13–15%, and the RDMA-enhanced hybrid MapReduce design can achieve up to 43% performance improvement over default Hadoop MapReduce over IPoIB, in both shared-nothing and shared storage architectures. |
Database: | OpenAIRE |
External link: |