Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters

Authors:	Ching-Hsiang Chu, Dhabaleswar K. Panda, Akshay Venkatesh, Hari Subramoni, Bracy Elton, Khaled Hamidouche
Publication year:	2016
Subject:	020203 distributed computing; Remote direct memory access; Multicast; Computer science; InfiniBand; 020206 networking & telecommunications; 02 engineering and technology; Parallel computing; CUDA; Computer architecture; Scalability; 0202 electrical engineering, electronic engineering, information engineering; Graphics; Latency (engineering); PCI Express
Source:	SBAC-PAD
Description:	High-performance streaming applications are beginning to leverage the compute power of graphics processing units (GPUs) and the high network throughput of high-performance interconnects such as InfiniBand (IB) to boost their performance and scalability. These applications rely heavily on broadcast operations to move data, which is stored in host memory, from a single source (typically live) to multiple GPU-based computing sites. While homogeneous broadcast designs take advantage of the IB hardware multicast feature to boost their performance, their heterogeneous counterparts require explicit data movement between host and GPU, which significantly hurts overall performance. There is a dearth of efficient heterogeneous broadcast designs for streaming applications, especially on emerging multi-GPU configurations. In this work, we propose novel techniques that jointly take full advantage of NVIDIA GPUDirect RDMA (GDR), CUDA inter-process communication (IPC), and IB hardware multicast to design high-performance heterogeneous broadcast operations for modern multi-GPU systems. We propose intra-node, topology-aware schemes to maximize the performance benefits while minimizing the use of valuable PCIe resources. Further, we optimize the communication pipeline by overlapping the GDR + IB hardware multicast operations with CUDA IPC operations. Compared to existing solutions, our designs show up to a 3X improvement in the latency of a heterogeneous broadcast operation. Our designs also show up to a 67% improvement in the execution time of a streaming benchmark on a GPU-dense Cray CS-Storm system with 88 GPUs.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::09e1850ae43387f5edf0f562fb0cb0a0 https://doi.org/10.1109/sbac-pad.2016.16