Popis: |
In the post-Moore's-law era, the traditional general-purpose CPU can no longer keep pace with the computing power demanded by modern compute-intensive and highly parallelizable applications. In this context, various accelerator architectures, such as the tensor processing unit (TPU), field-programmable gate array (FPGA), and graphics processing unit (GPU), are being designed to meet these high computational demands. Notably, the GPU has been widely adopted in high-performance computing (HPC) and cloud systems to significantly accelerate numerous scientific and emerging machine/deep learning (ML/DL) applications. To obtain more computing power, researchers and engineers are building large-scale GPU clusters, i.e., scaling out. Moreover, the recent advent of high-speed interconnect technologies such as NVIDIA NVLink and AMD Infinity Fabric enables the deployment of dense GPU systems, i.e., scaling up. As a result, six of the top 10 supercomputers, as of July 2020, are powered by thousands of NVIDIA GPUs with NVLink and InfiniBand networks. Driven by these ever-larger GPU systems, GPU-aware Message Passing Interface (MPI) has become the standard programming model for developing GPU-enabled parallel applications. However, state-of-the-art GPU-aware MPI libraries are predominantly optimized by leveraging advanced networking technologies such as Remote Direct Memory Access (RDMA), not by exploiting the GPUs' computational power. There is a dearth of research on designing GPU-enabled communication middleware that efficiently handles end-to-end networking while harnessing the computational power of the accelerators. In this thesis, we take the GPU as an example to demonstrate how to design accelerator-enabled communication middleware that harnesses hardware computational resources and cutting-edge interconnects for high-performance and scalable communication on modern and next-generation heterogeneous HPC systems.
Specifically, this thesis addresses three primary communication patterns: 1) scalable one-to-all broadcast operations that leverage low-level hardware multicast and GPUDirect RDMA features; 2) topology-aware, link-efficient, and cooperative GPU-driven schemes that significantly accelerate all-to-one and all-to-all reduction operations, i.e., Allreduce, for ML/DL applications; and 3) adaptive CPU-GPU hybrid packing/unpacking with dynamic kernel fusion and zero-copy schemes for non-contiguous data transfers. The proposed scalable broadcast schemes yield a 64% performance improvement for a streaming workload on 88 GPUs. The link-efficient Allreduce designs help ML/DL frameworks such as TensorFlow, PyTorch, and Horovod scale distributed training over 1,536 GPUs on the Summit system; moreover, they outperform the state-of-the-art NCCL library by up to 1.5X when training the ResNet-50 model on image data with PyTorch. The adaptive MPI derived-datatype processing eliminates expensive packing/unpacking and data-movement operations on dense GPU systems and is up to three orders of magnitude faster than production libraries for 3D domain decomposition, a critical method powering scientific applications such as weather forecasting and molecular dynamics simulations. Finally, the proposed designs are made publicly available in the MVAPICH2-GDR library for the HPC community.