Practice and Experience in using Parallel and Scalable Machine Learning with Heterogenous Modular Supercomputing Architectures

Authors: Gabriele Cavallaro, Petur Einarsson, Matthias Book, Helmut Neukirchen, Chadi Barakat, Morris Riedel, Andreas Lintermann, Reza Hassanian, Rocco Sedona
Year of publication: 2021
Subject:
Source: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
IEEE 76-85 (2021). doi:10.1109/IPDPSW52791.2021.00019
IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Portland, USA, 2021-06-17 to 2021-06-21
IPDPS Workshops
DOI: 10.1109/ipdpsw52791.2021.00019
Description: We observe a continuously increasing use of Deep Learning (DL) as a specific type of Machine Learning (ML) for data-intensive problems (i.e., ‘big data’) that require powerful computing resources with equally increasing performance. Consequently, innovative heterogeneous High-Performance Computing (HPC) systems based on multi-core CPUs and many-core GPUs require an architectural design that addresses the requirements of end-user communities that take advantage of ML and DL. Still, the workloads of end-user communities in the simulation sciences (e.g., using numerical methods based on known physical laws) need to be equally supported in those architectures. This paper offers insights into the Modular Supercomputer Architecture (MSA) developed in the Dynamic Exascale Entry Platform (DEEP) series of projects to address the requirements of both simulation sciences and data-intensive sciences such as High Performance Data Analytics (HPDA). It shares insights into implementing the MSA at the Jülich Supercomputing Centre (JSC), which hosts Europe's No. 1 supercomputer, the Jülich Wizard for European Leadership Science (JUWELS). We augment the technical findings with experience and lessons learned from two application community case studies (i.e., remote sensing and health sciences) using the MSA with JUWELS and the DEEP systems in practice. Thus, the paper provides details on specific MSA design elements that enable significant performance improvements of ML and DL algorithms. While this paper focuses on MSA-based HPC systems and application experience, we do not lose sight of advances in Cloud Computing (CC) and Quantum Computing (QC) relevant for ML and DL.
Database: OpenAIRE