Comparative performance study of TSMP under homogeneous, heterogeneous and modular configurations

Authors: Daniel Caviedes-Voullième, Jörg Benke, Ghazal Tashakor, Ilya Zhukov, Stefan Poll
Year of publication: 2023
DOI: 10.5194/egusphere-egu23-7286
Description: The compartmentalised (modular) design often found in multiphysics Earth system models allows for progressive offloading of compute-intensive kernels to accelerators. The nature of this process implies that some model components will run on accelerators while others continue to run on CPUs, leading to the use of heterogeneous HPC architectures. Furthermore, different hardware architectures (e.g. CPUs, GPUs, quantum, neuromorphic) within an HPC system can be grouped into modules, each tailored to the requirements of a particular class of algorithms and software, and interconnected with other modules via a shared network. Some of these modules may be focused on energy-efficient scalability, whereas others may be disruptive and experimental. Such a conglomerate of different hardware modules, where each module can work stand-alone or in combination with other modules, leads to the idea of a modular supercomputer architecture (MSA). The first exascale system in Europe (JUPITER) is expected to be modular, following on from the experience of the JUWELS system. This new paradigm raises the question of how the performance and scalability of models change from homogeneous to heterogeneous to modular systems. The Terrestrial Systems Modelling Platform (TSMP) is a scale-consistent, highly modular, massively parallel, fully integrated soil-vegetation-atmosphere modelling system that couples an atmospheric model (COSMO), a land surface model (CLM), and a hydrological model (ParFlow), linked together by means of the OASIS3-MCT library. Each of these submodels can be considered a module, with its own domain size, computational load and scalability. This implies that finding optimal configurations for a given problem requires understanding many levels of non-trivial load balancing. It is currently possible to offload ParFlow to GPUs, while keeping COSMO and CLM on CPUs.
This enables both heterogeneous and modular configurations, and thus prompts the need to re-evaluate load distribution and scalability to find new optimal configurations. In a previous study, preliminary results on heterogeneous configurations were presented (https://doi.org/10.5194/egusphere-egu22-10006). In this contribution, we extend that work and present a comparative study of performance and scaling for homogeneous, heterogeneous, and modular TSMP jobs. We study strong and weak scaling for different problem sizes, and evaluate parallel efficiency for all three configurations on the JUWELS supercomputer. We further explore traces of selected cases to identify changes in behaviour under the different configurations, such as emergent MPI communication bottlenecks and the root causes of load balancing issues.
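For readers unfamiliar with the scaling metrics mentioned above, the following is a minimal sketch of how strong- and weak-scaling parallel efficiency are commonly computed from wall-clock times. It is not taken from the study: the function names and the timing values are hypothetical and serve only to illustrate the standard textbook definitions.

```python
def strong_scaling_efficiency(t_ref, n_ref, t, n):
    """Parallel efficiency for a FIXED total problem size.

    t_ref: wall-clock time of the reference run on n_ref resources
    t:     wall-clock time of the run on n resources
    A value of 1.0 means ideal scaling (time drops in proportion to n).
    """
    return (t_ref * n_ref) / (t * n)


def weak_scaling_efficiency(t_ref, t):
    """Parallel efficiency when the problem size grows with the resources.

    The work per resource is held constant, so an ideal run keeps the
    same wall-clock time as the reference run.
    """
    return t_ref / t


if __name__ == "__main__":
    # Hypothetical timings in seconds, keyed by node count (not measured data).
    strong_runs = {1: 100.0, 2: 55.0, 4: 30.0}
    n_ref = min(strong_runs)
    t_ref = strong_runs[n_ref]
    for n, t in sorted(strong_runs.items()):
        eff = strong_scaling_efficiency(t_ref, n_ref, t, n)
        print(f"strong scaling, {n} node(s): efficiency = {eff:.2f}")

    # Hypothetical weak-scaling timings: problem size grows with node count.
    weak_runs = {1: 100.0, 2: 110.0, 4: 125.0}
    t_ref_weak = weak_runs[min(weak_runs)]
    for n, t in sorted(weak_runs.items()):
        eff = weak_scaling_efficiency(t_ref_weak, t)
        print(f"weak scaling, {n} node(s): efficiency = {eff:.2f}")
```

In a coupled system such as TSMP, the same metrics could in principle be evaluated per component as well as for the whole job, since a single slow component can dominate the total time.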
Database: OpenAIRE