Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

Authors: Luca Bonaventura, Mike Gillard, Keita Teranishi, Dominik Göddeke, Peter Düben, Chris D. Cantwell, Erwan Raffin, Tommaso Benacchio, Nils Wedi, Mirco Altenbernd, Luc Giraud
Contributors: Modeling and Scientific Computing [Milano] (MOX), Politecnico di Milano [Milan] (POLIMI), Institute for Applied Analysis and Numerical Simulation [Stuttgart], University of Stuttgart, Department of Aeronautics, Imperial College London, European Centre for Medium-Range Weather Forecasts (ECMWF), Loughborough University, High-End Parallel Algorithms for Challenging Numerical Simulations (HiePACS), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria), Center for Excellence in Performance Programming [Rennes] (CEPP), Atos, Sandia National Laboratories [Livermore], Sandia National Laboratories - Corporation
Funding: This work was supported by the ESCAPE-2 project, European Union’s Horizon 2020 Research and Innovation Programme (Grant Agreement No. 800897), the ESiWACE2 Centre of Excellence, European Union’s Horizon 2020 Research and Innovation Programme (Grant Agreement No. 823988), and the Deutsche Forschungsgemeinschaft under Germany’s Excellence Strategy – EXC-2075 (Grant Agreement No. 390740016).
Language: English
Year of publication: 2021
Subject:
Technology
Computer science
Numerical weather prediction
Weather and climate
0805 Distributed Computing
010103 numerical & computational mathematics
02 engineering and technology
Fault-tolerant computing
01 natural sciences
Theoretical Computer Science
Application-level resilience
Computer Science, Theory & Methods
0202 electrical engineering, electronic engineering, information engineering
Mill
0101 mathematics
Computer Science, Hardware & Architecture
Resilience (network)
High-performance computing
020203 distributed computing
Science & Technology
Fault tolerance
Supercomputer
Reliability engineering
13. Climate action
Hardware and Architecture
Computer Science, Interdisciplinary Applications
MPI
Iterative solvers
[INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]
Distributed Computing
Software
[MATH.MATH-NA]Mathematics [math]/Numerical Analysis [math.NA]
Source: The International Journal of High Performance Computing Applications
International Journal of High Performance Computing Applications, SAGE Publications, 2021, 35 (4), pp.285-311. ⟨10.1177/1094342021990433⟩
ISSN: 1094-3420
DOI: 10.1177/1094342021990433
Description: International audience; Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, the increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, and in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to the current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed, and some recommendations are outlined for future developments.
Database: OpenAIRE