Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
Authors: | Luca Bonaventura, Mike Gillard, Keita Teranishi, Dominik Göddeke, Peter Düben, Chris D. Cantwell, Erwan Raffin, Tommaso Benacchio, Nils Wedi, Mirco Altenbernd, Luc Giraud |
---|---|
Contributors: | Modeling and Scientific Computing [Milano] (MOX), Politecnico di Milano [Milan] (POLIMI), Institute for Applied Analysis and Numerical Simulation [Stuttgart], University of Stuttgart, Department of Aeronautics, Imperial College London, European Centre for Medium-Range Weather Forecasts (ECMWF), Loughborough University, High-End Parallel Algorithms for Challenging Numerical Simulations (HiePACS), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria), Center for Excellence in Performance Programming [Rennes] (CEPP), Atos, Sandia National Laboratories [Livermore], Sandia National Laboratories - Corporation, This work was supported by the ESCAPE-2 project, European Union’s Horizon 2020 Research and Innovation Programme (Grant Agreement No. 800897), the ESiWACE2 Centre of Excellence, European Union’s Horizon 2020 Research and Innovation Programme (Grant Agreement No. 823988), and the Deutsche Forschungsgemeinschaft under Germany’s Excellence Strategy – EXC-2075 (Grant Agreement No. 390740016). |
Language: | English |
Year of publication: | 2021 |
Subject: | Technology; Computer science; Numerical weather prediction; Weather and climate; 0805 Distributed Computing; 010103 numerical & computational mathematics; 02 engineering and technology; Fault-tolerant computing; 01 natural sciences; Theoretical Computer Science; Application-level resilience; Computer Science Theory & Methods; 0202 electrical engineering electronic engineering information engineering; Mill; 0101 mathematics; Computer Science Hardware & Architecture; Resilience (network); High-performance computing; 020203 distributed computing; Science & Technology; Fault tolerance; Supercomputer; Reliability engineering; 13. Climate action; Hardware and Architecture; Computer Science; Computer Science Interdisciplinary Applications; MPI; Iterative solvers; [INFO.INFO-DC] Computer Science [cs]/Distributed Parallel and Cluster Computing [cs.DC]; Distributed Computing; Software; [MATH.MATH-NA] Mathematics [math]/Numerical Analysis [math.NA] |
Source: | The International Journal of High Performance Computing Applications, SAGE Publications, 2021, 35 (4), pp. 285-311. ⟨10.1177/1094342021990433⟩ |
ISSN: | 1094-3420 |
DOI: | 10.1177/1094342021990433 |
Description: | International audience; Progress in numerical weather and climate prediction accuracy depends greatly on the growth of available computing power. As the number of cores in top computing facilities pushes into the millions, the increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, and in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to the current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed, and some recommendations are outlined for future developments. |
Database: | OpenAIRE |
External link: |