Towards Local-Failure Local-Recovery in PDE Frameworks: The Case of Linear Solvers

Authors: Mirco Altenbernd, Dominik Göddeke, Nils-Arne Dreier, Christian Engwer
Publication year: 2021
Subject:
Source: Lecture Notes in Computer Science ISBN: 9783030670764
HPCSE
DOI: 10.1007/978-3-030-67077-1_2
Description: With the arrival of exascale supercomputers, the mean time between failures is expected to decrease. Classical checkpoint-restart approaches are too expensive at that scale. Local-failure local-recovery (LFLR) strategies promise to reduce these costs, but implementing them in any sufficiently large simulation environment is a challenging task. In this paper we discuss how LFLR methods can be incorporated into a PDE framework, focusing on the linear solvers as the innermost component. We discuss how Krylov solvers can be modified to support LFLR and present numerical tests. We exemplify our approach by reporting on the implementation of these features in the Dune framework, present C++ software abstractions that simplify the incorporation of LFLR techniques, and show how we use them in our solver library. To reduce the memory costs of full remote backups, we further investigate the benefits of lossy compression and in-memory checkpointing.
Database: OpenAIRE