Towards Local-Failure Local-Recovery in PDE Frameworks: The Case of Linear Solvers
Authors: | Mirco Altenbernd, Dominik Göddeke, Nils-Arne Dreier, Christian Engwer |
---|---|
Year of publication: | 2021 |
Subject: |
020203 distributed computing; Mean time between failures; business.industry; Computer science; Fault tolerance; 010103 numerical & computational mathematics; 02 engineering and technology; Parallel computing; Lossy compression; Solver; 01 natural sciences; Task (project management); Software; Component (UML); 0202 electrical engineering, electronic engineering, information engineering; Leverage (statistics); 0101 mathematics; business |
Source: | Lecture Notes in Computer Science ISBN: 9783030670764 HPCSE |
DOI: | 10.1007/978-3-030-67077-1_2 |
Description: | With the advent of exascale supercomputers, the mean time between failures is expected to decrease. Classical checkpoint-restart approaches are too expensive at that scale. Local-failure local-recovery (LFLR) strategies promise to reduce these costs, but implementing them in any sufficiently large simulation environment is a challenging task. In this paper we discuss how LFLR methods can be incorporated into a PDE framework, focusing on the linear solvers as the innermost component. We discuss how Krylov solvers can be modified to support LFLR and present numerical tests. We exemplify our approach by reporting on the implementation of these features in the Dune framework, present C++ software abstractions that simplify the incorporation of LFLR techniques, and show how we use these in our solver library. To reduce the memory costs of full remote backups, we further investigate the benefits of lossy compression and in-memory checkpointing. |
Database: | OpenAIRE |
External link: |