Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing
Author: | Dirk Ribbrock, Mirco Altenbernd, Dominik Göddeke |
---|---|
Year of publication: | 2015 |
Subject: | Computer Networks and Communications; Computer science; Fault tolerance; Parallel computing; Data loss; Supercomputer; Computer Graphics and Computer-Aided Design; Finite element method; Theoretical Computer Science; Multigrid method; Artificial Intelligence; Hardware and Architecture; Asynchronous communication; Software; Data compression |
Source: | Parallel Computing. 49:117-135 |
ISSN: | 0167-8191 |
DOI: | 10.1016/j.parco.2015.07.003 |
Description: | Fault-tolerant and robust multigrid methods. Hierarchical finite element compression. Asynchronous checkpointing with local restart. We analyse novel fault tolerance schemes for data loss in multigrid solvers, which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy, and relax the synchronicity requirement through a local-failure local-recovery approach. We experimentally identify the root cause of convergence degradation in the presence of data loss using smoothness considerations. Our resulting schemes form a family of techniques that can be tailored to the expected error probability of (future) large-scale machines. A performance model gives further insight into the benefits and applicability of our techniques. |
Database: | OpenAIRE |