Revisiting the double checkpointing algorithm
Author: | Yves Robert, Thomas Herault, Jack Dongarra |
---|---|
Contributors: | Innovative Computing Laboratory [Knoxville] (ICL), The University of Tennessee [Knoxville], Laboratoire de l'Informatique du Parallélisme (LIP), École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Optimisation des ressources : modèles, algorithmes et ordonnancement (ROMA), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire de l'Informatique du Parallélisme (LIP), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Centre National de la Recherche Scientifique (CNRS), IEEE, ANR-10-BLAN-0301,RESCUE,Résilience des applications scientifiques sur machines exascales(2010), École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL) |
Language: | English |
Publication year: | 2013 |
Subject: |
Computer science
ACM: G.: Mathematics of Computing/G.3: PROBABILITY AND STATISTICS/G.3.10: Reliability and life testing; 02 engineering and technology; ACM: G.: Mathematics of Computing; 020202 computer hardware & architecture; Scheduling (computing); ACM: C.: Computer Systems Organization/C.4: PERFORMANCE OF SYSTEMS; ACM: F.: Theory of Computation/F.2: ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY/F.2.2: Nonnumerical Algorithms and Problems; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Double check; Stable storage; [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]; Algorithm; Performance model |
Source: | APDCM 2013, IEEE, 2013, Boston, United States; IPDPS Workshops; [Research Report] RR-8196, 2012 |
Description: | International audience; Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based on double checkpointing and compares the blocking algorithm of Zheng, Shi and Kalé with the non-blocking algorithm of Ni, Meneses and Kalé in terms of both performance and risk. We also extend the model that they proposed in order to assess the impact of the overhead associated with non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols and compare them through comprehensive simulations. |
Database: | OpenAIRE |
External link: |