Coping with silent and fail-stop errors at scale by combining replication and checkpointing

Authors:	Aurélien Cavelan, Hongyang Sun, Padma Raghavan, Yves Robert, Franck Cappello, Anne Benoit
Contributors:	Optimisation des ressources : modèles, algorithmes et ordonnancement (ROMA), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire de l'Informatique du Parallélisme (LIP), École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Centre National de la Recherche Scientifique (CNRS), Laboratoire de l'Informatique du Parallélisme (LIP), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Argonne National Laboratory [Lemont] (ANL), Vanderbilt University [Nashville], École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL)
Language:	English
Publication year:	2018
Subject:	Computer Networks and Communications; Computer science; Distributed computing; Fault tolerance; 02 engineering and technology; Supercomputer; Process replication; Theoretical Computer Science; Artificial Intelligence; Hardware and Architecture; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]; Algorithm; Software
Source:	Journal of Parallel and Distributed Computing, Elsevier, 2018, 122, pp.209-225. ⟨10.1016/j.jpdc.2018.08.002⟩
ISSN:	0743-7315 1096-0848
DOI:	10.1016/j.jpdc.2018.08.002
Description:	International audience; This paper provides a model and an analytical study of replication as a technique to cope with silent errors, as well as with a mixture of silent and fail-stop errors, on large-scale platforms. Unlike fail-stop errors, which are detected immediately when they occur, silent errors require a detection mechanism. Many application-specific techniques are available to detect silent errors, based on algorithms (e.g., ABFT), invariant preservation, or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level of replication (duplication, triplication, or more) for two frameworks: (i) when the platform is subject only to silent errors, and (ii) when the platform is subject to both silent and fail-stop errors. A higher level of replication is more expensive in terms of resource usage but makes it possible to tolerate more errors and even to correct some of them; hence there is a trade-off to be found. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint, as it also does when a fail-stop error strikes. We provide a detailed analytical study of all these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that corroborate the analytical model.
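The compare-then-checkpoint rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `agreed_result` and `end_of_pattern` are hypothetical, replica results are assumed to be hashable values that can be compared for equality, and fail-stop errors are not modeled.

```python
from collections import Counter


def agreed_result(results):
    """Return the value the replicas agree on, or None if a rollback is needed.

    Duplication (2 replicas): both results must coincide.
    Triplication (3 replicas): at least two of the three must coincide.
    """
    if len(results) not in (2, 3):
        raise ValueError("this sketch only covers duplication and triplication")
    # Most frequent result and how many replicas produced it.
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None


def end_of_pattern(results, last_checkpoint):
    """Decide what happens at the end of one work segment.

    Returns (state, checkpointed): the agreed state and True when a
    checkpoint can be taken, or the previous checkpoint and False when
    a silent error was detected and the application must roll back.
    """
    value = agreed_result(results)
    if value is None:
        return last_checkpoint, False
    return value, True
```

With triplication, a single corrupted replica is outvoted and execution continues, whereas with duplication any mismatch forces a rollback: `end_of_pattern(["s1", "bad", "s1"], "s0")` yields the agreed state `"s1"`, while `end_of_pattern(["s1", "bad"], "s0")` falls back to `"s0"`.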
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::4ef06e9894c20afa7bdcfeb34dae0552 https://inria.hal.science/hal-02082389/file/jpdc_revision.pdf