A Skeletal-Based Approach for the Development of Fault-Tolerant SPMD Applications
Authors: | Stéphane Vialle, Virginie Galtier, Constantinos Makassikis |
---|---|
Contributors: | Algorithms for the Grid (ALGORILLE), INRIA Lorraine, Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS), SUPELEC-Campus Metz, Ecole Supérieure d'Electricité - SUPELEC (FRANCE) |
Language: | English |
Publication year: | 2010 |
Subject: |
Computer science; Distributed computing; SPMD; 020206 networking & telecommunications; Fault tolerance; 02 engineering and technology; Parallel computing; Structuring framework; 020204 information systems; Software fault tolerance; application-level checkpointing; programming skeletons; 0202 electrical engineering, electronic engineering, information engineering; Overhead (computing); Algorithmic skeleton; Routing (electronic design automation); [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]; Programmer |
Source: | The 11th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT 2010), Dec 2010, Wuhan, China. ⟨10.1109/PDCAT.2010.89⟩ |
DOI: | 10.1109/PDCAT.2010.89 |
Description: | Distributing applications over PC clusters to speed up or size up execution is now commonplace, yet tolerating the faults of these systems efficiently remains a major issue. To ease the addition of checkpoint-based fault tolerance at the application level, we introduce a Model for Low-Overhead Tolerance of Faults (MoLOToF), which is based on structuring applications using fault-tolerant skeletons. MoLOToF also encourages collaboration with the programmer and the execution environment. The skeletons are adapted to specific parallelization paradigms and yield what can be called fault-tolerant algorithmic skeletons. Applying MoLOToF to the SPMD parallelization paradigm results in our proposed FT-SPMD framework. Experiments show that the added complexity of developing an application is small and that using the framework has a small impact on performance. Comparisons with existing system-level checkpointing solutions, namely LAM/MPI and DMTCP, show that FT-SPMD has a lower runtime overhead while being more robust when a higher level of fault tolerance is required. |
Database: | OpenAIRE |
External link: |