From tasks graphs to asynchronous distributed checkpointing with local restart

Author:	Samuel Thibault, Romain Lion
Contributors:	STatic Optimizations, Runtime Methods (STORM); Laboratoire Bordelais de Recherche en Informatique (LaBRI); Université de Bordeaux (UB); Centre National de la Recherche Scientifique (CNRS); École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB); Inria Bordeaux - Sud-Ouest; Institut National de Recherche en Informatique et en Automatique (Inria); Plafrim; European Project: 801015, H2020, EXA2PRO (2018)
Language:	English
Publication year:	2020
Subject:	Scheme (programming language); Buddy in-memory; 020203 distributed computing; Computer science; Distributed computing; Node (networking); Fault tolerance; 02 engineering and technology; Fault (power engineering); Task-based programming; Checkpoint-restart; Task (computing); Asynchronous communication; 020204 information systems; Scalability; 0202 electrical engineering, electronic engineering, information engineering; [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]; Data transmission
Source:	FTXS 2020 - IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale, Nov 2020, Atlanta / Virtual, United States. ⟨10.1109/FTXS51974.2020.00009⟩ FTXS@SC
Description:	The ever-increasing number of computation units assembled in current HPC platforms leads to a concerning increase in fault probability. Traditional checkpoint/restart strategies avoid wasting large amounts of computation time when such a fault occurs. With the increasing amount of data handled by current applications, however, these strategies suffer from unreasonable data transfer demands or from the global synchronizations they entail. Meanwhile, the current trend towards task-based programming is an opportunity to revisit the principles of checkpoint/restart strategies. We here propose a checkpointing scheme that is closely tied to the execution of task graphs. We describe how it allows for completely asynchronous and distributed checkpointing, as well as localized node restart, thus opening the way to very large scalability. We also show how a synergy between the application data transfers and the checkpointing transfers can lead to a reasonable additional network load, measured to be lower than +10% on a dense linear algebra example.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::3c4cb4ce1301840e666697cbb4ee77c4 https://hal.archives-ouvertes.fr/hal-02970529v2/file/2020001221.pdf