Resilient Scheduling of Moldable Parallel Jobs to Cope with Silent Errors

Autor: Hongyang Sun, Padma Raghavan, Lucas Perotin, Anne Benoit, Valentin Le Fèvre, Yves Robert
Přispěvatelé: Optimisation des ressources : modèles, algorithmes et ordonnancement (ROMA), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire de l'Informatique du Parallélisme (LIP), École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Centre National de la Recherche Scientifique (CNRS), Laboratoire de l'Informatique du Parallélisme (LIP), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Barcelona Supercomputing Center - Centro Nacional de Supercomputacion (BSC - CNS), Vanderbilt University [Nashville], Innovative Computing Laboratory [Knoxville] (ICL), The University of Tennessee [Knoxville], University of Kansas [Kansas City], École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL)
Jazyk: angličtina
Rok vydání: 2021
Předmět:
Zdroj: IEEE Transactions on Computers
IEEE Transactions on Computers, 2021, pp.1-14. ⟨10.1109/TC.2021.3104747⟩
IEEE Transactions on Computers, Institute of Electrical and Electronics Engineers, 2021, pp.14. ⟨10.1109/TC.2021.3104747⟩
IEEE Transactions on Computers, Institute of Electrical and Electronics Engineers, 2021, pp.1-14. ⟨10.1109/TC.2021.3104747⟩
ISSN: 0018-9340
DOI: 10.1109/TC.2021.3104747⟩
Popis: International audience; We study the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. Moldable jobs allow for choosing a processor allocation before execution, and their execution time obeys various speedup models. The objective is to minimize the overall completion time of the jobs, or the makespan, when jobs can fail due to silent errors and hence may need to be re-executed after each failure until successful completion. Our work generalizes the classical scheduling framework for failure-free jobs. To cope with silent errors, we introduce two resilient scheduling algorithms, LPA-List and Batch-List, both of which use the List strategy to schedule the jobs. Without knowing a priori how many times each job will fail, LPA-List relies on a local strategy to allocate processors to the jobs, while Batch-List schedules the jobs in batches and allows only a restricted number of failures per job in each batch. We prove new approximation ratios for the two algorithms under several prominent speedup models (e.g., roofline, communication, Amdahl, power, monotonic, and a mixed model). An extensive set of simulations is conducted to evaluate different variants of the two algorithms, and the results show that they consistently outperform some baseline heuristics. Overall, our best algorithm is within a factor of 1.6 of a lower bound on average over the entire set of experiments, and within a factor of 4.2 in the worst case.
Databáze: OpenAIRE