Algorithm-based fault-tolerant parallel sorting

Autor: Camargo, Edson T., Junior, Elias P. Duarte
Zdroj: International Journal of Critical Computer-Based Systems; 2024, Vol. 11 Issue: 1 p2-29, 28p
Abstrakt: High performance computing (HPC) systems often require substantial resources, and can take up to several hours or days to execute. Upon a failure, it is important to loose as little computation as possible. In this work we present an algorithm-based fault tolerance (ABFT) strategy for hypercube-based parallel algorithms. The strategy assumes the virtual VCube topology, which has several logarithmic properties that are preserved even as nodes fail. The strategy guarantees that the algorithm does not halt even after up to (N- 1) nodes crash, in a system of Nnodes. We use parallel sorting as a case study, describing how to make a fault-tolerant version of three parallel sorting algorithms: HyperQuickSort, QuickMerge and Bitonic Sort. The algorithms were implemented in MPI using ULMF to handle faults. Experimental results are presented showing the performance and robustness of the solution for sorting up to a billion integers in scenarios with faults.
Databáze: Supplemental Index