Software approaches for resilience of high performance computing systems: a survey.

Authors: Jia, Jie; Liu, Yi; Zhang, Guozhen; Gao, Yulin; Qian, Depei
Source: Frontiers of Computer Science; Aug 2023, Vol. 17, Issue 4, pp. 1-15 (15 pages)
Abstract: As high-performance computing (HPC) systems have scaled up in recent years, their reliability has steadily declined. System resilience has therefore come to be regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. First, a classification of software resilience approaches is presented; then we introduce the major approaches and techniques, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. In addition, challenges posed by system scale and heterogeneous architectures are discussed. [ABSTRACT FROM AUTHOR]
Database: Complementary Index