A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

Author:	Kang Yu, Jian Gao, Peng Qing, Hong-Mei Wei
Year of publication:	2017
Subject:	Job scheduler; 021110 strategic defence & security studies; business.industry; Computer science; Distributed computing; 0211 other engineering and technologies; Context (language use); 02 engineering and technology; computer.software_genre; Supercomputer; Fault (power engineering); Fault detection and isolation; Theoretical Computer Science; Tree (data structure); Middleware; Embedded system; Scalability; business; computer; Software; Information Systems
Source:	International Journal of Parallel Programming. 46:749-761
ISSN:	1573-7640 0885-7458
Description:	Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been applied to HPC systems; however, as these systems scale out, the effectiveness of existing techniques deteriorates rapidly. In this context, we propose a message-passing based fault localization framework, namely MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and tree-based fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries, enabling system middleware such as the job scheduler to supply information about abnormal conditions. We present details of the MPFL framework, including the implementation of TFD and TFA. Further, we develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation is performed on a typical HPC cluster with 10 computing nodes; it demonstrates the capability of MPFL and shows that the MPFL service does not affect the performance of an application in practice.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::9a7b18b27cc203354e773311a7c5a344 https://doi.org/10.1007/s10766-017-0526-x