Challenges in Developing MPI Fault-Tolerant Fortran Applications

Authors:	James P. Vary, Glenn R. Luecke, Nathan T. Weeks, Pieter Maris
Publication year:	2018
Subject:	Distributed computing; Computer science; Fortran; Process (engineering); Overhead (engineering); Initialization; Fault tolerance; Supercomputer; Operating system; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020203 distributed computing; 020204 information systems
Source:	CLUSTER
Description:	Powerful high performance computing systems of the future are expected to have higher failure rates than current systems. As a result, HPC applications running on such future systems are more likely to encounter a system failure than those running on today's machines. Application fault tolerance is therefore becoming more important to avoid the costly waste of resources associated with rerunning failed applications. The MPI 3.1 standard does not address the issue of MPI process failures. Checkpoint/restart is commonly used to add fault tolerance to MPI applications. However, complicated issues can affect an MPI application's ability to write checkpoint files correctly and efficiently, particularly if Fortran I/O statements are used. Moreover, it may be inefficient to restart a large number of MPI processes from a checkpoint. Several MPI fault tolerance libraries, such as ULFM, are being developed to enable MPI programs to recover from MPI process failures. This can circumvent much of the overhead of an application restart, including rescheduling, launching, initializing, and reading checkpoint data. Each library uses a different approach to recovering from MPI process failures. Unfortunately, some of the proposed recovery models are incompatible with Fortran. This paper intends to help Fortran MPI application developers avoid problems when developing fault-tolerant MPI applications.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::ce7a2f1c7af62a75d0cd8cfd604a780f https://doi.org/10.1109/cluster.2018.00068