Challenges in Developing MPI Fault-Tolerant Fortran Applications

Autor: James P. Vary, Glenn R. Luecke, Nathan T. Weeks, Pieter Maris
Rok vydání: 2018
Předmět:
Zdroj: CLUSTER
Popis: Powerful high performance computing systems of the future are expected to have higher failure rates than current systems. As a result, HPC applications running on such future systems are more likely to encounter a system failure than on today's machines. Application fault tolerance is therefore becoming more important to avoid costly waste of resources associated with rerunning failed applications. The MPI 3.1 standard does not address the issue of MPI process failures. Checkpoint/restart is commonly used to add fault tolerance to MPI applications. However, there can be complicated issues impacting an MPI application's ability to correctly and efficiently write checkpoint files, particularly if Fortran I/O statements are used. Moreover, it may be inefficient restart a large number MPI processes from a checkpoint. Several MPI fault tolerance libraries, such as ULFM, are being developed to enabl MPI programs to recover from MPI process failures. This can circumvent much of the overhead of an application restart, including rescheduling, launching, initializing, and reading checkpoint data. Each library uses a different approach to recovery from MPI process failures. Unfortunately, some of the proposed recovery models are incompatible with Fortran. This paper intends to help Fortran MPI application developers avoid problems when developing fault-tolerant MPI applications.
Databáze: OpenAIRE