Author: Kevin D. Colby, Amiya K. Maji, Joseph Bottum, Jason Rahman
Year of publication: 2017
Subject:
Source: Proceedings of the Fourth International Workshop on HPC User Support Tools.
DOI: 10.1145/3152493.3152555
Description: HPC systems are made of many complex hardware and software components, and interactions between these components can often break, leading to job failures and customer dissatisfaction. Testing focused on individual components is often inadequate to identify broken inter-component interactions. To detect and avoid these, a holistic testing framework is needed that can test the full functionality and performance of a cluster from a user's perspective. Existing tools for HPC cluster testing are either rigid (i.e., they work only within the context of a single cluster) or focused on system components (i.e., OS and middleware). In this paper, we present Testpilot, a flexible, holistic, and user-centric testing framework that can be used by system administrators, support staff, or even users themselves. Testpilot can be used in various testing scenarios such as application testing, application updates, OS updates, or continuous monitoring of cluster health. The authors have found Testpilot to be invaluable for regression testing at their HPC site, and it has caught many issues that would otherwise have gone into production unnoticed.
Database: OpenAIRE