Early Termination of Failed HPC Jobs Through Machine and Deep Learning
Author: | Marc Solé, Michal Zasadzinski, David Carrera, Victor Muntés-Mulero, Thomas Ludwig |
---|---|
Publication year: | 2018 |
Subject: |
Job scheduler; business.industry; Computer science; CPU time; 010103 numerical & computational mathematics; 02 engineering and technology; Energy consumption; computer.software_genre; Supercomputer; 01 natural sciences; Petascale computing; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Operating system; Data center; 0101 mathematics; business; computer; Operating cost |
Source: | Euro-Par 2018: Parallel Processing, 24th International Conference on Parallel and Distributed Computing, Turin, Italy, August 27-31, 2018, Proceedings; Lecture Notes in Computer Science; ISBN: 9783319969824 |
ISSN: | 0302-9743; 1611-3349 |
DOI: | 10.1007/978-3-319-96983-1_12 |
Description: | Failed jobs on a supercomputer not only waste CPU time and energy but also reduce users’ work efficiency. Mining data collected during data center operation helps to find patterns that explain failures and can be used to predict them. Automating system reactions when software failures are predicted, e.g., early termination of jobs, not only increases availability and reduces operating cost but also frees administrators’ and users’ time. In this paper, we explore a unique dataset containing the topology, operation metrics, and job scheduler history from the petascale Mistral supercomputer. We use decision trees to extract the system features most relevant to the final state of a job. Then, we successfully train a neural net to predict job evolution based on power time series of nodes. Finally, we evaluate the effect on CPU time saving for static and dynamic job termination policies. |
Database: | OpenAIRE |
External link: |