Desh
Author: | Anwesha Das, Abhinav Vishnu, Charles Siegel, Frank Mueller |
---|---|
Year of publication: | 2018 |
Subject: | Distributed computing; Exploit; Computer science; Deep learning; Node (networking); Context (language use); Engineering and technology; Machine learning; Recurrent neural network; Component-based software engineering; Electrical engineering, electronic engineering, information engineering; Artificial intelligence & image processing; Anomaly detection; Artificial intelligence; Lead time |
Source: | HPDC |
DOI: | 10.1145/3208040.3208051 |
Description: | Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are likely to experience even higher fault rates due to increased component count and density. Triggering resilience-mitigating techniques remains a challenge due to the absence of well-defined failure indicators. System logs consist of unstructured text that obscures the essential system health information contained within. In this context, efficient failure prediction via log mining can enable proactive recovery mechanisms to increase reliability. This work aims to predict node failures that occur in supercomputing systems via long short-term memory (LSTM) networks, a form of recurrent neural network (RNN). Our framework, Desh (Deep Learning for System Health), diagnoses and predicts failures with short lead times. Desh identifies failure indicators with enhanced training and classification for generic applicability to logs from operating systems and software components, without the need to modify any of them. Desh uses a novel three-phase deep learning approach to (1) train to recognize chains of log events leading to a failure, (2) re-train chain recognition of events augmented with expected lead times to failure, and (3) predict lead times during testing/inference deployment to determine which specific node will fail in how many minutes. Desh achieves average lead times of up to 3 minutes with at least 85% recall and 83% accuracy, leaving time to take proactive actions on failing nodes, such as migrating computation to healthy nodes. |
Database: | OpenAIRE |
External link: |
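
The description mentions chains of log events that are augmented with expected lead times to failure. As a rough illustration of that idea only, and not the authors' implementation, the Python sketch below groups already-parsed log events by node and annotates each event preceding a failure with the time remaining until that node fails. The record layout, the sample events, the `FAILURE_EVENT` marker, and the function name `chains_with_lead_times` are all hypothetical; the LSTM training and inference phases are not shown.

```python
from collections import defaultdict

# Hypothetical, already-parsed log records: (timestamp in seconds, node id, event phrase).
# Real system logs are unstructured text; parsing them is outside this sketch.
LOG = [
    (0,   "n1", "link error"),
    (30,  "n2", "heartbeat ok"),
    (60,  "n1", "memory warning"),
    (150, "n1", "kernel panic"),
    (180, "n1", "node down"),  # assumed failure marker
]

FAILURE_EVENT = "node down"  # assumption: a single phrase marks a node failure


def chains_with_lead_times(log, failure_event=FAILURE_EVENT):
    """Return, per failed node, the events preceding its first failure,
    each paired with its lead time (seconds until that failure)."""
    per_node = defaultdict(list)
    for ts, node, event in sorted(log):
        per_node[node].append((ts, event))

    chains = {}
    for node, events in per_node.items():
        failure_times = [ts for ts, ev in events if ev == failure_event]
        if not failure_times:
            continue  # this node never failed in the observed window
        t_fail = failure_times[0]
        chains[node] = [(ev, t_fail - ts) for ts, ev in events if ts < t_fail]
    return chains


chains = chains_with_lead_times(LOG)
```

Sequences of this shape, event phrases paired with a time-to-failure value, are one plausible form of training input for a sequence model; how Desh actually encodes events and lead times is not specified in this record.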