Description: |
High performance computing systems run compute-intensive jobs for multiple users. Users submit jobs to batch queues, where the jobs wait for an unknown amount of time until the required resources become available. A large amount of data (submit time, start time, end time, nodes allocated) is collected about these jobs. Analyzing the complex logs of large systems is tedious, so it is helpful to analyze them automatically in real time and take reactive measures. In this paper, we present a unified job analysis and prediction system for supercomputer jobs. Users and administrators can monitor the current system state, analyze historical data, and predict wait times of future jobs. We evaluated our wait-time predictors on real job traces from 10 different systems and observed 92.3% lower average prediction errors compared to existing methods. |