Failure analysis and prediction for the CIPRES science gateway

Authors: Lawrence K. Saul, Kritika Singh, Shava Smallen, Sameer Tilak
Publication year: 2015
Subject:
Source: Concurrency and Computation: Practice and Experience. 28:1971-1981
ISSN: 1532-0634, 1532-0626
Description: Science gateways promote collaboration among researchers by providing them with access to community-developed tools and data collections. The Cyberinfrastructure for Phylogenetic Research (CIPRES) science gateway is one of the most popular gateways, with approximately 3000 active users since 2012, and the user base is growing each year. While increasing the number of compute resources available to CIPRES would address its growth needs, it also introduces additional complexity as the likelihood of failure increases. In this paper, we analyze historical job data from CIPRES and combine it with historical software and services monitoring data to create a machine learning model to predict whether a user's job will complete successfully on resources. At one operating point of our classifier, we are able to detect 50% of jobs that will fail with a false detection rate of less than 5%. In 2014, accurately predicting 50% of CIPRES job failures and redirecting them to other resources would have resulted in 900K compute core hours saved, furthering phylogenetic research. These statistical models will also be used as a base to build a more generic automated monitoring analysis service for science gateways. Copyright © 2015 John Wiley & Sons, Ltd.
Database: OpenAIRE