Failure analysis and prediction for the CIPRES science gateway
Author: | Lawrence K. Saul, Kritika Singh, Shava Smallen, Sameer Tilak |
---|---|
Year of publication: | 2015 |
Subject: |
020203 distributed computing; Computer Networks and Communications; business.industry; Computer science; 02 engineering and technology; Science gateway; Data science; Computer Science Applications; Theoretical Computer Science; World Wide Web; Software; Cyberinfrastructure; Computational Theory and Mathematics; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; business; Classifier (UML) |
Source: | Concurrency and Computation: Practice and Experience. 28:1971-1981 |
ISSN: | 1532-0634 1532-0626 |
Description: | Science gateways promote collaboration among researchers by providing them with access to community-developed tools and data collections. The Cyberinfrastructure for Phylogenetic Research (CIPRES) science gateway is one of the most popular gateways, with approximately 3000 active users since 2012, and the user base is growing each year. While increasing the number of compute resources available to CIPRES would address their growth needs, it also introduces additional complexity as the likelihood of failure increases. In this paper, we analyze historical job data from CIPRES and combine it with historical software and services monitoring data to create a machine learning model to predict where a user's job will complete successfully on resources. At one operating point of our classifier, we are able to detect 50% of jobs that will fail with a false detection rate less than 5%. In 2014, accurately predicting 50% of CIPRES job failures and redirecting them to other resources would have resulted in 900K compute core hours saved, furthering phylogenetic research. These statistical models will also be used as a base to build a more generic automated monitoring analysis service for science gateways. Copyright © 2015 John Wiley & Sons, Ltd. |
Database: | OpenAIRE |