Triage
Authors: | Yuanyuan Zhou, Chengdu Huang, Spiros Xanthos, Shan Lu, Joseph Tucek |
---|---|
Publication year: | 2007 |
Subject: |
business.industry; Computer science; media_common.quotation_subject; computer.software_genre; Triage; Reliability engineering; Software; Debugging; Server; Operating system; Overhead (computing); General Earth and Planetary Sciences; Medical diagnosis; business; Programmer; computer; Protocol (object-oriented programming); media_common; General Environmental Science |
Source: | SOSP |
ISSN: | 0163-5980 |
DOI: | 10.1145/1323293.1294275 |
Description: | Diagnosing production-run failures is a challenging yet important task. Most previous work focuses on offsite diagnosis, i.e., development-site diagnosis with the programmers present. This is insufficient for production-run failures because: (1) it is difficult to reproduce failures offsite for diagnosis; (2) offsite diagnosis cannot provide timely guidance for recovery or security purposes; (3) it is infeasible to provide a programmer to diagnose every production-run failure; and (4) privacy concerns limit the release of information (e.g., core dumps) to programmers. To address production-run failures, we propose a system, called Triage, that automatically performs onsite software failure diagnosis at the very moment of failure. It provides a detailed diagnosis report, including the failure nature, triggering conditions, related code and variables, the fault propagation chain, and potential fixes. Triage achieves this by leveraging lightweight reexecution support to efficiently capture the failure environment, repeatedly replay the moment of failure, and dynamically analyze an occurring failure using different diagnosis techniques. Triage employs a failure diagnosis protocol that mimics the steps a human takes in debugging. This extensible protocol provides a framework to enable the use of various existing and new diagnosis techniques. We also propose a new failure diagnosis technique, delta analysis, to identify failure-related conditions, code, and variables. We evaluate these ideas in real system experiments with 10 real software failures from 9 open source applications, including four servers. Triage accurately diagnoses the evaluated failures, providing likely root causes and even the fault propagation chain, while keeping normal-run overhead under 5%. Finally, our user study of the diagnosis and repair of real bugs shows that Triage saves time (99.99% confidence), reducing the total time to fix by almost half. |
Database: | OpenAIRE |
External link: |