Fail-Slow at Scale

Authors: Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Andree Jacobson, Tim Emami, Deepthi Srinivasan, Riza O. Suminto, Peter Alvaro, Casey Golliher, Robert Ross, Biswaranjan Panda, Kevin Harms, Xing Lin, Robert Ricci, H. Birali Runesha, Russell Sears, Huaicheng Li, Haryadi S. Gunawi, Andrew D. Baptist, Kirk Webb, Weiguang Sheng, Mingzhe Hao, Swaminathan Sundararaman, Parks Fields
Year of publication: 2018
Subject:
Source: ACM Transactions on Storage. 14:1-26
ISSN: 1553-3093, 1553-3077
Description: Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, such as disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults can convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.
Database: OpenAIRE