Fail-Slow at Scale
Authors: | Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Andree Jacobson, Tim Emami, Deepthi Srinivasan, Riza O. Suminto, Peter Alvaro, Casey Golliher, Robert Ross, Biswaranjan Panda, Kevin Harms, Xing Lin, Robert Ricci, H. Birali Runesha, Russell Sears, Huaicheng Li, Haryadi S. Gunawi, Andrew D. Baptist, Kirk Webb, Weiguang Sheng, Mingzhe Hao, Swaminathan Sundararaman, Parks Fields |
---|---|
Year of publication: | 2018 |
Subject: | 010302 applied physics; business.industry; Computer science; Scale (chemistry); 020206 networking & telecommunications; 02 engineering and technology; Cluster (spacecraft); 01 natural sciences; Hardware and Architecture; 0103 physical sciences; 0202 electrical engineering, electronic engineering, information engineering; Production (economics); business; Failure mode and effects analysis; Computer hardware; Jitter |
Source: | ACM Transactions on Storage. 14:1-26 |
ISSN: | 1553-3093 1553-3077 |
Description: | Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, such as disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults can convert from one form to another, the chains of cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers. |
Database: | OpenAIRE |