Network Aware Reliability Analysis for Distributed Storage Systems
Author: | Dmitry Sotnikov, Amir Epstein, Elliot K. Kolodner |
---|---|
Year of publication: | 2016 |
Subject: | 020203 distributed computing; Computer science; Reliability (computer networking); Distributed computing; 020206 networking & telecommunications; 02 engineering and technology; Data loss; Durability; Replication (computing); Reliability engineering; Distributed data store; 0202 electrical engineering, electronic engineering, information engineering; Bandwidth (computing); Erasure code; Cloud storage |
Source: | SRDS |
DOI: | 10.1109/srds.2016.042 |
Description: | It is hard to measure the reliability of a large distributed storage system, since it is influenced by low-probability failure events that occur over time. Nevertheless, it is critical to be able to predict reliability in order to plan, deploy and operate the system. Existing approaches suffer from unrealistic assumptions regarding network bandwidth. This paper introduces a new framework that combines simulation and an analytic model to estimate durability for large distributed cloud storage systems. Our approach is the first that takes into account network bandwidth, with a focus on the cumulative effect of simultaneous failures on repair time. Using our framework, we evaluate the trade-offs between durability, network and storage costs for the OpenStack Swift object store, comparing various system configurations and resiliency schemes, including replication and erasure coding. In particular, we show that when accounting for the cumulative effect of simultaneous failures, estimates of the probability of data loss can vary by two to four orders of magnitude. |
Database: | OpenAIRE |