Description: |
Ensuring the availability and elasticity of Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) cloud environments is both critical and challenging for providers and consumers of these services. IaaS and PaaS providers commonly offer automatic load balancing and failure recovery services to help achieve these goals. In this paper, we present Phantom: a failure simulation, monitoring, and profiling framework whose main goal is to ensure adequate dependability and availability of cloud subsystems. Phantom uses failure simulation to “perturb” the cloud during normal operation, while monitoring and profiling service availability as perceived by the end user. Unlike traditional cloud monitoring systems, Phantom is an “active meta-monitor” that can detect degradation of the cloud's own failure detector and recovery systems. We describe Phantom's two main contributions: (1) Havoc, an extensible cloud failure simulator, and (2) a meta-monitor and analytic method for continuously profiling the quality of the failure detector and recovery services. We demonstrate Phantom's sensitivity to service degradation through a variety of experiments conducted on an open cloud platform. |