Description: |
Data processing applications often combine computations with disparate fault-tolerance requirements. For example, batch computations prioritize throughput over recovery latency, and can tolerate recovery delays of up to several minutes, while streaming computations expect recovery latencies of at most a few seconds. However, state-of-the-art data systems each offer a single fault-tolerance regime, so complex applications either: (i) suffer performance degradation in steady state and during recovery because the fault-tolerance regime is a poor fit for parts of the application, or (ii) are difficult to maintain because they are developed using fragile combinations of batch and streaming systems that provide different APIs and schedulers, and evolve independently. This paper describes Falkirk Wheel, a design for rollback recovery that enables applications to combine different fault-tolerance regimes. The design is based on logical times and is expressive enough for general applications, including incremental and iterative computations. Our experiments show that an implementation of Falkirk Wheel in Naiad successfully combines fault-tolerance regimes, with an order of magnitude lower response latencies in steady state than Naiad's batch-tuned fault tolerance. Moreover, Falkirk Wheel is competitive with streaming systems tuned for single fault-tolerance regimes, as it provides 3-5x lower response latencies than Flink and Drizzle in steady state and during failure recovery on the Yahoo! Streaming Benchmark.