Lineage stash
Author: | Philipp Moritz, Robert Nishihara, John Liagouris, Ujval Misra, Alexey Tumanov, Stephanie Wang, Ion Stoica |
---|---|
Year of publication: | 2019 |
Subject: | Lineage (genetic); Computer science; Distributed computing; Mission critical; ComputingMilieux_PERSONALCOMPUTING; 020206 networking & telecommunications; 020207 software engineering; Fault tolerance; 02 engineering and technology; Stream processing; Task (computing); Computer cluster; Dryad (programming); 0202 electrical engineering, electronic engineering, information engineering; Critical path method |
Source: | SOSP |
Description: | As cluster computing frameworks such as Spark, Dryad, Flink, and Ray are deployed in mission-critical applications and on ever larger clusters, their ability to tolerate failures is growing in importance. These frameworks employ two broad approaches to fault tolerance: checkpointing and lineage. Checkpointing exhibits low overhead during normal operation but high overhead during recovery, while lineage-based solutions make the opposite tradeoff. We propose the lineage stash, a decentralized causal logging technique that significantly reduces the runtime overhead of lineage-based approaches without impacting recovery efficiency. With the lineage stash, instead of recording a task's information before the task is executed, we record it asynchronously and forward the lineage along with the task. This makes it possible to support large-scale, low-latency (millisecond-level) data processing applications with low runtime and recovery overheads. Experimental results for applications in distributed training and stream processing show that the lineage stash provides task execution latencies similar to checkpointing alone, while incurring a recovery overhead as low as traditional lineage-based approaches. |
Database: | OpenAIRE |
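The idea stated in the description (record a task's lineage asynchronously and forward it along with the task) can be illustrated with a toy sketch. This is only a reading of the abstract, not the paper's protocol or Ray's API: `TaskRecord`, `DurableLog`, `Worker`, and the explicit `flush` call are all hypothetical names, and a real system would flush in the background and handle recovery, which this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    """Hypothetical lineage entry: a task and the tasks it depends on."""
    task_id: str
    depends_on: tuple = ()


class DurableLog:
    """Hypothetical stand-in for a reliable lineage store."""

    def __init__(self):
        self.records = {}

    def write(self, record):
        self.records[record.task_id] = record


class Worker:
    """Toy worker that keeps not-yet-durable lineage in a local stash."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.stash = {}      # lineage not yet written to the durable log
        self.executed = []   # ids of tasks this worker has run

    def submit(self, record, receiver):
        # Record the lineage locally; do not wait for the durable log.
        self.stash[record.task_id] = record
        # Forward the task together with a copy of the uncommitted lineage.
        receiver.receive(record, dict(self.stash))

    def receive(self, record, forwarded):
        # Keep the forwarded lineage so it survives a failure of the sender.
        self.stash.update(forwarded)
        self.executed.append(record.task_id)

    def flush(self):
        # Assumption: a real system would do this asynchronously.
        for record in self.stash.values():
            self.log.write(record)
        self.stash.clear()


log = DurableLog()
sender, receiver = Worker("sender", log), Worker("receiver", log)
sender.submit(TaskRecord("t1"), receiver)
receiver.flush()  # the receiver can persist "t1" even if the sender is lost
```

In this sketch the task runs before anything reaches the durable log, which keeps logging off the execution path, while the receiver's copy of the stash means the lineage is not lost if the sender fails before flushing.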