Corrected trees for reliable group communication
Authors: | Martin Küttler, Carsten Weinhold, Torsten Hoefler, Maksym Planeta, Jan Bierbaum, Amnon Barak, Hermann Härtig |
---|---|
Publication year: | 2019 |
Subject: | 020203 distributed computing; business.industry; Gossip; Computer science; Communication in small groups; 0202 electrical engineering, electronic engineering, information engineering; Graph (abstract data type); 020207 software engineering; 02 engineering and technology; Latency (engineering); business; Supercomputer; Computer network |
Source: | PPoPP Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming |
DOI: | 10.1145/3293883.3295721 |
Description: | Driven by the ever-increasing performance demands of compute-intensive applications, supercomputing systems comprise more and more nodes. This growth is a significant burden for fast group communication primitives and also makes those systems more susceptible to failures of individual nodes. In this paper we present a two-phase fault-tolerant scheme for group communication. Using broadcast as an example, we provide a full-spectrum discussion of our approach, from a formal analysis to LogP-based simulations to a message-passing-based implementation running on a large cluster. Ultimately, we are able to reduce the complex problem of reliable and fault-tolerant collective group communication to a graph-theoretical renumbering problem. Both simulations and measurements show that our solution reduces latency by 50% while sending up to six times fewer messages than existing schemes. |
Database: | OpenAIRE |