Burst-tolerant datacenter networks with Vertigo

Authors: Erfan Sharafzadeh, Sepehr Abdous, Soudeh Ghorbani
Year of publication: 2021
Subject:
Source: Proceedings of the 17th International Conference on emerging Networking EXperiments and Technologies.
DOI: 10.1145/3485983.3494873
Description: Microsecond-scale congestion events, known as microbursts, are a major cause of packet loss and poor application performance in today's datacenters. Given the low network utilization in datacenters, one would expect packet deflection (in-situ re-routing of packets that arrive at a full buffer to a different port) to effectively prevent packet loss. However, if deployed naively, deflection leads to excessive packet reordering, exacerbated congestion, and head-of-line blocking in switch buffers. In this study, we resolve these challenges by selectively deflecting the packets that cause persistent congestion in the network. To enable this, we augment end-host network stacks with a transport-independent extension that tracks flows and marks them with their remaining bytes. Our in-network deflection component uses this flow-size information to re-route packets from flows with more data left to send. Finally, an extension to the receive side of end-host stacks restores the correct packet order before passing packets to transport and higher-level protocols. We evaluate our design, Vertigo, under diverse datacenter workloads and show that it is effective in managing microbursts under light and heavy loads and when combined with various congestion control algorithms. For example, in a leaf-spine network under 85% load, Vertigo reduces mean incast query completion times by 3.5x, 3.3x, and 5x compared to ECMP, DRILL, and DIBS, respectively, when using TCP; by 3x, 3.5x, and 4.5x alongside DCTCP; and by 43x, 33x, and 16x when using Swift.
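The description outlines two cooperating mechanisms: switches that, when a buffer is full, deflect packets belonging to flows with the most bytes left to send, and a receive-side shim that restores packet order before transport sees the packets. The Python sketch here is a toy model under our own assumptions, not Vertigo's implementation. The `Packet`, `Switch`, and `ReorderBuffer` names, the per-packet `remaining_bytes` and `seq` tags, the FIFO port queues, and the rule of deflecting to the least-occupied other port are all hypothetical; the reorder buffer also omits the timeouts a real stack would need to cope with lost packets.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Packet:
    flow_id: int
    seq: int              # hypothetical per-flow sequence number
    remaining_bytes: int  # hypothetical tag: bytes the flow still has to send


class Switch:
    """Toy switch in which each output port buffers at most `capacity` packets."""

    def __init__(self, num_ports: int, capacity: int) -> None:
        self.capacity = capacity
        self.queues: List[List[Packet]] = [[] for _ in range(num_ports)]
        self.dropped: List[Packet] = []

    def enqueue(self, pkt: Packet, port: int) -> None:
        q = self.queues[port]
        if len(q) < self.capacity:
            q.append(pkt)
            return
        # Buffer full: the victim is the packet (arriving or queued) whose flow
        # has the most bytes left to send. The arriving packet is listed first,
        # so it becomes the victim on ties.
        candidates = [pkt] + q
        victim_idx = max(range(len(candidates)),
                         key=lambda i: candidates[i].remaining_bytes)
        victim = candidates[victim_idx]
        if victim_idx > 0:
            # A queued packet is evicted and the arriving packet takes its slot.
            q.pop(victim_idx - 1)
            q.append(pkt)
        self._deflect(victim, exclude=port)

    def _deflect(self, pkt: Packet, exclude: int) -> None:
        # Assumption: deflect to the least-occupied other port that has space;
        # drop only if every other port is also full.
        options = [i for i, q in enumerate(self.queues)
                   if i != exclude and len(q) < self.capacity]
        if not options:
            self.dropped.append(pkt)
            return
        target = min(options, key=lambda i: len(self.queues[i]))
        self.queues[target].append(pkt)


class ReorderBuffer:
    """Receive-side shim that releases each flow's packets in sequence order."""

    def __init__(self) -> None:
        self.next_seq: Dict[int, int] = {}
        self.pending: Dict[int, List[int]] = {}

    def receive(self, pkt: Packet) -> List[int]:
        """Return the sequence numbers handed to transport, in order."""
        expected = self.next_seq.setdefault(pkt.flow_id, 0)
        heap = self.pending.setdefault(pkt.flow_id, [])
        if pkt.seq < expected or pkt.seq in heap:
            return []  # duplicate: already delivered or already buffered
        heapq.heappush(heap, pkt.seq)
        released: List[int] = []
        # Release the longest in-order run that starts at the expected number.
        while heap and heap[0] == expected:
            released.append(heapq.heappop(heap))
            expected += 1
        self.next_seq[pkt.flow_id] = expected
        return released


if __name__ == "__main__":
    sw = Switch(num_ports=2, capacity=2)
    sw.enqueue(Packet(flow_id=1, seq=0, remaining_bytes=100), port=0)
    sw.enqueue(Packet(flow_id=2, seq=0, remaining_bytes=90_000), port=0)
    # Port 0 is now full; flow 2 has the most bytes left, so it is deflected.
    sw.enqueue(Packet(flow_id=3, seq=0, remaining_bytes=500), port=0)
    print([p.flow_id for p in sw.queues[0]])  # [1, 3]
    print([p.flow_id for p in sw.queues[1]])  # [2]

    rb = ReorderBuffer()
    print(rb.receive(Packet(flow_id=7, seq=1, remaining_bytes=0)))  # []
    print(rb.receive(Packet(flow_id=7, seq=0, remaining_bytes=0)))  # [0, 1]
```

Picking the flow with the most remaining bytes as the deflection victim follows the description's rule of re-routing packets from flows with more data to send, so packets of flows close to completion tend to stay on their original path; the reordering this introduces for long flows is what the receive-side buffer then repairs.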
Database: OpenAIRE