Popis: |
Conventional distributed data storage services, like databases and file systems, rely on replication for fault tolerance; as a consequence, the performance of these services depends heavily on the performance of the underlying replication system. Existing replication systems, built using a replication protocol (e.g., CURP), are implemented as user-level processes capable of performing replication with relatively low latencies (~10+ μs). However, such user-level processes are susceptible to performance degradation at scale due to software overheads (e.g., operating system and networking stack) and contention for server resources (e.g., CPU, disk, and memory) between multiple processes, leading to higher latencies with longer tails. In this paper, we posit that replication systems can achieve lower latency and scale without compromising (tail) latencies by exploiting the characteristics of emerging programmable data planes. To support our thesis, we build a system, called ARGUS, that takes advantage of the data plane's close proximity to the wire, minimal software overhead, and line-rate throughput to accelerate replication. ARGUS maps key components of a replication protocol (e.g., replication, storage, and recovery) to various processing and memory regions of SmartNICs equipped with a programmable match-action data plane. Our preliminary evaluation shows that, in comparison to CURP, ARGUS reduces mean and 99.9th-percentile latencies by 2x and 2.2x, respectively, with 6.7x higher throughput. In addition, it narrows the gap between the 99.9th-percentile and median latencies by about 3.3x. Finally, increasing the replication factor in ARGUS has a negligible effect on the tail latency of the system, i.e., an increase of 0.12 μs per witness compared to 12.86 μs in CURP. |