Testing consensus implementations using communication closure

Autor: Cezara Drăgoi, Burcu Kulahcioglu Ozkan, Filip Niksic, Rupak Majumdar, Constantin Enea
Přispěvatelé: Analyse Statique par Interprétation Abstraite (ANTIQUE), Département d'informatique de l'École normale supérieure (DI-ENS), École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria), Institut de Recherche en Informatique Fondamentale (IRIF (UMR_8243)), Centre National de la Recherche Scientifique (CNRS)-Université de Paris (UP), Koç University, Max Planck Institute for Software Systems (MPI-SWS), University of Pennsylvania [Philadelphia], Kulahcioglu Ozkan and Majumdar were supported in part by the Deutsche Forschungsgemeinschaft project 389792660 TRR 248 and by the European Research Council under the Grant Agreement 610150 (ERC Synergy Grant ImPACT). This work was done mainly when Cezara Drăgoi was affiliated with INRIA supported by the French National Research Agency ANR project SAFTA(12744-ANR-17-CE25-0008-01)., ANR-17-CE25-0008,SAFTA,Analyse statique des algorithmes distribués tolérants aux pannes(2017), Département d'informatique - ENS Paris (DI-ENS), École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS-PSL), Centre National de la Recherche Scientifique (CNRS)-Université Paris Cité (UPCité), University of Pennsylvania, Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Inria de Paris
Jazyk: angličtina
Rok vydání: 2021
Předmět:
Zdroj: SPLASH 2020 : ACM SIGPLAN conference on Systems, Programming, Languages, and Applications: Software for Humanity
SPLASH 2020 : ACM SIGPLAN conference on Systems, Programming, Languages, and Applications: Software for Humanity, Oct 2021, Chiccago / Virtual, United States. ⟨10.1145/3428278⟩
Proc. {ACM} Program. Lang.
Proceedings of the ACM on Programming Languages
Popis: Large scale production distributed systems are difficult to design and test. Correctness must be ensured when processes run asynchronously, at arbitrary rates relative to each other, and in the presence of failures, e.g., process crashes or message losses. These conditions create a huge space of executions that is difficult to explore in a principled way. Current testing techniques focus on systematic or randomized exploration of all executions of an implementation while treating the implemented algorithms as black boxes. On the other hand, proofs of correctness of many of the underlying algorithms often exploit semantic properties that reduce reasoning about correctness to a subset of behaviors. For example, the communication-closure property, used in many proofs of distributed consensus algorithms, shows that every asynchronous execution of the algorithm is equivalent to a lossy synchronous execution, thus reducing the burden of proof to only that subset. In a lossy synchronous execution, processes execute in lock-step rounds, and messages are either received in the same round or lost forever—such executions form a small subset of all asynchronous ones. We formulate the communication-closure hypothesis , which states that bugs in implementations of distributed consensus algorithms will already manifest in lossy synchronous executions and present a testing algorithm based on this hypothesis. We prioritize the search space based on a bound on the number of failures in the execution and the rate at which these failures are recovered. We show that a random testing algorithm based on sampling lossy synchronous executions can empirically find a number of bugs—including previously unknown ones—in production distributed systems such as Zookeeper, Cassandra, and Ratis, and also produce more understandable bug traces.
Databáze: OpenAIRE