AGGREGATHOR: Byzantine Machine Learning via Robust Gradient Aggregation

Authors:	Damaskinos, Georgios; El Mhamdi, El Mahdi; Guerraoui, Rachid; Guirguis, Arsany Hany Abdelmessih; Rouault, Sébastien Louis Alexandre
Subject:	Distributed machine learning; ml-ai; Byzantine resilience; fault tolerance
Description:	We present AGGREGATHOR, a framework that implements state-of-the-art robust (Byzantine-resilient) distributed stochastic gradient descent. Following the standard parameter server model, we assume that a minority of worker machines can be controlled by an adversary and behave arbitrarily. This setting has been studied theoretically, with several existing approaches relying on a robust aggregation of the workers’ gradient estimates. Yet the question remains whether a Byzantine-resilient aggregation can leverage more workers to speed up learning. We answer this theoretical question and implement these state-of-the-art theoretical approaches in AGGREGATHOR to assess their practical costs. We built AGGREGATHOR around TensorFlow and introduce modifications to vanilla TensorFlow that make it usable in an actual Byzantine setting. AGGREGATHOR also permits unreliable gradient transfer over UDP, which provides a further speed-up (without losing accuracy) over TensorFlow’s native TCP-based communication protocols in saturated networks. We quantify the overhead of Byzantine resilience in AGGREGATHOR at 19% and 43% (to ensure weak and strong Byzantine resilience, respectively) compared to vanilla TensorFlow.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=od_______185::803f930c8856ba090afd07ad7f6f9740 https://infoscience.epfl.ch/record/265684