Topology-aware resource management for HPC applications

Author:	Adèle Villiermet, Emmanuel Jeannot, Guillaume Mercier, Yiannis Georgiou
Contributors:	Bull SAS (Bull); Topology-Aware System-Scale Data Management for High-Performance Computing (TADAAM); Laboratoire Bordelais de Recherche en Informatique (LaBRI); Centre National de la Recherche Scientifique (CNRS); École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB); Université Sciences et Technologies - Bordeaux 1; Université Bordeaux Segalen - Bordeaux 2; Université de Bordeaux (UB); Inria Bordeaux - Sud-Ouest; Institut National de Recherche en Informatique et en Automatique (Inria); Institut Polytechnique de Bordeaux (Bordeaux INP); ITEA3 COLOC #13024; ANR-13-INFR-0001, MOEBUS, Gestion de ressources multi-objectifs pour plates-formes de calcul à large échelle (2013)
Language:	English
Publication year:	2017
Subject:	[INFO] Computer Science [cs]; [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]; Computer science; Distributed computing; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020203 distributed computing; 020201 artificial intelligence & image processing; Topology; Scheduling (computing); scientific computing; topology-aware placement; process placement; Plug-in; resource management; resource manager; Slurm; Global system; Emulation; Total cost of ownership; Supercomputer; Job management; job allocation; Computing platforms; HPC; Computer network; System software
Source:	ICDCN 2017 - 18th International Conference on Distributed Computing and Networking, Jan 2017, Hyderabad, India. ⟨10.1145/3007748.3007768⟩; [Research Report] RR-8859, Inria Bordeaux Sud-Ouest; Bordeaux INP; LaBRI - Laboratoire Bordelais de Recherche en Informatique. 2016, pp.17
DOI:	10.1145/3007748.3007768
Abstract:	The Resource and Job Management System (RJMS) is a crucial piece of system software in the HPC stack. It is responsible for efficiently delivering computing power to applications in supercomputing environments. Its main intelligence lies in the resource selection techniques it uses to find the most suitable resources on which to schedule users' jobs. Improper resource selection may lead to poor execution performance and poor global system utilization, along with increased system fragmentation and job starvation. These phenomena contribute to the platforms' total cost of ownership and should be minimized. This paper introduces a new method that takes into account both the topology of the machine and the characteristics of the application to determine the best choice among the available nodes of the platform, based on their position within the network and on the application's communication pattern. To validate our approach, we integrate this algorithm as a plugin for Slurm, a popular and widespread RJMS. We assess our plugin with different optimization schemes by comparing it with the default topology-aware Slurm algorithm, using both emulation and simulation of a large-scale platform, and by carrying out experiments on a real cluster. We show that transparently taking into account the job communication pattern and the topology allows for relevant performance gains.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::9cc14fe7c999a37a8a3e3f888caa9f53 https://hal.inria.fr/hal-01414196