ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies

Author: Turilli, Matteo; Hategan-Marandiuc, Mihael; Titov, Mikhail; Maheshwari, Ketan; Alsaadi, Aymen; Merzky, Andre; Arambula, Ramon; Zakharchanka, Mikhail; Cowan, Matt; Wozniak, Justin M.; Wilke, Andreas; Kilic, Ozgur Ozan; Chard, Kyle; da Silva, Rafael Ferreira; Jha, Shantenu; Laney, Daniel
Publication year: 2024
Subject:
Document type: Working Paper
Description: Scientific discovery increasingly requires executing heterogeneous scientific workflows on high-performance computing (HPC) platforms. Heterogeneous workflows contain different types of tasks (e.g., simulation, analysis, and learning) that need to be mapped, scheduled, and launched on different computing resources. That requires a software stack that enables users to code their workflows and automate resource management and workflow execution. Currently, there are many workflow technologies with diverse levels of robustness and capabilities, and users face difficult choices of software that can effectively and efficiently support their use cases on HPC machines, especially when considering the latest exascale platforms. We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK). The SDK is a curated collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms. We present our experience with (1) curating those technologies, (2) integrating them to provide users with new capabilities, (3) developing a continuous integration platform to test the SDK on DOE HPC platforms, (4) designing a dashboard to publish the results of those tests, and (5) devising an innovative documentation platform to help users use those technologies. Our experience details the requirements and the best practices needed to curate workflow technologies, and it also serves as a blueprint for the capabilities and services that DOE will have to offer to support a variety of heterogeneous scientific workflows on the newly available exascale HPC platforms.
Database: arXiv