sagetasks: a Python package for data and workflow orchestration in the cloud

Authors: Grande, Bruno; Thyer, Tess M.; Eddy, James; Yu, Thomas; O'Connor, Brian
Publication year: 2022
DOI: 10.6084/m9.figshare.21097150.v1
Description: Workflow execution engines such as Cavatica and Nextflow Tower provide several benefits for processing scientific data, including scalability, portability, and caching. However, these workflows are often part of larger extract-transform-load (ETL) pipelines for a number of reasons, including (1) input and output data being stored in different locations, which is common in the cloud context, (2) workflow parameters needing to be extracted from other data sources (e.g., a file manifest, a database query) and formatted accordingly, and (3) multiple community-curated workflows needing to be chained together. These use cases motivated us to develop the sagetasks Python package. This tool is a growing collection of reusable functions for orchestrating data and workflows on various platforms. Our ultimate goal is to call these functions from a general-purpose workflow management system (WMS) such as Airflow to streamline and automate the deployment of these ETL pipelines. A WMS would also provide pipeline monitoring, scheduling, and logging. We aim to be platform-agnostic and leverage standard APIs where possible. Notably, the GA4GH Workflow Execution Service (WES) presents an opportunity to minimize the amount of platform-specific code in favor of a common abstraction. This poster will present the considerations we have identified for deploying a WMS to manage high-level ETL pipelines. We will also discuss suggested changes to the WES API that would improve its integration with a WMS. In summary, we believe that sagetasks can be a valuable addition to the GA4GH ecosystem as a real-world client of the WES API, which we will use to facilitate a variety of workflow needs, including data coordination projects and model-to-data analysis applications.
Database: OpenAIRE
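
As a rough illustration of the orchestration pattern described above (reusable task functions invoked from a general-purpose WMS such as Airflow), the following is a minimal sketch of an Airflow DAG that prepares workflow parameters and hands them to a launch step. The Airflow 2.x TaskFlow API shown is real; the sagetasks import referenced in a comment, the parameter names, and the launch logic are hypothetical placeholders, not the package's actual interface.

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2022, 1, 1), catchup=False)
def etl_pipeline():
    @task
    def build_params() -> dict:
        # Extract workflow parameters from an upstream source (e.g., a file
        # manifest or a database query) and format them for the target platform.
        return {"workflow_id": "demo-wf", "input_uri": "s3://bucket/manifest.csv"}

    @task
    def launch_workflow(params: dict) -> str:
        # Placeholder for a sagetasks-style call, e.g.:
        #   from sagetasks import some_platform_module   # hypothetical import
        #   run_id = some_platform_module.launch(**params)
        # A WES-based implementation would instead submit the run to any
        # GA4GH WES-compliant endpoint, minimizing platform-specific code.
        return f"run-for-{params['workflow_id']}"

    launch_workflow(build_params())


etl_pipeline()

In this arrangement, the WMS supplies monitoring, scheduling, and logging, while the reusable functions encapsulate the platform-specific (or WES-based) data and workflow operations.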