SDM: A Scientific Dataset Delivery Platform

Authors: John H. Hartman, Larry L. Peterson, Illyoung Choi, Jude Nelson
Publication year: 2019
Subject:
Source: eScience
DOI: 10.1109/escience.2019.00049
Description: Scientific computing is becoming more data-centric and more collaborative, requiring increasingly large datasets to be transferred across the Internet. Transferring these datasets efficiently and making them accessible to scientific workflows is an increasingly difficult task. In addition, the data transfer time can be a significant portion of the overall workflow running time. This paper presents SDM (Syndicate Dataset Manager), a scientific dataset delivery platform. Unlike general-purpose data transfer tools, SDM offers on-demand access to remote scientific datasets. On-demand access does not require staging datasets to local file systems before computing on them, and it also enables overlapping computation and I/O. In addition, SDM offers a simple interface for users to locate and access datasets. To validate the usefulness of SDM, we performed realistic metagenomic sequence analysis workflows on remote genomic datasets. In general, SDM configured with a CDN outperforms existing data access methods. With warm CDN caches, SDM completes the workflow 17-20% faster than staging methods. Its performance is even comparable to local storage: SDM is only 9% slower than local HDD storage and 18% slower than local SSD storage. Together, its performance and its ease of use make SDM an attractive platform for performing scientific workflows on remote datasets.
Database: OpenAIRE