On Assisting Scientific Data Curation in Collection-Based Dataflows Using Labels

Authors: Pinar Alper, Khalid Belhajjame, Carole A. Goble
Year of publication: 2013
Description: Thanks to the proliferation and adoption of computational tools and analysis, scientists are nowadays producing large amounts of datasets. Sharing and publishing such datasets is key to scientific progress, e.g., scientists can analyze datasets produced by their peers to investigate a new hypothesis. Genuine reuse of such datasets can however only be achieved if they are curated using metadata that describe, among other aspects, the context in which they were produced, the datasets from which they were derived and the people involved in their generation. By and large, the curation process is manual, tedious, repetitive and time-consuming. In this paper, we investigate the problem of curating data artifacts resulting from workflow-based analyses. Scientific workflows have gained momentum in the last decade as a means for specifying and automating the repetitive execution of experiments. Most workflow systems have been instrumented to automatically gather provenance information about the data artifacts generated as a result of the workflow execution. While such raw provenance traces provide useful information on the lineage of the data artifacts, our interactions with scientists from modern sciences, in particular bioinformatics and biodiversity, suggest that they are not sufficient for curating data artifacts from the data publication point of view. To assist scientists in the curation of such data artifacts, we propose in this paper a novel approach that semi-automates the curation process by exploiting the specification of the workflow incarnating the experiment, the raw provenance traces resulting from its execution as well as motif annotations that describe the data manipulation carried out by the workflow steps. We semi-formally describe the elements of our solution, and showcase its usefulness using a real use case from the biodiversity field.
Database: OpenAIRE