Busy Bees: An exemplary solution to data-handling and -transport problems for research depending on external staff and volunteers

Authors: von Waldow, Harald; Stahl, Johanna; Thomas, Celina; Hoedt, Florian
Year of publication: 2022
DOI: 10.5281/zenodo.7213191
Description:

Introduction
The project “National Monitoring of Biodiversity in Agricultural Landscapes” (MonViA) started in March 2019, will run through 2023, and is a collaborative effort of the Thünen Institute, the Julius-Kühn Institute and the Federal Office for Agriculture and Food. The Thünen Institute for Biodiversity is in charge of one of the sub-projects, “Population-friendly wild bee monitoring in agricultural landscapes - conserving and promoting wild bees together with volunteers”. Next to the natural-science aspect, concerned with the development, evaluation and application of trap nests for the large-scale monitoring of wild bee species, abundance and trophic interactions, an explicit goal is to find out whether the long-term and nationwide involvement of volunteers can be counted on to produce high-quality data.

Set-up
About 500 trap nests are distributed across Germany. The trap nests are cared for and maintained by about 100 volunteers. One trap nest comprises 25 boards. The raw data we are concerned with here are photographs of these boards. One board contains 10 borings in which the actual nesting takes place; one boring contains an unspecified number of brood cells. There are 6 observation campaigns per year, one per month (April - September). This results in 75,000 images per year, with a volume of up to 750 GB.

Raw data collection and pre-sorting
The volunteers photograph boards and possibly other motifs (the trap nest situation, less related material, …). The images are copied to USB sticks and sent to researchers at the Thünen Institute, who then sort this material manually. That is unavoidable, since not all volunteers can be expected to deliver machine-readably sorted and labeled images. The images from the USB sticks are copied to the internal storage system (SMB shares). Pre-sorting consists of singling out any images that do not depict trap nest boards and placing the images in a specific directory tree.
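The figures above imply the stated yearly volume. A minimal sketch, using only the numbers given in the text (the per-image size of roughly 10 MB is inferred from 750 GB over 75,000 images, not stated explicitly):

```python
# Yearly data volume implied by the set-up described above.
TRAP_NESTS = 500
BOARDS_PER_NEST = 25
CAMPAIGNS_PER_YEAR = 6  # one per month, April - September

images_per_year = TRAP_NESTS * BOARDS_PER_NEST * CAMPAIGNS_PER_YEAR
print(images_per_year)  # 75000 images per year

MB_PER_IMAGE = 10  # inferred average, not stated in the text
print(images_per_year * MB_PER_IMAGE / 1000)  # 750.0 GB per year
```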
Data preparation and transfer to external research assistants
Information from the images needs to be extracted and recorded in a spreadsheet by research assistants with special training (but see section “Potential improvements”).

Challenges
The research assistants do not have access to the internal SMB shares where the raw data is stored; they need to download the images from their workplaces at home. Project-specific data extraction from the Thünen Institute network is currently realized via ThünenCloud, a NextCloud installation. The chance that research assistants mistakenly assign an observation to the wrong board needs to be minimized. It is necessary to provide a user-friendly and error-resilient workflow for data entry, where the software components for this workflow are already prescribed: a web browser pointed at a shared NextCloud folder, a graphical file manager (e.g. Windows Explorer), an image viewer, and a data-entry spreadsheet.

Solution
A Python program, run regularly on a dedicated VM, creates “packages” in the form of zip archives that are copied to shared folders on the NextCloud instance (mounted as a WebDAV filesystem) and can be downloaded by the research assistants. The packages extract to a specific folder structure containing the images to be examined, along with pre-filled entry forms (spreadsheets). Special care has been taken to organize the folder structure, image names, and entry forms so that data-entry errors are minimized and data-entry efficiency and ease are maximized. We provide one spreadsheet per image to reduce the chance that observations are assigned to the wrong board. Each spreadsheet is pre-filled with all information that can be deduced from the raw input-data structure. The image/spreadsheet pair is presented as the only content of one sub-folder in the zip archive. The spreadsheet itself was designed to guard against erroneous entries by implementing functionality for validation and user-friendliness.
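The packaging step described above (one sub-folder per image, each with a pre-filled entry form, zipped for distribution) could be sketched as follows. This is an illustrative sketch, not the project's actual code: all names (`make_package`, the CSV form layout, the column headers) are hypothetical, and a plain CSV stands in for the real spreadsheet format.

```python
"""Illustrative sketch of the packaging step: each board image gets its
own sub-folder containing the image and a pre-filled entry form, and the
whole tree is written into one zip archive for the NextCloud share."""
import csv
import io
import zipfile
from pathlib import Path


def make_package(image_paths, archive_path):
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for img in map(Path, image_paths):
            sub = img.stem  # one sub-folder per image/spreadsheet pair
            zf.write(img, f"{sub}/{img.name}")
            # Pre-fill the entry form with everything deducible from the
            # raw input-data structure (here: board id and boring number).
            form = io.StringIO()
            writer = csv.writer(form)
            writer.writerow(["board_id", "boring", "observation"])
            for boring in range(1, 11):  # 10 borings per board
                writer.writerow([sub, boring, ""])
            zf.writestr(f"{sub}/{sub}_entry_form.csv", form.getvalue())
```

The one-pair-per-folder layout mirrors the design goal stated above: the assistant can only ever see one image next to one form, which makes wrong-board assignments much harder.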
The system is backed by a relational database within a centrally provided PostgreSQL RDBMS. We implement the system using Docker containers to improve deployment and re-use options; the jury is still out on whether the benefits of this approach outweigh its drawbacks. Since data safety and integrity have the highest priority, the program is an exercise in defensive design. Meta-information exists in pathnames and in the database. Moving data reliably between different filesystems on different machines, whose load, latency and general availability are not known, requires extensive checks. Once a package is created for distribution, it is also copied to a secure internal location. Before the original file is removed, we verify that hashes calculated over the original and the two copies are identical.

Potential improvements
Some elements of this part of the data lifecycle might seem inefficient and old-fashioned at first sight, e.g. the transport via USB stick and the use of Excel spreadsheets. Taking into account the resources that would be required for an alternative solution, and the involvement of a large number of volunteers with heterogeneous technical affinity and skills, equipment, internet connections etc., it is hard to beat the USB-stick solution. The lowest-hanging fruit is perhaps the replacement of error-prone Excel spreadsheets with a web-based data-entry service.

Further developments
The system as described above is currently entering beta testing. By the time the workshop is held, we hope to have received user feedback, which we would of course also report at the workshop. An immediate next work package is support for handling the entry forms returned by the research assistants as uploads to ThünenCloud. We aim to provide a semi-automatic mechanism that loads these data directly into a central, project-specific relational database.
In the future, we anticipate helping with the publication of these data (and/or data artifacts further down the process chain of scientific production) through the Thünen Atlas (https://atlas.thuenen.de).
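The defensive copy-then-verify step described in the Solution section (copy the package to the distribution share and a secure internal location, and remove the original only after the hashes of all three files agree) could be sketched like this. The function names and error handling are illustrative assumptions, not the project's actual implementation:

```python
"""Sketch of a hash-verified transfer: remove the original only after
both copies are proven byte-identical via SHA-256."""
import hashlib
import shutil
from pathlib import Path


def sha256(path, chunk_size=1 << 20):
    """Hash a file incrementally so large packages need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()


def distribute(original, share_dir, backup_dir):
    original = Path(original)
    copies = [Path(d) / original.name for d in (share_dir, backup_dir)]
    for dst in copies:
        shutil.copy2(original, dst)
    reference = sha256(original)
    if not all(sha256(c) == reference for c in copies):
        # Keep the original: a copy is corrupt or incomplete.
        raise OSError(f"hash mismatch, keeping original {original}")
    original.unlink()  # safe to remove only after verification
```

Hashing after the copy, rather than trusting the filesystem's return codes, is what makes the step robust against the unknown load, latency and availability of the remote (WebDAV-mounted) filesystems mentioned above.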
Database: OpenAIRE