Specimen Data Refinery: A novel approach to automating digitisation

Authors: Livermore, Laurence; Brack, Paul; Scott, Ben; Woolland, Oliver
Year of publication: 2022
DOI: 10.6084/m9.figshare.19947845
Description: Conference: SPNHC 2022. Session: Identifiers and labels in natural history collections. Presentation date: 2022-06-07. Location: Edinburgh, UK. Abstract: There are two main rate-limiting steps in the mass digitisation of natural history collections: 1) physical handling, the rate at which we can retrieve, select and prepare specimens for digitisation and then return them to the collections; 2) the extraction of data from images, either from the specimen itself or from its labels, e.g. measurements, transcription and georeferencing. Over the past three years we have been developing the Specimen Data Refinery (SDR) to dramatically scale up the automated extraction of data from specimen images in a way that conforms to the FAIR (Findable, Accessible, Interoperable and Reusable) principles. The SDR uses a series of machine learning models, packaged as modular tools, that perform semantic segmentation, optical character recognition, handwritten text recognition, barcode reading and natural language processing to identify labels, text lines and named entities. We present the SDR and an evaluation of its use in automating the linkage between specimens, their UIDs and related linked data such as taxonomy, people and geographic names. We will discuss outstanding challenges and the potential for future development.
Database: OpenAIRE