Cooperative human-machine data extraction from biological collections
Author: | Mauricio Tsugawa, José A. B. Fortes, Andrea Matsunaga, Icaro Alzuru |
---|---|
Year of publication: | 2016 |
Subject: | 0106 biological sciences; Information retrieval; business.industry; Computer science; Process (engineering); 010607 zoology; Optical character recognition; Crowdsourcing; computer.software_genre; 010603 evolutionary biology; 01 natural sciences; Software; Workflow; Data extraction; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Tesseract; Data mining; business; computer; Digitization |
Source: | eScience |
DOI: | 10.1109/escience.2016.7870884 |
Description: | Historical data sources, like medical records or biological collections, consist of unstructured heterogeneous content: handwritten text, different sizes and types of fonts, and text overlapped with lines, images, stamps, and sketches. The information these documents can provide is important, from a historical perspective and mainly because we can learn from it. The automatic digitization of these historical documents is a complex machine learning process that usually produces poor results, requiring costly interventions by experts, who have to transcribe and interpret the content. This paper describes hybrid (Human- and Machine-Intelligent) workflows for scientific data extraction, combining machine-learning and crowdsourcing software elements. Our results demonstrate that the mix of human and machine processes has advantages in data extraction time and quality, when compared to a machine-only workflow. More specifically, we show how OCRopus and Tesseract, two widely used open-source Optical Character Recognition (OCR) tools, can improve their accuracy by more than 42% when text areas are cropped by humans prior to OCR, while the total time can increase or decrease depending on the OCR selection. The digitization of 400 images, with Entomology, Bryophyte, and Lichen specimens, is evaluated following four different approaches: processing the whole specimen image (machine-only), processing crowd-cropped labels (hybrid), processing crowd-cropped fields (hybrid), and cleaning the machine-only output. As a secondary result, our experiments reveal differences in speed and quality between Tesseract and OCRopus. |
Database: | OpenAIRE |