Ensemble Labeling Towards Scientific Information Extraction (ELSIE)

Authors: Erin Murphy, Jacob D. Furst, Roselyne Tchoua, Alexander Rasin, Daniela Raicu
Year of publication: 2021
Subject:
Source: Computational Science – ICCS 2021 ISBN: 9783030779603
ICCS (1)
DOI: 10.1007/978-3-030-77961-0_60
Description: Extracting scientific facts from unstructured text is difficult because the scientific named entities and relations to be extracted are themselves complex. This problem is well illustrated by the extraction of polymer names and their properties. Even when the property is a temperature, identifying the polymer name associated with that temperature may require expertise, both because of complicated naming conventions and because new polymer names are continually “introduced” into the lexicon as polymer science advances. While domain-specific machine learning toolkits exist that address these challenges, perhaps the greatest challenge is the lack of labeled data to train these machine learning models, since producing such data is time-consuming, error-prone, and costly. This work repurposes Snorkel, a data programming tool, in a novel way: to identify sentences that contain the relation of interest in order to generate training data, as a first step towards extracting the entities themselves. By achieving 94% recall and an F1 score of 0.92, compared to human experts who achieve 77% recall and an F1 score of 0.87, we show that our system captures sentences missed by both a state-of-the-art domain-aware natural language processing toolkit and human expert labelers. We also demonstrate the importance of identifying the complex sentences prior to extraction by comparing our application to the natural language processing toolkit.
Database: OpenAIRE
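
Illustration: the description above refers to labeling sentences with an ensemble of weak heuristics in the style of data programming. The following is a minimal sketch of that general idea in plain Python, not the authors' implementation and not the Snorkel API. The three labeling functions, their regular expressions, and the example sentences are hypothetical, and a simple majority vote stands in for the learned label model that Snorkel would normally use to combine votes.

```python
import re
from collections import Counter

# Label values (hypothetical convention for this sketch).
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1


def lf_has_temperature(sentence):
    """Vote POSITIVE if the sentence contains a value like '100 °C' or '373 K'."""
    pattern = r"\d+(\.\d+)?\s*(°\s?C|K)\b"
    return POSITIVE if re.search(pattern, sentence) else ABSTAIN


def lf_mentions_glass_transition(sentence):
    """Vote POSITIVE if the sentence names the property ('glass transition' or 'Tg')."""
    pattern = r"glass transition|\bT_?g\b"
    return POSITIVE if re.search(pattern, sentence) else ABSTAIN


def lf_no_digits(sentence):
    """Vote NEGATIVE if the sentence has no digits, so it cannot state a value."""
    return NEGATIVE if not re.search(r"\d", sentence) else ABSTAIN


LABELING_FUNCTIONS = [lf_has_temperature, lf_mentions_glass_transition, lf_no_digits]


def majority_vote(sentence, labeling_functions=LABELING_FUNCTIONS):
    """Combine non-abstaining votes; return ABSTAIN on no votes or a tie."""
    votes = [lf(sentence) for lf in labeling_functions]
    counts = Counter(v for v in votes if v != ABSTAIN)
    positive, negative = counts[POSITIVE], counts[NEGATIVE]
    if positive == negative:
        return ABSTAIN
    return POSITIVE if positive > negative else NEGATIVE


if __name__ == "__main__":
    examples = [
        "The glass transition temperature of polystyrene is 100 °C.",
        "Polymers are widely used in packaging.",
        "Samples were prepared in 3 batches.",
    ]
    for text in examples:
        print(majority_vote(text), text)
```

Sentences that receive a POSITIVE label would become candidate training data for a downstream extractor, while abstentions mark sentences the heuristics cannot decide.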