A Methodology for Open Information Extraction and Representation from Large Scientific Corpora: The CORD-19 Data Exploration Use Case

Authors: Antonis Litke, Nikolaos Papadakis, Dimitris Papadopoulos
Publication year: 2020
Subjects:
Information extraction
Bioinformatics
Computer science
Triple extraction
02 engineering and technology
Ontology (information science)
computer.software_genre
lcsh:Technology
lcsh:Chemistry
Set (abstract data type)
03 medical and health sciences
0202 electrical engineering, electronic engineering, information engineering
Redundancy (engineering)
General Materials Science
Representation (mathematics)
Data mining
lcsh:QH301-705.5
Instrumentation
030304 developmental biology
Fluid Flow and Transfer Processes
0303 health sciences
Coreference
lcsh:T
business.industry
Process Chemistry and Technology
General Engineering
Pipeline (software)
Automatic summarization
lcsh:QC1-999
Computer Science Applications
lcsh:Biology (General)
lcsh:QD1-999
lcsh:TA1-2040
020201 artificial intelligence & image processing
Artificial intelligence
lcsh:Engineering (General). Civil engineering (General)
business
computer
lcsh:Physics
Natural language processing
Source: Applied Sciences
Volume 10
Issue 16
Applied Sciences, Vol 10, Iss 16, p 5630 (2020)
ISSN: 2076-3417
DOI: 10.3390/app10165630
Description: The usefulness of automated information extraction tools in generating structured knowledge from unstructured and semi-structured machine-readable documents is limited by challenges related to the variety and intricacy of the targeted entities, the complex linguistic features of heterogeneous corpora, and the computational resources available for readily scaling to large amounts of text. In this paper, we argue that handling the redundancy and ambiguity of subject–predicate–object (SPO) triples in open information extraction systems has to be treated as an equally important step in order to ensure the quality and preciseness of the generated triples. To this end, we propose a pipeline approach for information extraction from large corpora, encompassing a series of natural language processing tasks. Our methodology consists of four steps: (i) in-place coreference resolution, (ii) extractive text summarization, (iii) parallel triple extraction, and (iv) entity enrichment and graph representation. We demonstrate our methodology on a large medical dataset (CORD-19), relying on state-of-the-art tools to carry out these steps and extract triples that are subsequently mapped to a comprehensive ontology of biomedical concepts. We evaluate the effectiveness of our information extraction method by comparing it in terms of precision, recall, and F1-score with state-of-the-art OIE engines and demonstrate its capabilities on a set of data exploration tasks.
Database: OpenAIRE