Automated extraction of information from free text of Spanish oncology pathology reports.

Authors: Mendoza-Urbano DM (Universidad Nacional de Colombia, Facultad de Medicina, Departamento de Patología, Bogotá, Colombia); Garcia JF (Quantil SAS, Bogotá, Colombia); Moreno JS (Quantil SAS, Bogotá, Colombia; Centro de Analítica para Políticas Públicas, Bogotá, Colombia); Bravo-Ocaña JC (Fundación Valle del Lili, Departamento de Patología, Cali, Colombia); Riascos AJ (Quantil SAS, Bogotá, Colombia; Centro de Analítica para Políticas Públicas, Bogotá, Colombia; Universidad de los Andes, Facultad de Economía, Bogotá, Colombia); Zambrano Harvey A (Fundación Valle del Lili, Departamento de Hemato-Oncología, Cali, Colombia); Prada SI (Fundación Valle del Lili, Centro de Investigaciones Clínicas, Cali, Colombia; Universidad Icesi, Centro PROESA, Cali, Colombia)
Language: English
Source: Colombia medica (Cali, Colombia) [Colomb Med (Cali)] 2023 Mar 30; Vol. 54 (1), pp. e2035300. Date of Electronic Publication: 2023 Mar 30 (Print Publication: 2023).
DOI: 10.25100/cm.v54i1.5300
Abstract: Background: Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text with linguistic variability among pathologists. For this reason, extracting tumor information requires significant human effort. Recording data in an efficient and high-quality format is essential for implementing and establishing a hospital-based cancer registry.
Objective: This study aimed to describe the implementation of a natural language processing algorithm for oncology pathology reports.
Methods: An algorithm was developed to process oncology pathology reports in Spanish and extract 20 medical descriptors. The approach is based on the successive matching of regular expressions.
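As an illustration only, the idea of successive regular-expression matching can be sketched in Python as follows. The abstract does not list the authors' patterns or descriptors in detail, so the pattern lists, the descriptor names, the `extract_first` and `extract_descriptors` helpers, and the sample report text are all hypothetical assumptions, not the published algorithm.

```python
import re
from typing import Dict, List, Optional, Pattern

# Hypothetical patterns (assumptions, not the authors' rules).
# For each descriptor, patterns are tried in order and the first match wins.
PATTERNS: Dict[str, List[Pattern[str]]] = {
    "topography": [
        re.compile(r"localizaci[oó]n\s*:\s*(?P<value>[^.,;\n]+)", re.IGNORECASE),
        re.compile(r"biopsia\s+de\s+(?P<value>[^.,;\n]+)", re.IGNORECASE),
    ],
    "morphology": [
        re.compile(r"(?P<value>(?:adeno)?carcinoma[^.,;\n]*)", re.IGNORECASE),
    ],
    "grade": [
        re.compile(r"grado\s+(?P<value>[1-3]|I{1,3})\b", re.IGNORECASE),
    ],
}


def extract_first(text: str, patterns: List[Pattern[str]]) -> Optional[str]:
    """Try each pattern in turn and return the first captured value, if any."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            # Normalize case and surrounding whitespace of the captured span.
            return match.group("value").strip().lower()
    return None


def extract_descriptors(text: str) -> Dict[str, Optional[str]]:
    """Apply the pattern list of every descriptor to one report."""
    return {name: extract_first(text, pats) for name, pats in PATTERNS.items()}


# Invented sample text, used only to exercise the sketch.
report = "DIAGNOSTICO: BIOPSIA DE MAMA IZQUIERDA. CARCINOMA DUCTAL INFILTRANTE, GRADO 2."
print(extract_descriptors(report))
```

Ordering the patterns from most to least specific lets an explicit field label take precedence over a looser phrase, and a descriptor with no match is reported as `None` rather than guessed.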
Results: The validation was performed with 140 pathology reports. Topography was identified in all reports both by human reviewers and by the algorithm. Morphology was identified by human reviewers in 138 reports and by the algorithm in 137. The average fuzzy matching score was 68.3 for Topography and 89.5 for Morphology.
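The abstract does not state which fuzzy matching metric was used. As a rough sketch of how such a score could be computed on a 0 to 100 scale, the example below uses the ratio from Python's standard `difflib.SequenceMatcher` as a stand-in; the `fuzzy_score` and `average_score` helpers are hypothetical names, and this metric is an assumption that would not necessarily reproduce the reported values.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Iterable, Tuple


def fuzzy_score(human: str, machine: str) -> float:
    """Similarity between two strings on a 0-100 scale.

    Assumption: difflib's ratio is used as a stand-in metric; the study's
    actual scoring function is not given in the abstract.
    """
    # Lowercase and collapse whitespace so formatting differences are ignored.
    a = " ".join(human.lower().split())
    b = " ".join(machine.lower().split())
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def average_score(pairs: Iterable[Tuple[str, str]]) -> float:
    """Mean fuzzy score over (human, algorithm) pairs; raises on empty input."""
    return mean(fuzzy_score(h, m) for h, m in pairs)


# Invented example pairs, used only to exercise the sketch.
pairs = [
    ("mama izquierda", "Mama  izquierda"),
    ("carcinoma ductal infiltrante", "carcinoma ductal"),
]
print(round(average_score(pairs), 1))
```

A character-level ratio of this kind rewards partial overlap, so a shorter but correct extraction still earns partial credit instead of counting as a miss.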
Conclusions: A preliminary validation of the algorithm against human extraction was performed over a small set of reports with satisfactory results. This shows that a regular-expression approach can accurately and precisely extract multiple specimen attributes from free-text Spanish pathology reports. Additionally, we developed a website to facilitate collaborative validation at a larger scale, which may be helpful for future research on the subject.
Competing Interests: The authors declare no conflict of interest.
(Copyright © 2023 Colombia Medica.)
Database: MEDLINE