Enhancing Case Capture, Quality, and Completeness of Primary Melanoma Pathology Records via Natural Language Processing

Authors: Lauren E. Haydu, Jeffrey E. Gershenwald, Shida Jin, Bryan Lari, Julie M. Simon, Trey Kell, Samuel P. Camp, Victor G. Prieto, Jared Malke
Publication year: 2019
Subject:
Source: JCO Clinical Cancer Informatics, pp. 1-11
ISSN: 2473-4276
DOI: 10.1200/cci.19.00006
Description: PURPOSE Medical records contain a wealth of data points valuable for clinical research. Most of these data points are stored in semistructured or unstructured legacy documents and must be manually abstracted into a structured format before the information becomes readily accessible, searchable, and analysis ready. The substantial labor this requires can be cost prohibitive, particularly for large patient cohorts. METHODS To establish a high-throughput approach to data abstraction, we developed a novel framework that uses natural language processing (NLP) and a decision-rules algorithm to extract, transform, and load (ETL) primary melanoma pathology features from pathology reports in an institutional legacy electronic medical record system into a structured database. We compared a subset of these data with a manually curated data set comprising the same patients and developed a novel scoring system to assess confidence in the records generated by the algorithm, thus obviating manual review of high-confidence records while flagging specific low-confidence records for review. RESULTS The algorithm generated 368,624 individual melanoma data points, comprising 16 primary tumor prognostic factors and metadata, from 23,039 patients. A subset of 147,872 of these data points was compared with an existing, manually abstracted data set, with an exact or synonymous match for 90.4% of all data points. Additionally, the confidence-scoring algorithm demonstrated an error rate of only 3.7%. CONCLUSION Our NLP platform can identify and abstract primary melanoma prognostic factors with accuracy comparable to that of manual abstraction (< 5% error rate) and with vastly greater efficiency. The principles used to develop this algorithm could be extended to other melanoma-specific data points and to disease-agnostic fields, further enhancing the capture of essential elements from unstructured data.
Database: OpenAIRE
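
The abstract describes rule-based extraction of pathology features, a confidence score that routes low-confidence records to manual review, and a comparison that counts exact or synonymous matches. The record does not give the authors' rules, fields, or scoring formula, so the following is only a minimal sketch of the general idea. Every name in it (the two fields, the regular expressions, ULCERATION_SYNONYMS, REVIEW_THRESHOLD, and the confidence values) is a hypothetical assumption chosen for illustration, not the published method.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical synonym table: maps report wording to a canonical value.
ULCERATION_SYNONYMS = {
    "present": "present",
    "identified": "present",
    "absent": "absent",
    "not identified": "absent",
    "none": "absent",
}

# Hypothetical cut-off: fields scoring below this are flagged for manual review.
REVIEW_THRESHOLD = 0.75


@dataclass
class Extraction:
    field: str
    value: Optional[str]
    confidence: float  # 0.0 (nothing found) to 1.0 (unambiguous)


def extract_breslow(text: str) -> Extraction:
    """Find Breslow thickness in mm; confidence drops when mentions disagree."""
    matches = re.findall(
        r"breslow[^0-9\n]{0,40}?(\d+(?:\.\d+)?)\s*mm", text, flags=re.IGNORECASE
    )
    if not matches:
        return Extraction("breslow_mm", None, 0.0)
    distinct = {float(m) for m in matches}
    confidence = 1.0 if len(distinct) == 1 else 0.5
    return Extraction("breslow_mm", matches[0], confidence)


def extract_ulceration(text: str) -> Extraction:
    """Find ulceration status and map it to a canonical value."""
    m = re.search(
        r"ulceration\s*[:\-]?\s*(not identified|present|identified|absent|none)",
        text,
        flags=re.IGNORECASE,
    )
    if m is None:
        return Extraction("ulceration", None, 0.0)
    raw = m.group(1).lower()
    # Canonical wording scores higher than wording resolved through a synonym.
    confidence = 1.0 if raw in {"present", "absent"} else 0.8
    return Extraction("ulceration", ULCERATION_SYNONYMS[raw], confidence)


def process_report(text: str):
    """Return the structured record and the fields that need manual review."""
    results = [extract_breslow(text), extract_ulceration(text)]
    record = {r.field: r.value for r in results}
    needs_review = [r.field for r in results if r.confidence < REVIEW_THRESHOLD]
    return record, needs_review


def values_match(auto: Optional[str], manual: Optional[str]) -> bool:
    """Exact match, or synonymous match after mapping through the synonym table."""
    if auto is None or manual is None:
        return auto == manual
    a, b = auto.strip().lower(), manual.strip().lower()
    return ULCERATION_SYNONYMS.get(a, a) == ULCERATION_SYNONYMS.get(b, b)
```

In this sketch, process_report returns a structured record together with the list of fields whose confidence falls below the threshold, and values_match can be applied field by field against a manually abstracted record to tally agreement.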