FasTag: Automatic text classification of unstructured medical narratives.

Authors: Venkataraman GR; Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, United States of America., Pineda AL; Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, United States of America., Bear Don't Walk IV OJ; Department of Biomedical Informatics, Vagelos College of Physicians and Surgeons, Columbia University, New York, NY, United States of America., Zehnder AM; Fauna Bio, San Francisco, CA, United States of America., Ayyar S; Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, United States of America., Page RL; Department of Clinical Sciences, College of Veterinary Medicine and Biomedical Sciences, Colorado State University, Fort Collins, CO, United States of America., Bustamante CD; Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, United States of America.; Chan Zuckerberg Biohub, San Francisco, CA, United States of America., Rivas MA; Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, United States of America.
Language: English
Source: PloS one [PLoS One] 2020 Jun 22; Vol. 15 (6), pp. e0234647. Date of Electronic Publication: 2020 Jun 22 (Print Publication: 2020).
DOI: 10.1371/journal.pone.0234647
Abstract: Unstructured clinical narratives are continuously being recorded as part of delivery of care in electronic health records, and dedicated tagging staff spend considerable effort manually assigning clinical codes for billing purposes. Despite these efforts, however, label availability and accuracy are both suboptimal. In this retrospective study, we aimed to automate the assignment of top-level International Classification of Diseases version 9 (ICD-9) codes to clinical records from human and veterinary data stores using minimal manual labor and feature curation. Automating top-level annotations could in turn enable rapid cohort identification, especially in a veterinary setting. To this end, we trained long short-term memory (LSTM) recurrent neural networks (RNNs) on 52,722 human and 89,591 veterinary records. We investigated the accuracy of both separate-domain and combined-domain models and probed model portability. We established relevant baseline classification performances by training Decision Trees (DT) and Random Forests (RF). We also investigated whether transforming the data using MetaMap Lite, a clinical natural language processing tool, affected classification performance. We showed that the LSTM-RNNs accurately classify veterinary and human text narratives into top-level categories, with average weighted macro F1 scores of 0.74 and 0.68, respectively. In the "neoplasia" category, the model trained on veterinary data had a high validation accuracy in veterinary data and moderate accuracy in human data, with F1 scores of 0.91 and 0.70, respectively. Our LSTM method scored slightly higher than the DT and RF models. The use of LSTM-RNN models represents a scalable structure that could prove useful in cohort identification for comparative oncology studies. Digitization of human and veterinary health information will continue to be a reality, particularly in the form of unstructured narratives. Our approach is a step forward for these two domains to learn from and inform one another.
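The abstract reports per-category F1 scores and an averaged F1 across top-level categories. As a rough illustration of how such figures can be computed, the following is a minimal sketch that assumes each record carries a set of top-level category labels and that the average is weighted by category support. The abstract does not spell out the exact averaging procedure, so the weighting scheme, the function names, and the toy labels are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Set


def f1_for_label(y_true: List[Set[str]], y_pred: List[Set[str]], label: str) -> float:
    """F1 for one category, treating it as a binary present/absent decision."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if label in t and label in p)
    fp = sum(1 for t, p in pairs if label not in t and label in p)
    fn = sum(1 for t, p in pairs if label in t and label not in p)
    denom = 2 * tp + fp + fn
    # Convention (assumption): F1 is 0.0 when the label never occurs or is predicted.
    return 2 * tp / denom if denom else 0.0


def support_weighted_f1(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Average of per-category F1, weighted by how often each category truly occurs."""
    labels = sorted(set().union(*y_true)) if y_true else []
    support = {lab: sum(1 for t in y_true if lab in t) for lab in labels}
    total = sum(support.values())
    if total == 0:
        return 0.0
    return sum(f1_for_label(y_true, y_pred, lab) * support[lab] for lab in labels) / total


# Hypothetical toy data: four records, two top-level categories.
toy_true = [{"neoplasia"}, {"neoplasia", "infectious"}, {"infectious"}, {"neoplasia"}]
toy_pred = [{"neoplasia"}, {"neoplasia"}, {"infectious"}, {"infectious"}]

if __name__ == "__main__":
    for lab in ("neoplasia", "infectious"):
        print(lab, round(f1_for_label(toy_true, toy_pred, lab), 3))
    print("weighted", round(support_weighted_f1(toy_true, toy_pred), 3))
```

Weighting by support lets frequent categories dominate the average; an unweighted mean over categories would instead treat rare and common categories equally, which may or may not match what the authors report.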
Competing Interests: CDB is Principal and Chairman of CDB Consulting LTD. He has advised Fauna Bio, Inc., Imprimed, Embark Vet and Etalon DX as a member of their respective Scientific Advisory Boards, and is a Director of Etalon DX. AMZ is the CEO of Fauna Bio, Inc. MAR is on the SAB of 54Gene and has advised BioMarin, MazeTx, Related Sciences, and Goldfinch Bio. ALP declares that the research presented in this study was done while he was employed by Stanford University, but at the time of submission, he is now employed by Genentech, Inc., a member of the Roche group. This does not alter our adherence to PLOS ONE policies on sharing data and materials. The remaining authors declare no conflicts of interest.
Database: MEDLINE