Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity

Authors:	Bin Yu, Briton Park, Nicholas Altieri, John DeNero, Anobel Y. Odisho
Year of publication:	2021
Subject:	Pathology; medicine.medical_specialty; AcademicSubjects/SCI01060; Information extraction; Computer science; Health Informatics; Research and Applications; computer.software_genre; Annotation; Prior probability; Machine learning; medicine; cancer; natural language processing; Lung; business.industry; Deep learning; Natural language processing; Lung Cancer; Class (biology); Networking and Information Technology R&D; pathology; Artificial intelligence; AcademicSubjects/SCI01530; String metric; AcademicSubjects/MED00010; Transfer of learning; business; computer; Natural language
Source:	JAMIA Open, vol 4, iss 3
Description:	Objective: We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer-to-cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared information between cancers and auxiliary class features, respectively, to boost performance using enriched annotations, which give both location-based information and document-level labels for each pathology report. Materials and Methods: Our data consists of 250 pathology reports each for kidney, colon, and lung cancer from 2002 to 2019 from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity trained on enriched annotations. We compare HCTC and ZSS methods to the state of the art, including conventional machine learning methods as well as deep learning methods. Results: For our HCTC method, we see an improvement of up to 0.1 micro-F1 score and 0.04 macro-F1 averaged across cancer and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 micro-F1 and 0.23 macro-F1 averaged across cancer and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations. Conclusions: Methods based on transfer learning across cancers and augmenting information methods with string similarity priors can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::b878ed89e3746bd4f0131f2ec34ba78c https://escholarship.org/uc/item/4hc8g9v3