The quest for better clinical word vectors: ontology based and lexical vector augmentation versus clinical contextual embeddings

Author:	Namrata Nath, Ivan Lee, Sang-Heon Lee, Mark D. McDonnell
Contributors:	Nath, Namrata, Lee, Sang-Heon, McDonnell, Mark, Lee, Ivan
Language:	English
Publication year:	2021
Subject:	0301 basic medicine; Word embedding; Computer science; Health Informatics; Ontology (information science); computer.software_genre; 03 medical and health sciences; Annotation; 0302 clinical medicine; Named-entity recognition; augmentation; Electronic Health Records; Humans; Word2vec; antonymy; Hyponymy and hypernymy; Natural Language Processing; business.industry; Unified Medical Language System; word embedding; Computer Science Applications; 030104 developmental biology; Artificial intelligence; business; computer; clinical word vectors; Algorithms; 030217 neurology & neurosurgery; Word (computer architecture); Natural language processing
Description:	Background: Word vectors or word embeddings are n-dimensional representations of words and form the backbone of Natural Language Processing of textual data. This research experiments with algorithms that augment word vectors with lexical constraints that are popular in NLP research and clinical domain constraints derived from the Unified Medical Language System (UMLS). It also compares the performance of the augmented vectors with Bio + Clinical BERT vectors, which have been trained and fine-tuned on clinical datasets. Methods: Word2vec vectors are generated for words in a publicly available de-identified Electronic Health Records (EHR) dataset and augmented by ontologies using three algorithms that have fundamentally different approaches to vector augmentation. The augmented vectors are then evaluated alongside publicly available Bio + Clinical BERT on their correlation with human-annotated lists using Spearman's correlation coefficient. They are also evaluated on the downstream task of Named Entity Recognition (NER). Quantitative and empirical evaluations are used to highlight the strengths and weaknesses of the different approaches. Results: The counter-fitted word2vec vectors augmented with information from the UMLS ontology produced the best correlation overall with human-annotated evaluation lists (Spearman's correlation of 0.733 with mini mayo doctors' annotation), while Bio + Clinical BERT produces the best results in the NER task (F1 of 0.87 and 0.811 on the i2b2 2010 and i2b2 2012 datasets respectively) in our experiments. Conclusion: Clinically adapted word2vec vectors successfully encapsulate concepts of lexical and clinical synonymy and antonymy and, to a smaller extent, hyponymy and hypernymy. Bio + Clinical BERT vectors perform better at NER and avoid out-of-vocabulary words. Refereed/Peer-reviewed
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::242c767a0e22d8ff829f3fdef544702e https://hdl.handle.net/11541.2/148139
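
Note:	The intrinsic evaluation named in the description (correlating vector similarity with human-annotated lists via Spearman's coefficient) can be illustrated with a minimal sketch. It assumes cosine similarity as the vector similarity measure, which the record does not state, and it skips pairs containing out-of-vocabulary words. The function names, the toy vectors, and the ratings are hypothetical and are not taken from the paper or its datasets.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, rated_pairs):
    """Correlate model similarities with human ratings.

    vectors: dict mapping word -> list of floats (hypothetical format).
    rated_pairs: iterable of (word1, word2, human_rating).
    Pairs with an out-of-vocabulary word are skipped (an assumption).
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores)


# Toy, made-up data for illustration only.
toy_vectors = {
    "pain": [1.0, 0.0],
    "ache": [0.9, 0.1],
    "renal": [0.5, 0.5],
    "fever": [0.0, 1.0],
}
toy_pairs = [
    ("pain", "ache", 3.8),
    ("pain", "renal", 2.0),
    ("pain", "fever", 1.0),
    ("pain", "unknown", 2.5),  # skipped: "unknown" has no vector
]
rho = evaluate(toy_vectors, toy_pairs)
```

In practice a library routine such as `scipy.stats.spearmanr` would replace the hand-written rank correlation; the pure-Python version is used here only to keep the sketch dependency-free.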