Word2Vec inversion and traditional text classifiers for phenotyping lupus

Authors: Paul E. Anderson, Jihad S. Obeid, Jim C. Oates, Cassios K. Marques, Diane L. Kamen, Clayton A. Turner, Alexander D. Jacobs
Language: English
Year of publication: 2017
Subject:
020205 medical informatics
Computer science
Bayesian probability
Datasets as Topic
Health Informatics
02 engineering and technology
computer.software_genre
Machine Learning
03 medical and health sciences
Naive Bayes classifier
0302 clinical medicine
Systemic lupus erythematosus
Artificial Intelligence
International Classification of Diseases
0202 electrical engineering, electronic engineering, information engineering
Electronic Health Records
Humans
Lupus Erythematosus, Systemic
Word2vec
030212 general & internal medicine
Artificial neural network
Receiver operating characteristic
business.industry
Health Policy
Natural language processing
Unified Medical Language System
Pattern recognition
Bayes Theorem
Computer Science Applications
Random forest
Support vector machine
Neural Networks, Computer
Algorithms
Research Article
Source: BMC Medical Informatics and Decision Making
ISSN: 1472-6947
Description:
Background: Identifying patients who meet certain clinical criteria by manually reviewing doctors' notes is a daunting task, given the massive volume of text notes in electronic health records (EHR). This task can be automated using text classifiers based on Natural Language Processing (NLP) techniques along with pattern recognition machine learning (ML) algorithms. The aim of this research is to evaluate the performance of traditional classifiers for identifying patients with Systemic Lupus Erythematosus (SLE) in comparison with a newer Bayesian word vector method.
Methods: We obtained clinical notes for patients with an SLE diagnosis, along with controls, from the Rheumatology Clinic (662 total patients). Sparse bag-of-words (BOWs) and Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) matrices were produced using NLP pipelines. These matrices were subjected to several different NLP classifiers: neural networks, random forests, naïve Bayes, support vector machines, and Word2Vec inversion, a Bayesian inversion method. Performance was measured by calculating accuracy and area under the Receiver Operating Characteristic (ROC) curve (AUC) of a cross-validated (CV) set and a separate testing set.
Results: As a baseline, the ICD-9 billing codes achieved an accuracy of 90.00% with an AUC of 0.900. The shallow neural network with CUIs achieved 92.10% with an AUC of 0.970, the random forest with BOWs 95.25% with an AUC of 0.994, the random forest with CUIs 95.00% with an AUC of 0.979, and the Word2Vec inversion 90.03% with an AUC of 0.905.
Conclusions: Our results suggest that a shallow neural network with CUIs and random forests with both CUIs and BOWs are the best classifiers for this lupus phenotyping task. The Word2Vec inversion method did not significantly outperform the ICD-9 code classification, but yielded promising results. This method does not require explicit features and is more adaptable to non-binary classification tasks. The Word2Vec inversion is hypothesized to become more powerful with access to more data. For now, therefore, shallow neural networks and random forests are the preferred classifiers.
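The two performance measures named in the Methods, accuracy and AUC, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names, the 0.5 decision threshold, and the six toy labels and scores are all assumptions made for illustration and are not taken from the study data. AUC is computed here from its pairwise interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of cases whose thresholded score matches the true label.

    The 0.5 threshold is an assumption for this sketch.
    """
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for p, t in zip(predictions, y_true) if p == t)
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts how often a positive case outscores a negative case;
    tied scores count as half a win.
    """
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = SLE case, 0 = control (not from the study).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

print(f"accuracy = {accuracy(labels, scores):.3f}")
print(f"AUC      = {roc_auc(labels, scores):.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason a classifier can show a modest accuracy alongside a high AUC.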
Database: OpenAIRE