Using ICD-9 diagnostic codes for external validation of topic models derived from primary care electronic medical record clinical text data
Author: | Christopher Meaney, Michael Escobar, Therese A Stukel, Peter C Austin, Sumeet Kalia, Babak Aliarzadeh, Rahim Moineddin, Michelle Greiver |
---|---|
Publication year: | 2023 |
Subject: | |
Source: | Health Informatics Journal. 29:146045822211156 |
ISSN: | 1741-2811 1460-4582 |
DOI: | 10.1177/14604582221115667 |
Description: | Background/Objectives: Unsupervised topic models are often used to improve understanding of large unstructured clinical text datasets. In this study we investigated how ICD-9 diagnostic codes, collected alongside clinical text data, could be used to establish the concurrent, convergent, and discriminant validity of learned topic models. Design/Setting: Retrospective open cohort design. Data were collected from primary care clinics located in Toronto, Canada, from 01/01/2017 through 12/31/2020. Methods: We fit a non-negative matrix factorization topic model, with K = 50 latent topics/themes, to our input document term matrix (DTM). We estimated the magnitude of association between each Boolean-valued ICD-9 diagnostic code and each continuous latent topical vector. We identified the ICD-9 diagnostic codes most strongly associated with each latent topical vector and qualitatively interpreted how these codes could be used for external validation of the learned topic model. Results: The DTM consisted of 382,666 documents and 2,210 words/tokens. We correlated concurrently assigned ICD-9 diagnostic codes with learned topical vectors, and observed semantic agreement for a subset of latent constructs (e.g. conditions of the breast, disorders of the female genital tract, respiratory disease, viral infection, eye/ear/nose/throat conditions, conditions of the urinary system, and dermatological conditions). Conclusions: When fitting topic models to clinical text corpora, researchers can leverage contemporaneously collected electronic medical record data to investigate the external validity of fitted latent variable models. |
Database: | OpenAIRE |
External link: |
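
The Methods summarized in the description (fit a non-negative matrix factorization to a document term matrix, then measure the association between each Boolean diagnostic code and each continuous topic vector) can be illustrated with a small sketch. This is not the authors' implementation. It assumes Lee-Seung multiplicative updates for the factorization, takes the "topical vector" to be the per-document topic weights (the columns of `W`), and uses the Pearson (point-biserial) correlation as the association measure; the abstract specifies none of these. The toy matrix, the labels `code_A` and `code_B`, and K = 2 are hypothetical stand-ins for the real data and the study's K = 50.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate V (docs x terms) by W @ H with W, H >= 0.

    Uses Lee-Seung multiplicative updates for the squared-error objective
    (an assumed fitting method, chosen here for brevity).
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H

def pearson(x, y):
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

# Hypothetical toy document term matrix: 6 documents x 4 terms.
V = [[3, 2, 0, 0], [2, 3, 0, 0], [4, 1, 0, 0],
     [0, 0, 2, 3], [0, 0, 3, 2], [0, 0, 1, 4]]

# Hypothetical Boolean code indicators, one value per document.
codes = {"code_A": [1, 1, 1, 0, 0, 0], "code_B": [0, 0, 0, 1, 1, 1]}

W, H = nmf(V, k=2)
topic_columns = transpose(W)  # one vector of document weights per topic

# Association of every code with every topic.
assoc = {name: [pearson(flags, topic) for topic in topic_columns]
         for name, flags in codes.items()}

for name, rs in assoc.items():
    best = max(range(len(rs)), key=lambda t: abs(rs[t]))
    print(name, [round(r, 3) for r in rs], "strongest topic:", best)
```

Ranking codes by the absolute correlation within each topic gives the kind of "most strongly associated codes" list that the description says was then interpreted qualitatively. At the scale reported (382,666 documents and 2,210 tokens), a sparse-matrix library would be needed in place of these list-based helpers.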