Popis: |
Introduction: The Ohio Department of Health (ODH) collects and maintains records from disease intervention specialist (DIS) investigations for all syphilis cases reported to the state, including exposed partners who tested negative for syphilis. The records contain both structured fields and unstructured free-text notes. We sought to apply natural language processing (NLP) methods to 2019 Ohio DIS syphilis records to (1) determine whether DIS notes contain novel characteristics, behaviors, or patterns not yet reported in the syphilis literature, and (2) explore whether NLP methods could be used to identify key topics in the unstructured notes. Methods: In Aim 1, we described the records and assessed the feasibility of using these data for NLP analyses. We explored two approaches to numerically represent the unstructured text: (1) TF-IDF (term frequency-inverse document frequency), which weights a word by how often it appears in a record, discounted by how common it is across all records, and (2) GloVe pretrained word embeddings, which assign numerical vectors to words to encode their semantic meaning. In Aim 2, we performed agglomerative clustering using the structured data and unstructured text (using TF-IDF weights), with cosine similarity as the distance metric, to explore patterns in the data. In Aim 3, we explored whether machine learning models could identify key topics in the unstructured text. To do this, we identified 21 key topics in the notes fields potentially relevant for syphilis transmission and DIS investigations. We manually coded these records to create “gold standard” labels for each topic (0=topic not present, 1=topic present), then trained machine learning models to identify the topics.
Specifically, we explored three statistical models (naïve Bayes, support vector machine [SVM], and logistic regression) using TF-IDF, and one neural network model (long short-term memory [LSTM] model) using GloVe. Results: The cluster analysis (n=1,996) yielded 7 clusters of syphilis cases. The average internal similarities were much higher than the average external similarities, indicating that the clusters were well-formed. For three clusters, the factors underlying the clusters related to patterns of missing data. For the remaining four clusters, the factors underlying the clusters were sexual behaviors and sexual partnerships: one consisted of individuals who reported oral sex with male or anonymous partners and sex while intoxicated, one consisted mainly of heterosexual men, and the other two were formed from combinations of variables about sexual partnerships and demographics. In the topic prediction analysis (n=1,987), for most topics, the LSTM model performed the best overall, and the SVM model performed the best among the statistical classifiers. For example, the LSTM model predicted the topic “substance use” with high accuracy (97%), sensitivity (92%), and specificity (98%). No model performed well for rare topics. Conclusions: Our analysis resulted in clusters that did not reveal novel epidemiological information about syphilis risk factors and transmission. We also found that machine learning models are feasible for identifying topics in 2019 Ohio syphilis records. This project is a first step in applying NLP methods to DIS notes to make them more accessible for analysis by public health practitioners. |
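Illustrative sketch (not the authors' code): the abstract does not state which TF-IDF variant, tokenizer, or software was used, so the Python below assumes the textbook weighting tf × log(N/df), whitespace tokenization, and invented example notes. It shows how TF-IDF weights and cosine similarity can be computed for free-text notes, and how sensitivity and specificity follow from gold-standard 0/1 topic labels. All function names and sample data are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Assumption: plain whitespace tokenization and the unsmoothed textbook
    formula; the study's actual preprocessing is not described in the abstract.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # A term present in every document gets weight log(1) = 0.
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity for binary labels (1 = topic present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Invented notes for illustration only; these are not real DIS records.
notes = [
    "patient reports meth use",
    "patient reports meth use and anonymous partners",
    "partner tested negative",
]
vectors = tfidf_vectors(notes)
similar = cosine_similarity(vectors[0], vectors[1])
dissimilar = cosine_similarity(vectors[0], vectors[2])
```

In this sketch, notes that share weighted vocabulary score a higher cosine similarity than notes with no terms in common, which is the property that similarity-based clustering relies on. Rare topics have few records labeled 1, so their sensitivity estimates rest on very few positive examples.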