Text mining of verbal autopsy narratives to extract mortality causes and most prevalent diseases using natural language processing.

Authors: Mapundu MT; Department of Epidemiology and Biostatistics, School of Public Health, University of the Witwatersrand, Johannesburg, South Africa., Kabudula CW; Department of Epidemiology and Biostatistics, School of Public Health, University of the Witwatersrand, Johannesburg, South Africa.; MRC/Wits Rural Public Health and Health Transitions Research Unit (Agincourt), Johannesburg, South Africa., Musenge E; Department of Epidemiology and Biostatistics, School of Public Health, University of the Witwatersrand, Johannesburg, South Africa., Olago V; National Health Laboratory Service (NHLS), National Cancer Registry, Johannesburg, South Africa., Celik T; Wits Institute of Data Science, University of The Witwatersrand, Johannesburg, South Africa.; School of Electrical and Information Engineering, University of The Witwatersrand, Johannesburg, South Africa.
Language: English
Source: PloS one [PLoS One] 2024 Sep 19; Vol. 19 (9), pp. e0308452. Date of Electronic Publication: 2024 Sep 19 (Print Publication: 2024).
DOI: 10.1371/journal.pone.0308452
Abstract: Verbal autopsy (VA) narratives play a crucial role in understanding and documenting causes of mortality, especially in regions lacking robust medical infrastructure. In this study, we propose an approach that uses text mining to extract mortality causes and identify prevalent diseases from VA narratives, so as to better understand the underlying health issues leading to mortality. Our methodology integrates n-gram-based language processing, Latent Dirichlet Allocation (LDA), and BERTopic, offering a multi-faceted analysis to enhance the accuracy and depth of information extraction. This is a retrospective study based on secondary data analysis. We used data from the Agincourt Health and Demographic Surveillance Site (HDSS), comprising 16,338 observations collected between 1993 and 2015. Our text mining steps were data acquisition, pre-processing, feature extraction, topic segmentation, and knowledge discovery. The results suggest that the HDSS population may have died from mortality causes such as vomiting, chest/stomach pain, fever, coughing, loss of weight, low energy, and headache. Additionally, we found that the most prevalent diseases included human immunodeficiency virus (HIV), tuberculosis (TB), diarrhoea, cancer, neurological disorders, malaria, diabetes, high blood pressure, chronic ailments (kidney, heart, lung, liver), and maternal and accident-related deaths. This study is relevant in that it provides valuable insights into mortality causes and the most prevalent diseases using novel text mining approaches. These results can be integrated into the diagnosis pipeline to ease human annotation and interpretation. As such, this will help with effective, informed intervention programmes that can improve primary health care systems and chronic care delivery, thus increasing life expectancy.
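To make the n-gram step named in the abstract concrete, the following is a minimal sketch of counting frequent bigrams across narratives. It is not the authors' implementation: the stopword list, the function names (`tokenize`, `ngrams`, `top_ngrams`), and the three sample narratives are hypothetical and invented purely for illustration, and only the Python standard library is used.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "and", "of", "had", "was", "he", "she",
             "for", "with", "to", "in"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(narratives, n=2, k=5):
    """Count n-grams over all narratives and return the k most frequent."""
    counts = Counter()
    for text in narratives:
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(k)


# Invented example narratives; these are NOT taken from the Agincourt data.
narratives = [
    "She had chest pain and was coughing for weeks.",
    "He had chest pain, fever and loss of weight.",
    "Coughing with fever and chest pain.",
]

print(top_ngrams(narratives, n=2, k=1))
```

Because stopwords are removed before the n-grams are formed, a phrase such as "loss of weight" yields the bigram "loss weight". Whether to filter before or after forming n-grams is a design choice that this sketch makes arbitrarily.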
Competing Interests: The authors have declared that no competing interests exist.
(Copyright: © 2024 Mapundu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
Database: MEDLINE