Data mining methods for classification of Medium-Chain Acyl-CoA dehydrogenase deficiency (MCADD) using non-derivatized tandem MS neonatal screening data

Authors:	Toon Van Genechten, François Eyskens, Paul Vanden Broucke, Geert Smits, Sam Proesmans, Viviane Van Hoof, Seppe vanden Broucke, Kristien Wouters, Elke Smits, Tim Van den Bulcke
Language:	English
Subject:	blood spots; newborns; Computer science; diagnosis; Population; MCADD; Logistic regression; Feature selection; Health Informatics; Acyl-CoA Dehydrogenase; Lipid Metabolism, Inborn Errors; Neonatal Screening; Text mining; Belgium; Artificial Intelligence; Tandem Mass Spectrometry; Humans; Medium-Chain Acyl-CoA dehydrogenase; Data mining; Newborn screening; Infant, Newborn; rare diseases; acid oxidation disorders; mass-spectrometry; Computer Science Applications; Test set; Hyperparameter optimization; Human medicine
Source:	Journal of Biomedical Informatics
ISSN:	1532-0464
DOI:	10.1016/j.jbi.2010.12.001
Description:	Newborn screening programs that use tandem mass spectrometry to detect severe metabolic disorders are widely used. Medium-Chain Acyl-CoA dehydrogenase deficiency (MCADD) is the most prevalent mitochondrial fatty acid oxidation defect (1:15,000 newborns), and it has been proven that early detection of this metabolic disease decreases mortality and improves the outcome. In previous studies, data mining methods applied to derivatized tandem MS datasets have shown high classification accuracies. However, no machine learning methods have yet been applied to datasets based on non-derivatized screening methods. A dataset of 44,159 blood samples was collected using a non-derivatized screening method as part of systematic newborn screening by the PCMA screening center (Belgium). Twelve MCADD cases were present in this partially MCADD-enriched dataset. We extended three data mining methods, namely C4.5 decision trees, logistic regression and ridge logistic regression, with a parameter and threshold optimization method and evaluated their applicability as a diagnostic support tool. Within a stratified cross-validation setting, a grid search was performed for each model over a wide range of model parameters, included variables and classification thresholds. The best performing model used ridge logistic regression and achieved a sensitivity of 100%, a specificity of 99.987% and a positive predictive value of 32% (recalibrated for a real population), obtained in a stratified cross-validation setting. These results were further validated on an independent test set. Using a method that combines ridge logistic regression with variable selection and threshold optimization, a significantly improved performance was achieved compared to the current state of the art for derivatized data, while retaining more interpretability and requiring fewer variables. The results indicate the potential value of data mining methods as a diagnostic support tool.
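Note:	The abstract reports a positive predictive value "recalibrated for a real population" but does not give the formula. A minimal sketch of how such a recalibration is commonly done with Bayes' rule follows, assuming the stated sensitivity, specificity and the 1:15,000 prevalence; the function name `positive_predictive_value` is hypothetical, and the exact prevalence and unrounded specificity the authors used are not stated in this record, so the result only approximates the reported 32%.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """PPV via Bayes' rule: P(disease | positive test)."""
    true_pos = sensitivity * prevalence                   # diseased and flagged
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # healthy but flagged
    return true_pos / (true_pos + false_pos)


# Figures taken from the abstract.
SENSITIVITY = 1.0
SPECIFICITY = 0.99987

# Assumption: the "real population" prevalence is the 1:15,000 quoted above.
ppv_population = positive_predictive_value(SENSITIVITY, SPECIFICITY, 1 / 15_000)

# For contrast: prevalence inside the enriched dataset (12 cases in 44,159 samples).
ppv_enriched = positive_predictive_value(SENSITIVITY, SPECIFICITY, 12 / 44_159)

print(f"PPV at population prevalence: {ppv_population:.2f}")
print(f"PPV at enriched-dataset prevalence: {ppv_enriched:.2f}")
```

Under these assumptions the population-level PPV comes out at roughly one in three, in the same range as the reported 32%, while the enriched dataset would give a much higher value. This is why a PPV measured on a case-enriched dataset has to be recalibrated before it is quoted for routine screening.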
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::833c51af3c6a476371d938f6bee791df