A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data

Author:	Chungsoo Kim, Jimyung Park, Martijn J. Schuemie, Clair Blacketer, Cynthia Yang, Sergio Fernandez-Bertolin, Seng Chan You, Jenna Reps, S Khalid, Rae Woong Park, Anthony G. Sena, Marc A. Suchard, Peter R. Rijnbeek, Talita Duarte-Salles
Contributors:	Medical Informatics
Language:	English
Year of publication:	2021
Subject:	Artificial Intelligence and Image Processing; COVID-19; Computer science; Pipeline (computing); observational health data; Biomedical Engineering; Decision tree; Bioengineering; Health Informatics; Machine learning; Article; OHDSI; Humans; AdaBoost; Electrical and Electronic Engineering; Data quality control; Pandemics; prediction modeling; SARS-CoV-2; Standardized approach; Data harmonization; Risk prediction; Computer Science Applications; Random forest; Phenotypes; Distributed data network; Logistic Models; Networking and Information Technology R&D (NITRD); Analytics; Generic health relevance; Gradient boosting; Artificial intelligence; Medical Informatics; Software; Predictive modelling
Source:	Computer Methods and Programs in Biomedicine, 211:106394. Elsevier Ireland Ltd
ISSN:	1872-7565 0169-2607
Description:	Author(s): Khalid, Sara; Yang, Cynthia; Blacketer, Clair; Duarte-Salles, Talita; Fernandez-Bertolin, Sergio; Kim, Chungsoo; Park, Rae Woong; Park, Jimyung; Schuemie, Martijn J; Sena, Anthony G; Suchard, Marc A; You, Seng Chan; Rijnbeek, Peter R; Reps, Jenna M | Abstract: Background and objective: As a response to the ongoing COVID-19 pandemic, several prediction models in the existing literature were rapidly developed, with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models have been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code). Methods: We show step-by-step how to implement the analytics pipeline for the question: 'In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?'. We develop models using six different machine learning methods in a USA claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA. Results: Our open-source software tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated. Conclusion: Our results show that following the OHDSI analytics pipeline for patient-level prediction modeling can enable the rapid development of reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers from all around the world.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::579d6f3087b0512fe0e133288951debc https://doi.org/10.1016/j.cmpb.2021.106394