A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data
Author: | Chungsoo Kim, Jimyung Park, Martijn J. Schuemie, Clair Blacketer, Cynthia Yang, Sergio Fernandez-Bertolin, Seng Chan You, Jenna Reps, S Khalid, Rae Woong Park, Anthony G. Sena, Marc A. Suchard, Peter R. Rijnbeek, Talita Duarte-Salles |
---|---|
Contributors: | Medical Informatics |
Language: | English |
Publication year: | 2021 |
Subject: |
Artificial Intelligence and Image Processing; COVID-19; SARS-CoV-2; Pandemics; Humans; Computer science; Pipeline (computing); Observational health data; Biomedical Engineering; Bioengineering; Health Informatics; Medical Informatics; Machine learning; Artificial intelligence; OHDSI; AdaBoost; Decision tree; Random forest; Gradient boosting; Logistic Models; Electrical and Electronic Engineering; Data quality control; Data harmonization; Prediction modeling; Predictive modelling; Risk prediction; Standardized approach; Phenotypes; Distributed data network; Analytics; Computer Science Applications; Software; Networking and Information Technology R&D (NITRD); Generic health relevance; Article |
Source: | Computer Methods and Programs in Biomedicine, 211:106394. Elsevier Ireland Ltd |
ISSN: | 1872-7565, 0169-2607 |
Description: | Author(s): Khalid, Sara; Yang, Cynthia; Blacketer, Clair; Duarte-Salles, Talita; Fernandez-Bertolin, Sergio; Kim, Chungsoo; Park, Rae Woong; Park, Jimyung; Schuemie, Martijn J; Sena, Anthony G; Suchard, Marc A; You, Seng Chan; Rijnbeek, Peter R; Reps, Jenna M | Abstract: Background and objective: In response to the ongoing COVID-19 pandemic, several prediction models were rapidly developed with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models have been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code). Methods: We show step by step how to implement the analytics pipeline for the question: 'In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?'. We develop models using six different machine learning methods in a USA claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA. Results: Our open-source software tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation.
When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated. Conclusion: Our results show that following the OHDSI analytics pipeline for patient-level prediction modeling can enable rapid development of reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers around the world. |
Database: | OpenAIRE |
External link: |