A scalable and transparent data pipeline for AI-enabled health data ecosystems

Author:	Tuncay Namli, Ali Anıl Sınacı, Suat Gönül, Cristina Ruiz Herguido, Patricia Garcia-Canadilla, Adriana Modrego Muñoz, Arnau Valls Esteve, Gökçe Banu Laleci Ertürkmen
Language:	English
Publication year:	2024
Subject:	artificial intelligence dataset harmonization transparency FHIR interoperability Medicine (General) R5-920
Source:	Frontiers in Medicine, Vol 11 (2024)
Document type:	article
ISSN:	2296-858X
DOI:	10.3389/fmed.2024.1393123
Description:	Introduction: Transparency and traceability are essential for establishing trustworthy artificial intelligence (AI). The lack of transparency in the data preparation process is a significant obstacle to developing reliable AI systems and can lead to problems with reproducibility, model debugging, bias and fairness, and regulatory compliance. We introduce a formal data preparation pipeline specification, with a focus on traceability, to improve upon the manual and error-prone data extraction processes used in AI and data analytics applications. Methods: We propose a declarative language to define the extraction of AI-ready datasets from health data adhering to a common data model, particularly data conforming to HL7 Fast Healthcare Interoperability Resources (FHIR). We use FHIR profiling to develop a common data model tailored to an AI use case, which enables the explicit declaration of the needed information such as phenotype and AI feature definitions. In our pipeline model, we convert complex, high-dimensional electronic health record data with irregular time series sampling to a flat structure by defining a target population, feature groups and final datasets. Our design considers the requirements of various AI use cases from different projects, which led to the implementation of many feature types exhibiting intricate temporal relations. Results: We implement a scalable, high-performance feature repository to execute the data preparation pipeline definitions. This software not only ensures reliable, fault-tolerant distributed processing to produce AI-ready datasets together with their metadata and accompanying statistics, but also serves as a pluggable component of a decision support application based on a trained AI model, automatically preparing the feature values of individual entities during online prediction. We deployed and tested the proposed methodology and its implementation in three different research projects. We present the developed FHIR profiles as a common data model, along with the feature group definitions and feature definitions within a data preparation pipeline, for training an AI model for “predicting complications after cardiac surgeries”. Discussion: Implementation across various pilot use cases demonstrated that our framework has the breadth and flexibility needed to define a diverse array of features, each tailored to specific temporal and contextual criteria.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/4df59a96395d4753bac99f349c86bd49
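
Note: The flattening step described in the abstract (a target population, feature definitions evaluated over time windows, and a final flat dataset) can be illustrated with a minimal Python sketch. This is not the authors' declarative language or feature repository; the names Observation, Feature and build_dataset, the observation codes, and all sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Callable, Dict, List, Optional


@dataclass
class Observation:
    """One irregularly sampled measurement (hypothetical, simplified record)."""
    patient_id: str
    code: str
    value: float
    time: datetime


@dataclass
class Feature:
    """A feature: aggregate one observation code over a look-back window."""
    name: str
    code: str
    window: timedelta
    aggregate: Callable[[List[float]], float]


def build_dataset(
    population: Dict[str, datetime],
    observations: List[Observation],
    features: List[Feature],
) -> List[Dict[str, Optional[object]]]:
    """Produce one flat row per patient in the target population.

    `population` maps each patient to a reference time (for example, the
    time of surgery). A feature is None when no observation falls inside
    its window.
    """
    rows = []
    for pid, ref_time in sorted(population.items()):
        row: Dict[str, Optional[object]] = {"patient_id": pid}
        for f in features:
            values = [
                o.value
                for o in observations
                if o.patient_id == pid
                and o.code == f.code
                and ref_time - f.window <= o.time <= ref_time
            ]
            row[f.name] = f.aggregate(values) if values else None
        rows.append(row)
    return rows


# Assumed sample data: two patients with a reference (surgery) time each.
population = {
    "p1": datetime(2024, 1, 10, 12, 0),
    "p2": datetime(2024, 1, 11, 9, 0),
}
observations = [
    Observation("p1", "heart-rate", 80.0, datetime(2024, 1, 10, 8, 0)),
    Observation("p1", "heart-rate", 90.0, datetime(2024, 1, 10, 11, 0)),
    Observation("p1", "heart-rate", 120.0, datetime(2024, 1, 8, 11, 0)),  # outside 24 h
    Observation("p2", "creatinine", 1.2, datetime(2024, 1, 11, 7, 0)),
]
features = [
    Feature("mean_hr_24h", "heart-rate", timedelta(hours=24), mean),
    Feature("max_creatinine_48h", "creatinine", timedelta(hours=48), max),
]

rows = build_dataset(population, observations, features)
for row in rows:
    print(row)
```

Keeping the population, the feature definitions and the aggregation logic as separate, explicit inputs is what makes such a pipeline traceable: each column of the flat dataset can be traced back to a named definition rather than to ad hoc extraction code.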