Augmenting small biomedical datasets using generative AI methods based on self-organizing neural networks.

Authors: Ultsch A; DataBionics Research Group, University of Marburg, Hans-Meerwein-Straße, 35032 Marburg, Germany. Lötsch J; Institute of Clinical Pharmacology, Goethe-University, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany; Faculty of Medicine, University of Helsinki, Haartmaninkatu 8, 00014 Helsinki, Finland; Fraunhofer Institute for Translational Medicine and Pharmacology (ITMP), Theodor-Stern-Kai 7, 60596 Frankfurt am Main, Germany.
Language: English
Source: Briefings in bioinformatics [Brief Bioinform] 2024 Nov 22; Vol. 26 (1).
DOI: 10.1093/bib/bbae640
Abstract: Small sample sizes in biomedical research often lead to poor reproducibility and to challenges in translating findings into clinical applications. This problem stems from limited study resources, rare diseases, ethical considerations in animal studies, and costly expert diagnosis, among other causes. To help address this problem, we propose a novel generative algorithm based on self-organizing maps (SOMs) to computationally increase sample sizes. The proposed unsupervised generative algorithm uses neural networks to detect inherent structure even in small multivariate datasets, distinguishing between sparse "void" and dense "cloud" regions. Using emergent SOMs (ESOMs), the algorithm adapts to high-dimensional data structures and generates k new points for each original data point by randomly selecting positions within an adapted hypersphere, with distances based on valid neighborhood probabilities. Experiments on artificial and biomedical (omics) datasets show that the generated data preserve the original structure without introducing artifacts. Random forests and support vector machines cannot distinguish between generated and original data, and the variables of the original and generated datasets are not statistically different. The method successfully augments small group sizes, such as transcriptomics data from a rare form of leukemia and lipidomics data from arthritis research. The novel ESOM-based generative algorithm presents a promising solution for enhancing sample sizes in small or rare-case datasets, even when limited training data are available. This approach can address challenges associated with small sample sizes in biomedical research, offering a tool for improving the reliability and robustness of scientific findings in this field. Availability: R library "Umatrix" (https://cran.r-project.org/package=Umatrix).
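Illustrative sketch (not part of the original record): the abstract describes generating k new points per original data point by sampling inside an adapted hypersphere. The Python code below shows only that generic sampling step, under stated assumptions. It is not the authors' ESOM-based algorithm and does not use the Umatrix package. The function names, the `scale` parameter, and the choice of radius (a fraction of each point's nearest-neighbor distance, standing in for the ESOM-derived radius and neighborhood probabilities) are all hypothetical.

```python
import math
import random


def sample_in_hypersphere(center, radius, rng):
    """Draw one point uniformly from the ball of the given radius around center."""
    d = len(center)
    # A normalized Gaussian vector is a uniformly random direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in direction)) or 1.0
    # Scaling by u**(1/d) makes the point uniform in volume, not in radius.
    r = radius * rng.random() ** (1.0 / d)
    return [c + r * x / norm for c, x in zip(center, direction)]


def nearest_neighbor_distance(data, i):
    """Euclidean distance from data[i] to its closest other point."""
    return min(math.dist(data[i], q) for j, q in enumerate(data) if j != i)


def augment(data, k, scale=0.5, seed=0):
    """Generate k new points around every original point.

    ASSUMPTION: the radius is scale * nearest-neighbor distance. The paper
    instead adapts the hypersphere using a trained ESOM, which this sketch
    does not reproduce.
    """
    rng = random.Random(seed)
    generated = []
    for i, point in enumerate(data):
        radius = scale * nearest_neighbor_distance(data, i)
        generated.extend(sample_in_hypersphere(point, radius, rng) for _ in range(k))
    return generated
```

Calling `augment(data, k=4)` on a list of n numeric vectors returns 4n generated vectors, each lying within the local radius of the original point it was derived from.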
(© The Author(s) 2024. Published by Oxford University Press.)
Database: MEDLINE