KGA: integrating KPCA and GAN for microbial data augmentation.

Authors: Wen, Liu-Ying; Zhang, Xiao-Min; Li, Qing-Feng; Min, Fan
Source: International Journal of Machine Learning & Cybernetics; Apr 2023, Vol. 14, Issue 4, pp. 1427-1444 (18 pages)
Abstract: The data used for microbial-based disease diagnosis are characterized by small sample sizes, imbalanced categories, high dimensionality, and strong sparsity. These properties pose challenges to machine learning algorithms that aim to achieve good classification performance. In this paper, we propose a two-stage data augmentation method to enhance training data quality. The first stage is feature transformation. We design a KPCA-based method to map microbial data to a low-rank feature space, resulting in a cleaner and more efficient data representation. This processing step addresses the high dimensionality and strong sparsity of microbial data. The second stage is data augmentation. New synthetic data are obtained by augmenting the positive samples with a GAN. The misclassification cost is used to control the ratio of positive to negative samples in the new data. The combination of the augmented data with the original data constitutes a cost-sensitive dataset, which can increase sample diversity while addressing the imbalance problem. This is more reasonable than traditional sampling methods for resolving class imbalance. We compare the new method with four popular data augmentation algorithms on 12 imbalanced datasets. The experimental results demonstrate that (1) the samples augmented by the proposed algorithm are more diverse than those generated by the compared resampling methods, such as SMOTE_ENN, and (2) the proposed algorithm not only achieves the lowest total misclassification cost but also outperforms the other methods in terms of the F2 and G-mean metrics. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
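
The abstract reports results in terms of total misclassification cost, F2, and G-mean. As a minimal sketch of how these three quantities are commonly computed for a binary classifier, the Python below uses the standard textbook definitions. The function names, the toy labels, and the cost values (5 for a missed positive, 1 for a false alarm) are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for binary labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1  # positive sample predicted as negative
        elif p == positive:
            fp += 1  # negative sample predicted as positive
        else:
            tn += 1
    return tp, fp, tn, fn


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score; beta=2 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def g_mean(tp, fp, tn, fn):
    """Geometric mean of the true positive rate and true negative rate."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


def total_cost(fp, fn, cost_fn=5.0, cost_fp=1.0):
    """Total misclassification cost; the cost values are assumed, not from the paper."""
    return cost_fn * fn + cost_fp * fp


# Toy imbalanced example: 3 positives, 7 negatives (hypothetical labels).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
f2 = f_beta(tp, fp, fn, beta=2.0)
gm = g_mean(tp, fp, tn, fn)
cost = total_cost(fp, fn)
print(f"F2={f2:.3f}  G-mean={gm:.3f}  cost={cost}")
```

With asymmetric costs, a missed positive is penalized more than a false alarm, which is why a recall-weighted score such as F2 and a class-balanced score such as G-mean are natural companions to the cost figure on imbalanced data.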