Accounting for diverse feature-types improves patient stratification on tabular clinical datasets

Authors: Saptarshi Bej, Chaithra Umesh, Manjunath Mahendra, Kristian Schultz, Jit Sarkar, Olaf Wolkenhauer
Language: English
Year of publication: 2023
Subject:
Source: Machine Learning with Applications, Vol 14, Iss , Pp 100490- (2023)
Document type: article
ISSN: 2666-8270
DOI: 10.1016/j.mlwa.2023.100490
Description: Tabular Clinical and Biomedical Routine Data (CBRD) contains diverse feature types. Recent research shows that the conventional application of Uniform Manifold Approximation and Projection (UMAP) to extract clusters from the low-dimensional embedding can prove ineffective due to the diverse feature types in such datasets. The Feature-type Distributed Clustering (FDC) workflow accounts for these diverse feature types, resulting in a more informative low-dimensional embedding. However, a rigorous assessment of the FDC algorithm has been missing so far. In this work, we conducted comprehensive benchmarking experiments to compare the quality of the cluster distributions and low-dimensional embeddings generated by FDC against those generated by UMAP, using standard objective measures: Silhouette score, Dunn index, and ANOVA. Our results confirm that FDC can indeed be the better choice for embedding tabular data with diverse feature types in low dimensions and thereby extracting clusters from such an embedding. In addition, we provide a rationale behind the choice of metrics proposed in the FDC workflow. Moreover, we point out some problems with the original Canberra metric used to reduce ordinal features in the FDC workflow and provide a solution in the form of a modified version of the Canberra metric. Using seven datasets from the medical domain for benchmarking, we demonstrate that FDC leads to improved patient stratification.
Database: Directory of Open Access Journals
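
The abstract names the Canberra metric and the Dunn index without defining them. As a reading aid, the following is a minimal Python sketch of their textbook definitions. It is not the authors' code and it does not reproduce the modified Canberra metric proposed in the article, which this record does not specify. The function names `canberra` and `dunn_index`, the convention that a 0/0 term contributes nothing, and the toy data are assumptions made for illustration only.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def canberra(x, y):
    """Textbook Canberra distance: sum of |a - b| / (|a| + |b|) over coordinates.

    Assumption: a coordinate where both values are 0 contributes 0
    (one common convention for the 0/0 case).
    """
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def dunn_index(points, labels, metric=dist):
    """Dunn index: smallest between-cluster distance divided by the
    largest within-cluster diameter. Higher values suggest compact,
    well-separated clusters. Requires at least two clusters.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Largest distance between two points that share a cluster.
    max_diameter = max(
        (metric(p, q)
         for members in clusters.values()
         for p, q in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between two points from different clusters.
    min_separation = min(
        metric(p, q)
        for a, b in combinations(clusters.values(), 2)
        for p in a
        for q in b
    )
    if max_diameter == 0:
        return float("inf")  # every cluster is a single point
    return min_separation / max_diameter


# Toy data (made up): two tight groups in a 2-D embedding.
embedding = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
cluster_labels = [0, 0, 1, 1]
print(dunn_index(embedding, cluster_labels))
print(canberra([1, 2, 0], [3, 2, 0]))
```

Passing `metric=canberra` to `dunn_index` evaluates the same ratio under the Canberra distance instead of the Euclidean one.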