Feature grouping-based parallel outlier mining of categorical data using spark

Authors: Xiao Qin, Yaling Xun, Jifu Zhang, Junli Li
Year of publication: 2019
Subject:
Source: Information Sciences. 504:1-19
ISSN: 0020-0255
Description: This paper proposes a feature-grouping-based parallel outlier mining method called POS for high-dimensional categorical datasets. Existing outlier mining methods are inadequate for datasets that are this voluminous and complex. We address this problem by proposing a parallel framework on the Spark platform for massive categorical data. POS is composed of two modules: parallel feature grouping and parallel outlier mining. Additionally, vertical transformation is used to improve the performance of POS. We implement POS on the Spark platform and evaluate it using synthetic and real-world datasets. Our experimental results confirm that POS is a promising and practical parallel algorithm for mining outliers in high-dimensional categorical datasets, because it achieves high performance in terms of extensibility and scalability.
Database: OpenAIRE