Feature grouping-based parallel outlier mining of categorical data using Spark
Author: | Xiao Qin, Yaling Xun, Jifu Zhang, Junli Li |
---|---|
Publication year: | 2019 |
Subject: |
Information Systems and Management;
Computer science; 05 social sciences; Parallel algorithm; 050301 education; 02 engineering and technology; computer.software_genre; Computer Science Applications; Theoretical Computer Science; ComputingMethodologies_PATTERNRECOGNITION; Transformation (function); Artificial Intelligence; Control and Systems Engineering; Outlier; Spark (mathematics); 0202 electrical engineering, electronic engineering, information engineering; Feature (machine learning); 020201 artificial intelligence & image processing; Data mining; 0503 education; Categorical variable; computer; Software |
Source: | Information Sciences. 504:1-19 |
ISSN: | 0020-0255 |
Description: | This paper proposes POS, a feature-grouping-based parallel outlier mining method for high-dimensional categorical datasets. Existing outlier mining methods are inadequate for datasets this voluminous and complex. We address this problem with a parallel framework built on the Spark platform for massive categorical data. POS consists of two modules: parallel feature grouping and parallel outlier mining. In addition, a vertical transformation is used to improve the performance of POS. We implement POS on the Spark platform and evaluate it on synthetic and real-world datasets. The experimental results confirm that POS is a promising and practical parallel algorithm for mining outliers in high-dimensional categorical datasets, achieving high performance in terms of extensibility and scalability. |
Database: | OpenAIRE |
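
The abstract mentions a vertical transformation but this record does not say how POS applies it. As a rough illustration only, the sketch below shows what a vertical layout of categorical data generally looks like: each (feature, value) pair maps to the set of row ids that contain it. The function names `vertical_transform` and `rarity_scores`, the toy rows, and the frequency-based score are all assumptions made for this example; they are not the paper's algorithm, and the sketch runs on a single machine rather than on Spark.

```python
from collections import defaultdict


def vertical_transform(rows):
    """Map each (feature index, value) pair to the set of row ids containing it."""
    vertical = defaultdict(set)
    for row_id, row in enumerate(rows):
        for feature, value in enumerate(row):
            vertical[(feature, value)].add(row_id)
    return dict(vertical)


def rarity_scores(rows):
    """Hypothetical score (not from the paper): for each row, sum over its
    features of (1 - relative frequency of that row's value)."""
    vertical = vertical_transform(rows)
    n = len(rows)
    return [
        sum(1 - len(vertical[(f, v)]) / n for f, v in enumerate(row))
        for row in rows
    ]


# Toy categorical dataset (made up for illustration): (colour, size)
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "large"),
]

vertical = vertical_transform(rows)
scores = rarity_scores(rows)
most_unusual = scores.index(max(scores))  # row with the rarest values
```

With the vertical layout, the frequency of any value is just the size of its row-id set, so the score never rescans the original rows. In a distributed setting those sets could plausibly be built per feature group in parallel, though how POS actually partitions the work is not stated here.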