A Distributed Big Data Discretization Algorithm Under Spark
Authors: | Jing Hua Zhu, Xia Jie Zhang, Yeung Chan |
---|---|
Publication year: | 2019 |
Subject: | |
Source: | Big Data ISBN: 9789811518980 Big Data (CCF) |
DOI: | 10.1007/978-981-15-1899-7_8 |
Description: | Data discretization is an important step of data preprocessing in data mining; it can improve data quality and thereby the accuracy and time performance of the subsequent learning process. In the era of big data, traditional discretization methods are no longer applicable, and distributed discretization algorithms need to be designed. Hellinger-entropy, an important distance measure in information theory, is context-sensitive and feature-sensitive and therefore carries abundant useful information. In this paper we therefore implement a Hellinger-entropy based distributed discretization algorithm under Apache Spark. We first measure the divergence of discrete intervals using Hellinger-entropy. Then we select the top-k boundary points according to the information provided by the divergence values of the discrete intervals. Finally, we divide the range of the continuous variable into k discrete intervals. We verify the performance of the distributed discretization as a preprocessing step for random forest, Bayes, and multilayer perceptron classification on real sensor big data sets. Experimental results show that the time performance and classification accuracy of the proposed Hellinger-entropy based distributed discretization algorithm are better than those of existing algorithms. |
Database: | OpenAIRE |
External link: |
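
The three steps named in the description (score candidate boundaries by a Hellinger-based divergence, keep the highest-scoring ones, cut the range into k intervals) can be illustrated with a minimal single-machine sketch. It is not the paper's algorithm: the record does not define "Hellinger-entropy", so the sketch assumes the plain Hellinger distance between the class distributions on either side of a candidate boundary, and it assumes that k intervals are obtained from k - 1 interior cut points. The function names `hellinger`, `top_k_cut_points`, and `discretize` are hypothetical, and no Spark distribution is attempted.

```python
import bisect
import math
from collections import Counter


def class_distribution(labels, classes):
    """Relative frequency of each class in `labels`, ordered as `classes`."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = same, 1 = disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def top_k_cut_points(values, labels, k):
    """Pick k - 1 cut points (an assumption) so that the range splits into k intervals.

    Each midpoint between two distinct neighbouring values is a candidate.
    Its score is the Hellinger distance between the class distribution of
    the samples to its left and the one of the samples to its right.
    """
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    scored = []
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no boundary between identical values
        left = class_distribution([lab for _, lab in pairs[:i]], classes)
        right = class_distribution([lab for _, lab in pairs[i:]], classes)
        scored.append((hellinger(left, right), (low + high) / 2))
    scored.sort(key=lambda item: item[0], reverse=True)
    return sorted(cut for _, cut in scored[: k - 1])


def discretize(value, cuts):
    """Index of the interval that `value` falls into, given sorted cut points."""
    return bisect.bisect_right(cuts, value)


values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
cuts = top_k_cut_points(values, labels, k=2)
bins = [discretize(v, cuts) for v in values]
```

In this toy data the two classes are separated by the gap between 3 and 10, so the single selected cut point lies in that gap. A distributed version would have to compute the per-boundary class counts across partitions, which this sketch does not show.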