On Distributed Fuzzy Decision Trees for Big Data
Authors: | Witold Pedrycz, Armando Segatori, Francesco Marcelloni |
---|---|
Publication year: | 2018 |
Subject: |
Fuzzy classification
Computer science; Decision tree; 02 engineering and technology; Apache Spark; big data; fuzzy decision trees (FDTs); fuzzy discretizer; fuzzy entropy; fuzzy partitioning; MapReduce; Machine learning; computer.software_genre; Fuzzy logic; Artificial intelligence; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Fuzzy rule; business.industry; Applied Mathematics; Tree (data structure); Computational Theory and Mathematics; Information Fuzzy Networks; Control and Systems Engineering; Scalability; Fuzzy set operations; 020201 artificial intelligence & image processing; Data mining |
Source: | IEEE Transactions on Fuzzy Systems. 26:174-192 |
ISSN: | 1941-0034 1063-6706 |
DOI: | 10.1109/tfuzz.2016.2646746 |
Description: | Fuzzy decision trees (FDTs) have been shown to be an effective solution in the framework of fuzzy classification. The approaches proposed so far for FDT learning, however, have generally neglected time and space requirements. In this paper, we propose a distributed FDT learning scheme shaped according to the MapReduce programming model for generating both binary and multiway FDTs from big data. The scheme relies on a novel distributed fuzzy discretizer that generates a strong fuzzy partition for each continuous attribute based on fuzzy information entropy. The fuzzy partitions are then used as input to the FDT learning algorithm, which employs fuzzy information gain for selecting the attributes at the decision nodes. We have implemented the FDT learning scheme on the Apache Spark framework. We have used ten real-world, publicly available big datasets to evaluate the behavior of the scheme along three dimensions: 1) performance in terms of classification accuracy, model complexity, and execution time; 2) scalability when varying the number of computing units; and 3) ability to efficiently accommodate an increasing dataset size. We have demonstrated that the proposed scheme is suitable for managing big datasets even with modest commodity hardware. Finally, we have used the distributed decision tree learning algorithm implemented in the MLlib library and the Chi-FRBCS-BigData algorithm, a MapReduce distributed fuzzy rule-based classification system, for comparative analysis. |
Database: | OpenAIRE |
External link: |
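
Illustrative note: the description mentions strong fuzzy partitions, fuzzy information entropy, and fuzzy information gain without defining them. The following is a minimal single-machine sketch of these notions using common textbook definitions (triangular membership functions, fuzzy cardinality as the sum of membership degrees, product of memberships along a path). It is not the authors' distributed implementation, and every function name (`triangular_partition`, `fuzzy_entropy`, `fuzzy_information_gain`) is a hypothetical choice made for this example.

```python
import math


def triangular_partition(cores):
    """Build a strong triangular fuzzy partition (assumed shape, for illustration).

    Returns a function mapping x to its membership degrees in the fuzzy sets
    whose peaks are the distinct sorted `cores`. For any x the degrees sum to 1.
    """
    cores = sorted(set(cores))
    if len(cores) < 2:
        raise ValueError("need at least two distinct cores")

    def memberships(x):
        # Values outside the universe fully belong to the nearest edge set.
        x = min(max(x, cores[0]), cores[-1])
        mu = [0.0] * len(cores)
        for i in range(len(cores) - 1):
            left, right = cores[i], cores[i + 1]
            if left <= x <= right:
                t = (x - left) / (right - left)
                mu[i] = 1.0 - t      # descending side of set i
                mu[i + 1] = t        # ascending side of set i + 1
                break
        return mu

    return memberships


def fuzzy_entropy(weights, labels):
    """Entropy of the class distribution, weighting each sample by its membership."""
    total = sum(weights)
    if total == 0:
        return 0.0
    mass = {}
    for w, y in zip(weights, labels):
        mass[y] = mass.get(y, 0.0) + w
    return -sum((m / total) * math.log2(m / total) for m in mass.values() if m > 0)


def fuzzy_information_gain(values, labels, cores, parent_weights=None):
    """Gain obtained by splitting one continuous attribute on a fuzzy partition."""
    if parent_weights is None:
        parent_weights = [1.0] * len(values)
    memberships = triangular_partition(cores)
    n_sets = len(set(cores))
    child_weights = [[] for _ in range(n_sets)]
    for x, w in zip(values, parent_weights):
        for j, mu in enumerate(memberships(x)):
            child_weights[j].append(w * mu)  # product of degrees along the path
    child_totals = [sum(ws) for ws in child_weights]
    grand_total = sum(child_totals)
    if grand_total == 0:
        return 0.0
    weighted = sum(
        (tot / grand_total) * fuzzy_entropy(ws, labels)
        for ws, tot in zip(child_weights, child_totals)
        if tot > 0
    )
    return fuzzy_entropy(parent_weights, labels) - weighted
```

Under these assumptions, an attribute selection step would compute `fuzzy_information_gain` for each candidate attribute and keep the one with the largest value. A MapReduce version would presumably aggregate the per-class membership sums in parallel, but that part is not sketched here.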