Popis: |
Current important challenges in data mining research are triggered by the need to address various particularities of real-world problems, such as imbalanced data and error cost distributions. This paper presents Distributed Evolutionary Cost-Sensitive Balancing, a distributed methodology for dealing with imbalanced data and -- if necessary -- cost distributions. The method employs a genetic algorithm to search for an optimal cost matrix and base classifier settings, which are then employed by a cost-sensitive classifier, wrapped around the base classifier. Individual fitness computation is the most intensive task in the algorithm, but it also presents a high parallelization potential. Two different parallelization alternatives have been explored: a computation-driven approach, and a data-driven approach. Both have been developed within the Apache Watchmaker framework and deployed on Hadoop-based infrastructures. Experimental evaluations performed up to this point have indicated that the computation-driven approach achieves a good classification performance, but does not reduce the running time significantly, the data-driven approach reduces the running time for slow algorithms, such as the kNN and the SVM, while still yielding important performance improvements. |