A Distributed Methodology for Imbalanced Classification Problems

Autor: Rodica Potolea, Andrei Alic, Adrian Bona, Camelia Lemnaru, Mihai Cuibus
Rok vydání: 2012
Předmět:
Zdroj: ISPDC
DOI: 10.1109/ispdc.2012.30
Popis: Current important challenges in data mining research are triggered by the need to address various particularities of real-world problems, such as imbalanced data and error cost distributions. This paper presents Distributed Evolutionary Cost-Sensitive Balancing, a distributed methodology for dealing with imbalanced data and -- if necessary -- cost distributions. The method employs a genetic algorithm to search for an optimal cost matrix and base classifier settings, which are then employed by a cost-sensitive classifier, wrapped around the base classifier. Individual fitness computation is the most intensive task in the algorithm, but it also presents a high parallelization potential. Two different parallelization alternatives have been explored: a computation-driven approach, and a data-driven approach. Both have been developed within the Apache Watchmaker framework and deployed on Hadoop-based infrastructures. Experimental evaluations performed up to this point have indicated that the computation-driven approach achieves a good classification performance, but does not reduce the running time significantly, the data-driven approach reduces the running time for slow algorithms, such as the kNN and the SVM, while still yielding important performance improvements.
Databáze: OpenAIRE