Assessing the data complexity of imbalanced datasets

Autor: André C. P. L. F. de Carvalho, Ana Carolina Lorena, Marcilio C. P. de Souto, Luís P. F. Garcia, Victor H. Barella
Přispěvatelé: Universidade de São Paulo (USP), Laboratoire d'Informatique Fondamentale d'Orléans (LIFO), Université d'Orléans (UO)-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA), Technological Institute of Aeronautics (ITA)
Rok vydání: 2021
Předmět:
Zdroj: Information Sciences
Information Sciences, Elsevier, 2021, 553, pp.83-109. ⟨10.1016/j.ins.2020.12.006⟩
Repositório Institucional da USP (Biblioteca Digital da Produção Intelectual)
Universidade de São Paulo (USP)
instacron:USP
ISSN: 0020-0255
Popis: Imbalanced datasets are an important challenge in supervised Machine Learning (ML). According to the literature, class imbalance does not necessarily impose difficulties for ML algorithms. Difficulties mainly arise from other characteristics, such as overlapping between classes and complex decision boundaries. For binary classification tasks, calculating imbalance is straightforward, e.g., the ratio between class sizes. However, measuring more relevant characteristics, such as class overlapping, is not trivial. In the past years, complexity measures able to assess more relevant dataset characteristics have been proposed. In this paper, we investigate their effectiveness on real imbalanced datasets and how they are affected by applying different data imbalance treatments (DIT). For such, we perform two data-driven experiments: (1) We adapt the complexity measures to the context of imbalanced datasets. The experimental results show that our proposed measures assess the difficulty of imbalanced problems better than the original ones. We also compare the results with the state-of-art on data complexity measures for imbalanced datasets. (2) We analyze the behavior of complexity measures before and after applying DITs. According to the results, the difference in data complexity, in general, correlates to the predictive performance improvement obtained by applying DITs to the original datasets.
Databáze: OpenAIRE