Assessing the data complexity of imbalanced datasets

Authors:	André C. P. L. F. de Carvalho, Ana Carolina Lorena, Marcilio C. P. de Souto, Luís P. F. Garcia, Victor H. Barella
Contributors:	Universidade de São Paulo (USP), Laboratoire d'Informatique Fondamentale d'Orléans (LIFO), Université d'Orléans (UO)-Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA), Technological Institute of Aeronautics (ITA)
Year of publication:	2021
Subject:	Information Systems and Management; Computer science; Context (language use); 02 engineering and technology; Data complexity; Machine learning; computer.software_genre; [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI]; Theoretical Computer Science; MATEMÁTICA DA COMPUTAÇÃO; Artificial intelligence; 0202 electrical engineering, electronic engineering, information engineering; ComputingMilieux_MISCELLANEOUS; business.industry; 05 social sciences; 050301 education; Class (biology); Computer Science Applications; ComputingMethodologies_PATTERNRECOGNITION; Binary classification; Control and Systems Engineering; 020201 artificial intelligence & image processing; business; 0503 education; computer; Software
Source:	Information Sciences, Elsevier, 2021, 553, pp. 83-109. ⟨10.1016/j.ins.2020.12.006⟩; Repositório Institucional da USP (Biblioteca Digital da Produção Intelectual), Universidade de São Paulo (USP), instacron:USP
ISSN:	0020-0255
Description:	Imbalanced datasets are an important challenge in supervised Machine Learning (ML). According to the literature, class imbalance does not necessarily impose difficulties for ML algorithms. Difficulties mainly arise from other characteristics, such as overlap between classes and complex decision boundaries. For binary classification tasks, calculating imbalance is straightforward, e.g., as the ratio between class sizes. However, measuring more relevant characteristics, such as class overlap, is not trivial. In recent years, complexity measures able to assess these more relevant dataset characteristics have been proposed. In this paper, we investigate their effectiveness on real imbalanced datasets and how they are affected by applying different data imbalance treatments (DITs). To this end, we perform two data-driven experiments: (1) We adapt the complexity measures to the context of imbalanced datasets. The experimental results show that our proposed measures assess the difficulty of imbalanced problems better than the original ones. We also compare the results with the state of the art in data complexity measures for imbalanced datasets. (2) We analyze the behavior of complexity measures before and after applying DITs. According to the results, the difference in data complexity, in general, correlates with the predictive performance improvement obtained by applying DITs to the original datasets.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::4142f54211096b5a9c2eadd4c1f6b582 https://doi.org/10.1016/j.ins.2020.12.006
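
Note:	The abstract of this record states that, for binary classification, imbalance can be calculated as the ratio between class sizes. A minimal Python sketch of that single calculation is given here for illustration. The function name imbalance_ratio, the majority-over-minority convention, and the toy labels are assumptions made for this example; they are not taken from the paper, and the sketch does not implement any of the paper's complexity measures.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return majority class size divided by minority class size.

    Hypothetical helper for illustration; the majority/minority
    convention is an assumption, other definitions are in use.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    majority = max(counts.values())
    minority = min(counts.values())
    return majority / minority


# Toy example: 90 instances of class 0 and 10 instances of class 1.
toy_labels = [0] * 90 + [1] * 10
print(imbalance_ratio(toy_labels))
```

A ratio of 1 indicates perfectly balanced classes under this convention, and larger values indicate stronger imbalance. As the abstract stresses, this number alone says little about how hard the problem is, since overlap between classes and boundary complexity are not captured by it.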