A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data
Authors: | Nan Yin, Zhaozhao Xu, Xi Han, Yue Kou, Tiezheng Nie, Derong Shen |
---|---|
Publication year: | 2021 |
Subject: |
Information Systems and Management; Computer science; Decision tree; k-means clustering; Sample (statistics); Computer Science Applications; Theoretical Computer Science; Random forest; Artificial Intelligence; Control and Systems Engineering; Oversampling; Sensitivity (control systems); Algorithm; Software; Cluster based; Interpolation; 05 social sciences; 0503 education; 050301 education; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing |
Source: | Information Sciences. 572:574-589 |
ISSN: | 0020-0255 |
DOI: | 10.1016/j.ins.2021.02.056 |
Description: | The C4.5 decision tree algorithm offers high classification accuracy, fast computation and comprehensible classification rules, so it is widely used for medical data analysis. However, for imbalanced medical data, the classification accuracy of decision-tree-based models is not ideal. Therefore, this paper proposes a cluster-based oversampling algorithm (KNSMOTE) that combines the synthetic minority oversampling technique (SMOTE) with the k-means algorithm. The classes assigned by k-means clustering are compared with the original sample classes to select the "safe samples", that is, the samples whose class is not changed by clustering. The "safe samples" are then linearly interpolated to synthesize new samples. The improved SMOTE sets the oversampling ratio according to the imbalance ratio of the original samples, so that the number of synthesized samples is the same as that of the original samples. Compared with other oversampling algorithms on 8 UCI datasets, the proposed algorithm achieves significant advantages. When the algorithm was applied to medical datasets, the average Sensitivity and Specificity of the Random forest (RF) classifier were 99.84% and 99.56%, respectively. |
Database: | OpenAIRE |
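
The pipeline outlined in the description (cluster with k-means, keep the "safe samples", interpolate linearly) can be illustrated with a minimal sketch. This is not the authors' KNSMOTE implementation; several details are assumptions made only for illustration: a sample is treated as "safe" when its label matches the majority label of its k-means cluster, initial centres are chosen by a deterministic farthest-first rule, and the number of synthetic samples is simply the gap between class sizes. The function names `kmeans_labels`, `select_safe` and `smote_interpolate` are hypothetical, as is the toy data.

```python
import math
import random
from collections import Counter


def kmeans_labels(points, k, iters=10):
    """Lloyd's k-means; farthest-first initialisation is an assumption."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def select_safe(points, labels, k=2):
    """Keep samples whose label equals their cluster's majority label (assumed rule)."""
    assign = kmeans_labels(points, k)
    majority = {}
    for c in set(assign):
        votes = Counter(l for l, a in zip(labels, assign) if a == c)
        majority[c] = votes.most_common(1)[0][0]
    return [(p, l) for p, l, a in zip(points, labels, assign) if majority[a] == l]


def smote_interpolate(samples, n_new, k_neighbors=2, seed=0):
    """SMOTE-style step: new point on the segment between a sample and a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        others = [p for j, p in enumerate(samples) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(samples[i], p))[:k_neighbors]
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(samples[i], nb)))
    return synthetic


# Toy data (hypothetical): class 0 is the majority, class 1 the minority.
majority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.2), (0.2, 0.8)]
minority = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (0.5, 0.5)]  # last point is noise
points = majority + minority
labels = [0] * len(majority) + [1] * len(minority)

safe = select_safe(points, labels, k=2)
safe_minority = [p for p, l in safe if l == 1]
n_new = len(majority) - len(minority)  # assumption: oversample until classes balance
synthetic = smote_interpolate(safe_minority, n_new)
print(len(safe), synthetic)
```

In this toy run the minority point at (0.5, 0.5) falls into the majority cluster and is filtered out, so the synthetic samples are interpolated only among the three minority points near (10, 10). In practice one would more likely rely on maintained implementations of k-means and SMOTE than on hand-written ones.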