Data Imputation for Symbolic Regression with Missing Values: A Comparative Study

Authors: Qi Chen, Baligh Al-Helali, Mengjie Zhang, Bing Xue
Year of publication: 2020
Subject:
Source: SSCI
DOI: 10.1109/ssci47803.2020.9308216
Description: Symbolic regression via genetic programming is considered a crucial machine learning tool for empirical modelling. However, real-world data sets commonly suffer from data quality problems such as noise, outliers, and missing values. Although several approaches can be adopted to deal with data incompleteness in machine learning, most studies consider classification tasks, and only a few have considered symbolic regression with missing values. This work investigates the performance of symbolic regression using genetic programming on real-world data sets that have missing values, by studying how different imputation methods affect symbolic regression performance. The experiments are conducted on thirteen real-world incomplete data sets with different ratios of missing values. The experimental results show that, although the performance of the imputation methods differs with the data set, CART has a better effect than the others. This might be due to its ability to deal with both categorical and numerical variables. Moreover, the use of imputation methods is observed to be superior to the commonly used deletion strategy.
Database: OpenAIRE
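
Illustrative note: the description contrasts the deletion strategy with imputation as ways of preparing incomplete data for symbolic regression. The following is a minimal sketch of that contrast only, not the authors' experimental setup. It assumes missing values are encoded as None, uses a made-up toy table, and substitutes simple column-mean imputation for the CART-based imputation studied in the paper. The function names listwise_deletion and mean_imputation are hypothetical.

```python
from statistics import mean


def listwise_deletion(rows):
    """Deletion strategy: drop every row that has at least one missing value."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_imputation(rows):
    """Fill each missing value with the mean of the observed values in its column.

    This is a simple stand-in for illustration; it is not CART imputation.
    A column with no observed values at all would make mean() raise an error.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy data: four instances, three numeric features, None = missing.
data = [
    [1.0, 2.0, 3.0],
    [2.0, None, 5.0],
    [None, 4.0, 7.0],
    [4.0, 6.0, 9.0],
]

deleted = listwise_deletion(data)
imputed = mean_imputation(data)

print("rows kept by deletion:", len(deleted))
print("rows kept by imputation:", len(imputed))
```

In this toy case deletion discards every instance with a gap, while imputation keeps all instances available for a downstream regression method, which is the trade-off the study examines on real data.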