Data Imputation for Symbolic Regression with Missing Values: A Comparative Study
Authors: | Qi Chen, Baligh Al-Helali, Mengjie Zhang, Bing Xue |
---|---|
Year of publication: | 2020 |
Subject: |
Computer science; Genetic programming; Machine learning; Missing data; Data modeling; Data set; Data quality; Imputation (statistics); Artificial intelligence; Symbolic regression; Categorical variable; Computer and information sciences; Engineering and technology; Natural sciences; Computation theory & mathematics; Electrical engineering, electronic engineering, information engineering; Artificial intelligence & image processing |
Source: | SSCI |
DOI: | 10.1109/ssci47803.2020.9308216 |
Description: | Symbolic regression via genetic programming is considered a crucial machine learning tool for empirical modelling. In practice, however, real-world data sets commonly suffer from data quality problems such as noise, outliers, and missing values. Although several approaches can be adopted to deal with data incompleteness in machine learning, most studies consider classification tasks, and only a few have considered symbolic regression with missing values. This work investigates the performance of symbolic regression using genetic programming on real-world data sets that have missing values by studying how different imputation methods affect symbolic regression performance. The experiments use thirteen real-world incomplete data sets with different ratios of missing values. The results show that, although the performance of the imputation methods differs with the data set, CART has a better effect than the other methods. This might be due to its ability to deal with both categorical and numerical variables. Moreover, imputation methods are observed to be superior to the commonly used deletion strategy. |
Database: | OpenAIRE |
External link: |
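
The description contrasts the deletion strategy with CART-based imputation. The following is a minimal sketch of that contrast, not the authors' implementation: the record does not give code, so the function names (`listwise_delete`, `fit_tree`, `cart_impute`), the tree settings (`max_depth`, `min_leaf`), and the toy data are all assumptions made for illustration. It also assumes numerical features only and that missing values (marked `None`) occur in a single column.

```python
from statistics import mean


def listwise_delete(rows):
    """Deletion strategy: keep only rows with no missing (None) entries."""
    return [r for r in rows if all(v is not None for v in r)]


def _sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(X, y, depth=0, max_depth=3, min_leaf=2):
    """Fit a small CART-style regression tree, returned as a nested dict."""
    if depth >= max_depth:
        return {"leaf": mean(y)}
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(x[j] for x in X)):
            left = [i for i, x in enumerate(X) if x[j] <= t]
            right = [i for i, x in enumerate(X) if x[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:  # no admissible split: stop and predict the mean
        return {"leaf": mean(y)}
    _, j, t, left, right = best
    return {
        "feature": j,
        "threshold": t,
        "left": fit_tree([X[i] for i in left], [y[i] for i in left],
                         depth + 1, max_depth, min_leaf),
        "right": fit_tree([X[i] for i in right], [y[i] for i in right],
                          depth + 1, max_depth, min_leaf),
    }


def predict(tree, x):
    """Walk the tree from the root to a leaf and return the leaf value."""
    while "leaf" not in tree:
        go_left = x[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]


def cart_impute(rows, col):
    """Fill None entries in column `col` using a tree fitted on complete rows.

    Assumption: only column `col` contains missing values.
    """
    complete = listwise_delete(rows)
    X = [[v for k, v in enumerate(r) if k != col] for r in complete]
    y = [r[col] for r in complete]
    tree = fit_tree(X, y)
    filled = []
    for r in rows:
        r = list(r)
        if r[col] is None:
            r[col] = predict(tree, [v for k, v in enumerate(r) if k != col])
        filled.append(r)
    return filled


# Toy data (hypothetical): the second column is a step function of the first.
rows = [
    [1, 10], [2, 10], [3, 10],
    [4, 20], [5, 20], [6, 20],
    [2.5, None], [5.5, None],
]

deleted = listwise_delete(rows)  # the two incomplete rows are discarded
filled = cart_impute(rows, 1)    # the two incomplete rows are kept and filled

print(len(deleted), len(filled))
print(filled[6], filled[7])
```

Deletion shrinks the training set handed to the downstream regressor, whereas imputation keeps every instance at the cost of introducing estimated values. Which of the two helps symbolic regression more on a given data set is the empirical question the study addresses.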