SICE: an improved missing data imputation technique
Authors: | Abu Sayed Md. Latiful Hoque, Shahidul Islam Khan |
---|---|
Publication year: | 2020 |
Subject: | Multivariate statistics; Imputation (statistics); Missing data; Missing Data Imputation; Single Imputation; Multiple Imputation; MICE; Categorical variable; Binary data; Binary number; Big data; Data analysis; Data Analytics; Data mining; Computer science; Information Systems; Information Systems and Management; Computer Networks and Communications; Hardware and Architecture; 01 natural sciences; 0101 mathematics; 010104 statistics & probability; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; lcsh:Computer engineering. Computer hardware; lcsh:TK7885-7895; lcsh:Electronic computers. Computer science; lcsh:QA75.5-76.95; lcsh:Information technology; lcsh:T58.5-58.64 |
Source: | Journal of Big Data, Vol 7, Iss 1, Pp 1-21 (2020) |
ISSN: | 2196-1115 |
DOI: | 10.1186/s40537-020-00313-w |
Description: | In data analytics, missing data degrades performance, and incorrect imputation of missing values can lead to wrong predictions. In the era of big data, when massive volumes of data are generated every second and stakeholders are concerned with putting these data to use, handling missing values efficiently becomes even more important. This paper proposes a new technique for missing data imputation that is a hybrid of single and multiple imputation. We propose an extension of the popular Multivariate Imputation by Chained Equation (MICE) algorithm in two variations to impute categorical and numeric data. We also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from different hospitals and diagnostic centers in Bangladesh while maintaining data privacy, along with three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of the proposed algorithms with the existing algorithms on these datasets. Experimental results show that the proposed algorithm achieves a 20% higher F-measure for binary data imputation and 11% lower error for numeric data imputation than its competitors, with similar execution time. |
Database: | OpenAIRE |
External link: |
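
The description refers to chained-equation imputation (MICE) without showing the mechanics. The sketch below illustrates only the general chained-regression idea for numeric columns: fill the gaps with column means, then repeatedly re-predict each incomplete column from the others. It is not the SICE algorithm from the paper and not a full MICE implementation (it is deterministic, with no random draws or multiple imputed datasets). The function names `solve`, `fit_ols`, and `chained_impute`, the `ridge` term, and the toy data are assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def fit_ols(xs, ys, ridge=1e-8):
    """Least-squares coefficients (intercept first) via the normal equations.

    The tiny ridge term is an assumption added only for numerical stability.
    """
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def chained_impute(data, n_rounds=10):
    """Fill None entries by cycling regressions over the incomplete columns."""
    n_cols = len(data[0])
    missing = [[v is None for v in row] for row in data]
    filled = [row[:] for row in data]  # work on a copy; the input is untouched

    # Step 1: initialise every gap with the mean of the observed values.
    for j in range(n_cols):
        observed = [row[j] for row in data if row[j] is not None]
        mean = sum(observed) / len(observed)
        for i, row in enumerate(filled):
            if missing[i][j]:
                row[j] = mean

    # Step 2: re-predict each incomplete column from all other columns.
    for _ in range(n_rounds):
        for j in range(n_cols):
            if not any(flags[j] for flags in missing):
                continue  # column j is fully observed
            predictors = [[row[c] for c in range(n_cols) if c != j]
                          for row in filled]
            # Train only on rows where column j was actually observed.
            train_x = [p for p, flags in zip(predictors, missing) if not flags[j]]
            train_y = [row[j] for row, flags in zip(filled, missing) if not flags[j]]
            beta = fit_ols(train_x, train_y)
            for row, p, flags in zip(filled, predictors, missing):
                if flags[j]:
                    row[j] = beta[0] + sum(b * v for b, v in zip(beta[1:], p))
    return filled


if __name__ == "__main__":
    # Toy data following y = 2x + 1, with one gap in each column.
    rows = [[1.0, 3.0], [2.0, 5.0], [3.0, None], [4.0, 9.0], [None, 11.0]]
    for row in chained_impute(rows):
        print(row)
```

In practice a maintained implementation would normally be preferred over a hand-written solver; this version avoids third-party dependencies only so that the steps stay visible.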