SICE: an improved missing data imputation technique

Authors: Abu Sayed Md. Latiful Hoque, Shahidul Islam Khan
Publication year: 2020
Subjects:
Multivariate statistics
lcsh:Computer engineering. Computer hardware
Information Systems and Management
Computer Networks and Communications
Computer science
Big data
Single Imputation
Binary number
Multiple Imputation
lcsh:TK7885-7895
02 engineering and technology
computer.software_genre
01 natural sciences
lcsh:QA75.5-76.95
010104 statistics & probability
0202 electrical engineering, electronic engineering, information engineering
Imputation (statistics)
0101 mathematics
Categorical variable
lcsh:T58.5-58.64
lcsh:Information technology
business.industry
Research
Data Analytics
Missing data
MICE
Hardware and Architecture
Data_GENERAL
Binary data
Data analysis
020201 artificial intelligence & image processing
lcsh:Electronic computers. Computer science
Data mining
business
Missing Data Imputation
computer
Information Systems
Source: Journal of Big Data, Vol 7, Iss 1, Pp 1-21 (2020)
Journal of Big Data
ISSN: 2196-1115
DOI: 10.1186/s40537-020-00313-w
Description: In data analytics, missing data degrades performance, and incorrect imputation of missing values can lead to wrong predictions. In this era of big data, when massive volumes of data are generated every second and the use of these data is a major concern for stakeholders, handling missing values efficiently becomes more important. In this paper, we propose a new technique for missing data imputation that is a hybrid of single and multiple imputation techniques. We propose an extension of the popular Multivariate Imputation by Chained Equations (MICE) algorithm in two variations, to impute categorical and numeric data. We also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from different hospitals and diagnostic centers in Bangladesh, maintaining the privacy of the data. We also collected three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of our proposed algorithms with existing algorithms using these datasets. Experimental results show that our proposed algorithm achieves a 20% higher F-measure for binary data imputation and 11% less error for numeric data imputation than its competitors, with similar execution time.
Database: OpenAIRE