A comparison of model-based imputation methods for handling missing predictor values in a linear regression model: A simulation study

Author:	Balkish Mohd Osman, Shamsiah Sapri, Sanizah Ahmad, Nadirah Othman, Haliza Hasan
Year of publication:	2017
Subject:	General linear model; Proper linear model; Computer science; Linear regression; Statistics; Covariate; Econometrics; Regression analysis; Imputation (statistics); Missing data; Factor regression model
Source:	AIP Conference Proceedings.
ISSN:	0094-243X
DOI:	10.1063/1.4995930
Description:	Missing covariate data is a common problem in regression analysis. Many researchers use ad hoc methods to overcome this problem because they are easy to implement. However, these methods require assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likelihood (ML) using the expectation-maximization (EM) algorithm and Multiple Imputation (MI) are more promising for dealing with the difficulties caused by missing data. Even so, an inappropriate imputation method can introduce serious bias into the parameter estimates. The main objective of this study is to provide a better understanding of missing data concepts, to help researchers select an appropriate missing data imputation method. A simulation study was performed to assess the effects of different missing data techniques on the performance of a regression model. The covariate data were generated from an underlying multivariate normal distribution, and the dependent variable was generated as a combination of the explanatory variables. Missing values in the covariates were simulated under the missing at random (MAR) mechanism. Four levels of missingness (10%, 20%, 30% and 40%) were imposed. The ML and MI techniques available within SAS software were investigated. A linear regression model was fitted and two model performance measures, MSE and R-squared, were obtained. The results showed that MI is superior in handling missing data, with the highest R-squared and lowest MSE, when the percentage of missingness is less than 30%. Neither method was able to handle levels of missingness larger than 30%.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::f9bf0cefd1865bdf5857197319509129 https://doi.org/10.1063/1.4995930
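
The simulation design summarized in the abstract (multivariate normal covariates, a response built from the explanatory variables, MAR missingness at four levels, then MSE and R-squared from a fitted linear regression) can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the authors' SAS procedure: the sample size, covariance matrix, regression coefficients, the logistic MAR rule, and all function names are assumptions chosen for illustration. The imputation step is a simplified stochastic regression imputation repeated five times, with MSE and R-squared simply averaged across the imputed data sets; it does not draw the imputation-model parameters and does not apply Rubin's pooling rules, so it only approximates what a full MI procedure does.

```python
import numpy as np

def simulate(n=500, seed=0):
    """Draw covariates from a multivariate normal and build a linear response.
    The covariance matrix and coefficients are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.5, 0.3],
                    [0.5, 1.0, 0.4],
                    [0.3, 0.4, 1.0]])
    X = rng.multivariate_normal(np.zeros(3), cov, size=n)
    beta = np.array([1.0, 2.0, -1.0, 0.5])  # intercept followed by three slopes
    y = beta[0] + X @ beta[1:] + rng.normal(0.0, 1.0, size=n)
    return X, y

def impose_mar(X, rate, seed=0):
    """Set a fraction `rate` of column 0 to NaN. The chance of deletion depends
    only on column 1, which stays fully observed, so the mechanism is MAR."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    weights = 1.0 / (1.0 + np.exp(-X[:, 1]))  # assumed logistic rule
    idx = rng.choice(n, size=int(round(rate * n)), replace=False,
                     p=weights / weights.sum())
    X_miss = X.copy()
    X_miss[idx, 0] = np.nan
    return X_miss

def fit_ols(X, y):
    """Ordinary least squares with intercept; returns coefficients, MSE, R-squared."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sse = float(resid @ resid)
    mse = sse / (len(y) - A.shape[1])
    r2 = 1.0 - sse / float(np.sum((y - y.mean()) ** 2))
    return coef, mse, r2

def impute_once(X_miss, y, rng):
    """Stochastic regression imputation of column 0 from the other covariates
    and the response, fitted on complete cases (a simplification of MI)."""
    miss = np.isnan(X_miss[:, 0])
    Z = np.column_stack([X_miss[:, 1:], y])
    coef, mse, _ = fit_ols(Z[~miss], X_miss[~miss, 0])
    X_imp = X_miss.copy()
    pred = coef[0] + Z[miss] @ coef[1:]
    X_imp[miss, 0] = pred + rng.normal(0.0, np.sqrt(mse), size=int(miss.sum()))
    return X_imp

X, y = simulate()
results = {}
for rate in (0.10, 0.20, 0.30, 0.40):
    X_miss = impose_mar(X, rate, seed=1)
    rng = np.random.default_rng(2)
    fits = [fit_ols(impute_once(X_miss, y, rng), y) for _ in range(5)]
    results[rate] = (float(np.mean([f[1] for f in fits])),
                     float(np.mean([f[2] for f in fits])))
    print(f"missing={rate:.0%}  MSE={results[rate][0]:.3f}  R2={results[rate][1]:.3f}")
```

The numbers this prints depend entirely on the assumed settings and a single simulated data set, so they should not be read as a reproduction of the study's findings; a fuller comparison would repeat the simulation many times and use an established MI or EM implementation.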