THU0556 MISSING DATA AND MULTIPLE IMPUTATION IN RHEUMATOID ARTHRITIS REGISTRIES USING SEQUENTIAL RANDOM FOREST METHOD

Authors: A. Al-Qadhi, Yaser Ali, N. Alhadhood, Adel Al-Awadhi, Aqeel Ghanem, Adeeba Al-Herz, F. Abutiban, Eman Hasan, E. Nahar, H. Behbehani, Khulood Saleh, Hebah Alhajeri, Mohammed Hussain, Sawsan Hayat, Ahmad Alenizi, Ahmad Alsaber, Waleed Al-Kandari, A. Aledei, Jiazhu Pan
Year of publication: 2020
Subject:
Source: Annals of the Rheumatic Diseases. 79:519.1-519
ISSN: 1468-2060
0003-4967
Description: Background: Missing data in clinical epidemiological research violate the intention-to-treat principle, reduce statistical power, and can induce bias if they are related to patients' response to treatment. In multiple imputation (MI), covariates are included in the imputation equation to predict the values of missing data.
Objectives: To find the best approach to estimate and impute the missing values in Kuwait Registry for Rheumatic Diseases (KRRD) patient data.
Methods: Several methods for dealing with missing data were implemented: Multivariate Imputation by Chained Equations (MICE), K-Nearest Neighbors (KNN), Bayesian Principal Component Analysis (BPCA), EM with Bootstrapping (Amelia II), Sequential Random Forest (MissForest), and mean imputation. The best imputation method was taken to be the one with the lowest Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Kolmogorov–Smirnov D test statistic (KS) between the imputed data points and the original data points that had subsequently been set to missing.
Results: A total of 1,685 rheumatoid arthritis (RA) patients and 10,613 hospital visits were included in the registry. Several variables had missing values exceeding 5% of the total values: duration of RA (13.0%), smoking history (26.3%), rheumatoid factor (7.93%), anti-citrullinated peptide antibodies (20.5%), anti-nuclear antibodies (20.4%), sicca symptoms (19.2%), family history of a rheumatic disease (28.5%), steroid therapy (5.94%), ESR (5.16%), CRP (22.9%), and SDAI (38.0%). Among the methods used, MissForest estimated the missing values most accurately. It had the lowest imputation errors for both continuous and categorical variables at each frequency of missingness, and it had the smallest prediction differences when the models used imputed laboratory values. In both data sets, MICE had the second-lowest imputation errors and prediction differences, followed by KNN and mean imputation.
Conclusion: MissForest is a highly accurate method of imputation for missing data in KRRD and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in clinical predictive models. This approach can be used in registries, including those for rheumatoid arthritis patients, to improve the accuracy of data.
Disclosure of Interests: None declared
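The evaluation described in the Methods (set known values to missing, impute them, then score the imputed values against the originals with RMSE, MAE, and the KS D statistic) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the data are synthetic, and mean imputation (the simplest of the compared methods) stands in for MissForest or any other imputer.

```python
import math
import random
from bisect import bisect_right


def rmse(truth, guess):
    """Root mean square error between original and imputed values."""
    return math.sqrt(sum((t - g) ** 2 for t, g in zip(truth, guess)) / len(truth))


def mae(truth, guess):
    """Mean absolute error between original and imputed values."""
    return sum(abs(t - g) for t, g in zip(truth, guess)) / len(truth)


def ks_d(a, b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def mean_impute(values):
    """Stand-in imputer: replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def evaluate(complete, impute, missing_rate=0.2, seed=0):
    """Mask a fraction of known values, impute them, and score the imputations."""
    rng = random.Random(seed)
    n_mask = max(1, round(missing_rate * len(complete)))
    masked_idx = sorted(rng.sample(range(len(complete)), n_mask))
    masked = set(masked_idx)
    with_gaps = [None if i in masked else v for i, v in enumerate(complete)]
    imputed = impute(with_gaps)
    truth = [complete[i] for i in masked_idx]
    guess = [imputed[i] for i in masked_idx]
    return {"rmse": rmse(truth, guess), "mae": mae(truth, guess), "ks": ks_d(truth, guess)}


# Synthetic, made-up values standing in for one continuous laboratory variable.
data_rng = random.Random(42)
synthetic = [data_rng.gauss(30.0, 10.0) for _ in range(200)]
print(evaluate(synthetic, mean_impute, missing_rate=0.2))
```

Passing a different imputer as the `impute` argument and repeating the call at several values of `missing_rate` gives the kind of comparison the abstract reports, where the method with the lowest scores is preferred.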
Database: OpenAIRE