Evaluations of imputation methods for missing data with random forest modeling: an application with unbalanced data with a three categories outcome, the linkage to HIV care Uganda study

Authors: Nadia B. Mendoza, Barbara A. Bailey, Susan M. Kiene, Nicolas A. Menzies, Rhoda K. Wanyenze, Katherine A. Schmarje, Michael Ediau, Seth C. Kalichman, Rose Naigino, Chii-Dean Lin
Publication year: 2023
Description: Background: Incomplete observation units may contain important information about the population being studied. Analysis restricted to the complete portion of the dataset, as done in most standard statistical software, may produce biased results or low statistical power. In this paper we investigate the behavior of imputation methods for a linkage to HIV care study dataset with more than 60% incomplete cases.
Methods: The missing data imputation algorithms amelia, missForest, mice and hmisc were considered. Two sets of simulations were conducted: the first with the subset of data containing only complete observations and the second with the whole dataset. Imputation accuracy and general behavior of each imputation algorithm were assessed in the first set. Random forest models were fit to evaluate overall prediction accuracy and sensitivity in both sets of simulations.
Results: The values imputed by missForest, a single imputation method, were more accurate for all incomplete variables and scenarios. Median overall prediction accuracy of HIV status, a three-level outcome, was slightly higher after imputation with missForest and amelia (approximately 52%, 61% and 65% for samples of 350, 700 and 1050 units, respectively). In general, different missing percentages (20%, 40% and 60%) did not result in large changes in prediction or imputation accuracy. With larger sample sizes, imputation accuracy and overall prediction accuracy improved, but results for sensitivity were mixed. In the second set of simulations, missForest and amelia produced better overall prediction accuracy with larger samples (1200 versus 350 units). For the new HIV positive class, hmisc and mice presented better sensitivity compared to incomplete samples, with median sensitivity of 65% versus 50% and approximately 90% versus 20% for samples of 350 and 1200 units, respectively.
Conclusions: Imputation with mice and hmisc may be used when the main interest is to increase sensitivity in predicting the new HIV positive class. Furthermore, it would be worthwhile to investigate which method offers the greatest advantage for other performance measures of interest.
Trial registration: NCT02545673
Database: OpenAIRE