COMPARING THE PREDICTIVE PERFORMANCE OF OLS AND 7 ROBUST LINEAR REGRESSION ESTIMATORS ON A REAL AND SIMULATED DATASETS

Author: Sacha Varin
Year of publication: 2021
Subject:
Source: International Journal of Engineering Applied Sciences and Technology. 5
ISSN: 2455-2143
DOI: 10.33564/ijeast.2021.v05i11.002
Description: Robust regression techniques are relevant tools for investigating data contaminated with influential observations. The article briefly reviews and describes 7 robust estimators for linear regression, including popular ones (Huber M, Tukey's bisquare M, and least absolute deviation, also called L1 or median regression), some that combine high breakdown and high efficiency [fast MM (Modified M-estimator), fast τ-estimator and HBR (High breakdown rank-based)], and one designed to handle small samples [distance-constrained maximum likelihood (DCML)]. We include the fast MM and fast τ-estimators because we use the fast-robust bootstrap (FRB) for MM and τ-estimators. Our objective is to compare predictive performance on a real data application using OLS (ordinary least squares) and to propose alternatives based on 7 different robust estimators. We also run simulations under various combinations of 4 factors: sample size, percentage of outliers, percentage of leverage points and number of covariates. Predictive performance is evaluated by cross-validation, with the mean squared error (MSE) as the criterion to minimize. We use the R language for data analysis. In the real dataset, OLS provides the best prediction. DCML and the popular robust estimators give good predictive results as well, especially the Huber M-estimator. In simulations involving 3 predictors and n=50, the results clearly favor fast MM, the fast τ-estimator and HBR, whatever the proportion of outliers. DCML and Tukey M are also good estimators when n=50, especially when the percentage of outliers is small (5% and 10%). With 10 predictors, however, HBR, fast MM, fast τ and especially DCML give better results for n=50. HBR, fast MM and DCML provide better results for n=500. For n=5000, all the robust estimators give the same results, independently of the percentage of outliers. If we vary the percentages of outliers and leverage points simultaneously, DCML, fast MM and HBR are good estimators for n=50 and p=3. For n=500, fast MM, fast τ and HBR provi
Database: OpenAIRE