The Problem of Redundant Variables in Random Forests (Problem zmiennych redundantnych w metodzie lasów losowych)

Author:	Mariusz Kubus
Contributors:	Opole University of Technology, Faculty of Production Engineering and Logistics, Department of Mathematics and IT Applications
Language:	English
Year of publication:	2018
Subject:	random forests; Computer science; redundant variables; Feature selection; C1; feature selection; C52; lcsh:Finance; lcsh:HG1-9999; dobór zmiennych (feature selection); lcsh:HF5410-5417.5; C38; business.industry; zmienne redundantne (redundant variables); lcsh:Marketing. Distribution of products; Dimensionality reduction; Supervised learning; Pattern recognition; General Medicine; Popularity; Random forest; Feature (computer vision); taksonomia cech (clustering of features); Outlier; Artificial intelligence; business; lasy losowe (random forests); clustering of features
Source:	Acta Universitatis Lodziensis. Folia Oeconomica, Vol 6, Iss 339, Pp 7-16 (2018)
ISSN:	0208-6018
Description:	Random forests are currently among the supervised learning methods most favored by practitioners. Part of their popularity comes from the fact that they can be applied without a time-consuming pre-processing step. Random forests handle mixed feature types regardless of their distributions, are robust to outliers, and have feature selection built into the learning algorithm. However, classification accuracy can decrease in the presence of redundant variables. In this paper, we discuss two approaches to the problem of redundant variables: feature selection and clustering of features. For feature selection, we consider two strategies of searching for the best feature subset; for feature clustering, we consider two formulas for aggregating the features within each cluster into synthetic variables. In the empirical experiment, we generate collinear (linearly dependent) predictors and add them to real datasets. Dimensionality reduction methods usually improve the accuracy of random forests, but none of them clearly outperforms the others.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::bc9e8024d3656d7c1aca34c23ddb004d https://czasopisma.uni.lodz.pl/foe/article/view/2552