Ensemble learning by data resampling

Autor: Goebel, Michael
Jazyk: angličtina
Rok vydání: 2004
Popis: We investigate ensemble learning methods that construct a classifier ensemble by repeatedly sampling the original training data and building a member classifier from each subsample. We find that the performance of standard Bagging can frequently be improved upon by simple variations of the sampling scheme, such as varying the sample size, or sampling without replacement instead of sampling with replacement. For all methods tested, the ensemble performance is greatly dependent on properties of the problem domain and the data sample. We try to explain the observed performances of the various ensemble methods qualitatively and quantitatively, and find that current ensemble analysis methods such as margin distributions, Error diagrams, bias-variance decomposition etc. are not well suited for this task. We postulate that the primary explanation for the performance of ensemble methods is to be found in their effects on accuracy of the ensemble members on one side and diversity among them on the other side, two contradictory goals which necessitate a compromise, or trade-off. This motivates the presentation of a precise yet general definition of what diversity is and how it is to be measured. This definition has the desirable property that it is applicable to all single-stage voting ensembles, under any given loss function. We then study the mathematical relationships between ensemble loss, mean member loss, and diversity. For squared loss, we show that our definitions lead to the well known ensemble loss decomposition, and extend this decomposition to the case where the ensemble members, instead of a real number, return a probability distribution over R. For the case of 0-1 loss, we derive the exact mathematical relations between ensemble loss, mean member loss, and diversity. Studying those relations provides some valuable insights into ensembles behavior, and produces some unexpected hence interesting results. These results are also confirmed by the experimental observations. Turning our attention back to the performance of Bagging variants, we show how the loss decomposition can be used to reduce the number of parameter settings which have to be tried out experimentally in order to find a well-performing ensemble method for a given particular problem.
Databáze: Networked Digital Library of Theses & Dissertations