Resampling Techniques in Cluster Analysis: Is Subsampling Better Than Bootstrapping?

Authors: Hans-Georg Bartel, Hans-Joachim Mucha
Year of publication: 2015
Subject:
Source: Data Science, Learning by Latent Structures, and Knowledge Discovery ISBN: 9783662449820
ECDA
DOI: 10.1007/978-3-662-44983-7_10
Description: For two small toy data sets, we found that subsampling performs much worse than bootstrapping at finding the true number of clusters K (Mucha and Bartel, Soft bootstrapping in cluster analysis and its comparison with other resampling methods. In: M. Spiliopoulou, L. Schmidt-Thieme, R. Janning (eds.) Data analysis, machine learning and knowledge discovery. Springer, Cham, 2014). In contrast, Moller and Dorte (Intell Data Anal 10:139–162, 2006) pointed out that “subsampling … clearly outperformed the bootstrapping technique in the detection of correct clustering consensus results.” These conflicting findings call for further investigation. We therefore compare the two resampling techniques on real and artificial data sets using two indices, the adjusted Rand index (ARI) and the Jaccard index. We consider hierarchical cluster analysis methods because they find all partitions into K = 2, 3, … clusters in a single run and, moreover, these results are (usually) unique (Spaeth, Cluster analysis algorithms for data reduction and classification of objects. Ellis Horwood, Chichester, 1982). The methods are tested on two synthetic data sets and two real data sets. In these tests, bootstrapping is better than subsampling at finding the true number of clusters.
Database: OpenAIRE
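
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not name the hierarchical method, so single linkage is used here as a simple stand-in, and the 75% subsample fraction, the toy grid data, and all function names (`single_linkage`, `bootstrap_indices`, `subsample_indices`, `stability`) are assumptions made for illustration. For each candidate K, the full data set and each resample are clustered, and the two partitions are compared on the resampled objects with the ARI; a higher mean ARI suggests a more stable K.

```python
import math
import random
from collections import Counter


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label lists of equal length."""
    total_pairs = math.comb(len(a), 2)
    sum_ij = sum(math.comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(math.comb(c, 2) for c in Counter(a).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def single_linkage(points, k):
    """Naive agglomerative single linkage (stand-in method); returns labels."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def bootstrap_indices(n, rng):
    """Draw n objects with replacement (duplicates are possible)."""
    return [rng.randrange(n) for _ in range(n)]


def subsample_indices(n, rng, fraction=0.75):
    """Draw a fraction of the objects without replacement (assumed 75%)."""
    return rng.sample(range(n), int(fraction * n))


def stability(points, k, draw, n_rep, rng):
    """Mean ARI between the full-data partition and resampled partitions."""
    reference = single_linkage(points, k)
    scores = []
    for _ in range(n_rep):
        idx = draw(len(points), rng)
        labels = single_linkage([points[i] for i in idx], k)
        scores.append(adjusted_rand_index([reference[i] for i in idx], labels))
    return sum(scores) / len(scores)


# Invented toy data: two well-separated 4 x 4 grids, so the true K is 2.
grid = [(float(x), float(y)) for x in range(4) for y in range(4)]
points = grid + [(x + 20.0, y + 20.0) for x, y in grid]

for k in (2, 3, 4):
    boot = stability(points, k, bootstrap_indices, 10, random.Random(0))
    sub = stability(points, k, subsample_indices, 10, random.Random(0))
    print(f"K={k}  bootstrap ARI={boot:.3f}  subsample ARI={sub:.3f}")
```

Swapping `adjusted_rand_index` for a Jaccard-based measure, or `single_linkage` for another hierarchical method, would only change the corresponding function; the resampling loop in `stability` stays the same.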