Generation of Gaussian sets for clustering methods assessment
Author: | Nacéra Benamrane, Mohammed Ouali, Radhwane Gherbaoui |
---|---|
Year of publication: | 2021 |
Subject: |
Information Systems and Management
business.industry; Computer science; Gaussian; Pattern recognition; 02 engineering and technology; 01 natural sciences; Fuzzy logic; Data set; Set (abstract data type); 010104 statistics & probability; symbols.namesake; Expectation–maximization algorithm; 0202 electrical engineering, electronic engineering, information engineering; symbols; 020201 artificial intelligence & image processing; Sensitivity (control systems); Artificial intelligence; 0101 mathematics; business; Cluster analysis; Generator (mathematics) |
Source: | Data & Knowledge Engineering. :101876 |
ISSN: | 0169-023X |
DOI: | 10.1016/j.datak.2021.101876 |
Description: | Clustering methods are generally used to study the homogeneity of a set of observations. The results of clustering differ from one method to another; even the same method or validity index can give different outcomes depending on its initial parameters. Analytical evaluation appears to be insufficient for studying the behavior of clustering methods due to its ad hoc nature. Even when real data sets are used to evaluate clustering methods, artificial data remains fundamental for assessing performance, since it allows different test scenarios with known structures to be created. The main drawback of existing methods for generating artificial data is that they do not account for sensitivity to cluster size. In this paper, we propose an automatic method: the high-dimensional artificial Gaussian mixture generator. By formally quantifying the overlap, the generator preserves the notion of the overlap rate between the mixture components. The advantages of this generator are its use of the overlap rate, the unlimited number of mixture components, the high dimensionality of the observations, and the fact that it does not rely on visual inspection to quantify the overlap. In addition, we evaluate the k-means, fuzzy c-means (FCM), FCM-based splitting algorithm (FBSA), and expectation-maximization (EM) algorithms in different dimensions. The results obtained confirm previous work and reveal new findings that do not emerge when only 1D and 2D artificial data are used. |
Database: | OpenAIRE |
External link: |