SMGKM: An Efficient Incremental Algorithm for Clustering Document Collections

Authors: Bagirov, A, Seifollahi, S, Piccardi, M, Zare Borzeshi, E, Kruger, B
Language: English
Year of publication: 2023
Subject:
Description: Given a large unlabeled document collection, the aim of this paper is to develop an accurate and efficient algorithm for solving the clustering problem over this collection. Document collections typically contain tens or hundreds of thousands of documents, with thousands or tens of thousands of features (i.e., distinct words). Most existing clustering algorithms struggle to find accurate solutions on data sets of this size. The proposed algorithm overcomes this difficulty with an incremental approach: the number of clusters is increased progressively from an initial value of one up to a set value. At each iteration, the new candidate cluster is initialized using a partitioning approach which is guaranteed to minimize the objective function. Experiments were carried out over six diverse datasets and with different evaluation criteria, showing that the proposed algorithm outperformed comparable state-of-the-art clustering algorithms in all cases.
Database: OpenAIRE
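
Illustrative sketch (not part of the record): the incremental scheme outlined in the description, growing the number of clusters from one to a set value and refining after each addition, can be illustrated roughly as follows. This is not the authors' SMGKM implementation. The function names (`incremental_kmeans`, `lloyd`, `objective`) are hypothetical, the objective is assumed to be the sum of squared Euclidean distances, and the new center is seeded with a greedy gain rule borrowed from the fast global k-means heuristic, which stands in for the paper's partitioning-based initialization.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def objective(points: List[Point], centers: List[Point]) -> float:
    """Assumed objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def lloyd(points: List[Point], centers: List[Point], max_iter: int = 100) -> List[List[float]]:
    """Standard k-means refinement starting from the given centers."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def incremental_kmeans(points: List[Point], k_max: int) -> Tuple[List[List[float]], List[float]]:
    """Grow the solution from 1 to k_max clusters, refining after each addition."""
    centers = [mean(points)]  # the 1-cluster solution is the centroid
    history = [objective(points, centers)]
    for _ in range(2, k_max + 1):
        # Current squared distance of each point to its nearest center.
        d = [min(sq_dist(p, c) for c in centers) for p in points]

        def gain(x: Point) -> float:
            # Decrease in the objective if x were added as a new center.
            return sum(max(di - sq_dist(p, x), 0.0) for p, di in zip(points, d))

        start = max(points, key=gain)  # assumption: greedy seed, not the paper's rule
        centers = lloyd(points, centers + [list(start)])
        history.append(objective(points, centers))
    return centers, history
```

Calling `incremental_kmeans(points, k_max)` returns the final centers together with the objective value recorded after each cluster count. For document data, the points would typically be term-weight vectors, and this quadratic-time seeding step would need to be replaced by something cheaper at the collection sizes mentioned in the description.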