Clustering Short Text and Its Evaluation
Author: | Prajol Shrestha, Christine Jacquin, Béatrice Daille |
---|---|
Publication year: | 2012 |
Subject: | Clustering high-dimensional data; DBSCAN; Fuzzy clustering; Computer science; Correlation clustering; Rand index; Single-linkage clustering; Conceptual clustering; computer.software_genre; Biclustering; CURE data clustering algorithm; Consensus clustering; Cluster analysis; k-medians clustering; Brown clustering; business.industry; Latent semantic analysis; Cosine similarity; Dendrogram; Pattern recognition; Spectral clustering; Hierarchical clustering; Data stream clustering; Canopy clustering algorithm; Vector space model; Affinity propagation; FLAME clustering; Artificial intelligence; Data mining; business; computer |
Source: | Computational Linguistics and Intelligent Text Processing, CICLing (2); ISBN: 9783642286001 |
DOI: | 10.1007/978-3-642-28601-8_15 |
Description: | Interest in clustering short text has grown recently because it can be used in many NLP applications. Depending on the application, short texts can be characterized mainly by their length (e.g. sentences, paragraphs) and type (e.g. scientific papers, newspapers). Finding a clustering method that works for short text in general is difficult. In this paper, we cluster four corpora containing different types of text of varying length and evaluate the results against the gold standard. Based on these clustering experiments, we show how different similarity measures, clustering algorithms, and cluster evaluation methods affect the resulting clusters. We discuss four existing corpus-based similarity methods (Cosine similarity, Latent Semantic Analysis, Short text Vector Space Model, and Kullback-Leibler distance), four well-known clustering methods (Complete Link, Single Link, and Average Link hierarchical clustering, and Spectral clustering), and three evaluation methods (clustering F-measure, adjusted Rand Index, and V-measure). Our experiments show that the choice of corpus-based similarity measure does not significantly affect the clusters and that spectral clustering performs better than hierarchical clustering. We also show that the values given by the evaluation methods do not always reflect the usability of the clusters. |
Database: | OpenAIRE |