Fine-grained document clustering via ranking and its application to social media analytics
Author: | Richi Nayak, Taufik Edy Sutanto |
---|---|
Year of publication: | 2018 |
Subject: |
Computer science; business.industry; Communication; Big data; Unstructured data; 02 engineering and technology; Document clustering; computer.software_genre; Social media analytics; Computer Science Applications; Ranking (information retrieval); Human-Computer Interaction; Set (abstract data type); ComputingMethodologies_PATTERNRECOGNITION; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Media Technology; Unsupervised learning; 020201 artificial intelligence & image processing; Data mining; Cluster analysis; business; computer; Information Systems |
Source: | Social Network Analysis and Mining. 8 |
ISSN: | 1869-5469 1869-5450 |
DOI: | 10.1007/s13278-018-0508-z |
Description: | Extracting valuable insights from a large volume of unstructured data such as texts through clustering analysis is paramount to many big data applications. However, document clustering is challenged by the computational complexity of the underlying methods and the high dimensionality of data, especially when the number of required clusters is large. A fine-grained clustering solution is required to understand a data set that represents heterogeneous topics, such as social media data. This paper presents the Fine-Grained document Clustering via Ranking (FGCR) approach, which leverages the search engine's capability of handling big data efficiently. Ranking scores from a search engine are used to calculate dynamic cluster representations, called loci, in an unsupervised learning setting. Clustering decisions are made efficiently based on an optimal selection from a small subset of loci instead of the entire cluster set, as in conventional centroid-based clustering. A comprehensive empirical study on several social media data sets shows that FGCR is able to produce insightful and accurate fine-grained solutions. Moreover, it is orders of magnitude faster and requires fewer computational resources than other state-of-the-art document clustering approaches. |
Database: | OpenAIRE |
External link: |
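
The description above gives only the outline of the method: ranking scores select a small set of candidate clusters, and each document is compared against those candidates rather than against every cluster. The following is a minimal Python sketch of that general idea, not the FGCR algorithm itself. Everything in it is an assumption made for illustration: a TF-IDF cosine score stands in for the search engine's ranking score, and the function names (`build_index`, `rank`, `cluster_by_ranking`) and the parameters `top_k` and `min_score` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for a real analyzer)."""
    return text.lower().split()


def build_index(docs):
    """Return one L2-normalised TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF, always positive.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def rank(query_vec, vectors, exclude, top_k):
    """Return the top_k (score, doc_id) pairs by cosine score, skipping `exclude`."""
    scores = []
    for i, vec in enumerate(vectors):
        if i == exclude:
            continue
        s = sum(w * vec.get(t, 0.0) for t, w in query_vec.items())
        if s > 0:
            scores.append((s, i))
    scores.sort(key=lambda p: (-p[0], p[1]))
    return scores[:top_k]


def cluster_by_ranking(docs, top_k=3, min_score=0.1):
    """Assign each document to the best-scoring candidate cluster.

    Assumption: candidate clusters are only those that contain one of the
    top-ranked, already-labelled documents, so the number of clusters
    compared per document is bounded by top_k, not by the total cluster count.
    """
    vectors = build_index(docs)
    labels = {}
    next_label = 0
    for i, vec in enumerate(vectors):
        candidate_scores = defaultdict(float)
        for score, j in rank(vec, vectors, i, top_k):
            if j in labels and score >= min_score:
                candidate_scores[labels[j]] += score
        if candidate_scores:
            labels[i] = max(candidate_scores, key=candidate_scores.get)
        else:
            # No sufficiently similar labelled neighbour: open a new cluster.
            labels[i] = next_label
            next_label += 1
    return [labels[i] for i in range(len(docs))]
```

In this sketch the cost of each clustering decision depends on `top_k` rather than on the total number of clusters, which is the property the description attributes to selecting from a small subset of loci. A real system would presumably delegate `rank` to a search engine index instead of scanning all vectors.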