I-TWEC: Interactive clustering tool for Twitter
Authors: | Yucel Saygin, İnanç Arın, Mert Kemal Erpam |
---|---|
Year of publication: | 2018 |
Subject: |
Information retrieval; Computer science; Suffix tree; General Engineering; 02 engineering and technology; Document clustering; Suffix tree clustering; Computer Science Applications; Longest common substring problem; law.invention; Data set; Longest common subsequence problem; Semantic similarity; Artificial Intelligence; law; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Social media; Cluster analysis |
Source: | Expert Systems with Applications. 96:1-13 |
ISSN: | 0957-4174 |
DOI: | 10.1016/j.eswa.2017.11.055 |
Description: | Social media provides a medium for people to express themselves on different issues. Twitter has gained considerable popularity in the past decade as a social media platform where people create micro-blogs, providing a valuable data source for understanding trends and public opinion. However, the volume of tweets, even on specific topics, may reach millions, creating a major challenge for the analyst. Clustering is a technique that can be used to better understand such large volumes of data. The main idea of clustering is to group similar tweets into batches in order to find patterns, to summarize, and to compress a large dataset. Though clustering is a natural technique for the analysis of tweets, there is no clustering tool specifically designed for Twitter data that utilizes lexical and semantic similarities and that can be readily used by non-technical experts such as social scientists. I-TWEC is a web-based tweet clustering tool where users can upload their data; the resulting clusters are presented with different visualizations, which further enable the user to interactively select and merge clusters based on their semantic similarity. I-TWEC has lexical and semantic clustering components implemented as two consecutive phases. For the lexical clustering of tweets, Longest Common Subsequence is a widely accepted similarity metric; however, it is also very costly and therefore not applicable to large data sets such as the ones collected through Twitter. To overcome that challenge, we have implemented a suffix-tree-based index structure in I-TWEC to efficiently cluster tweets based on Longest Common Substring similarity, which is an approximation of the Longest Common Subsequence. The experiments we have conducted show that the lexical clustering phase of I-TWEC can produce results with comparable clustering quality in a fraction of the time required by the baseline methods, which use Longest Common Subsequence and Suffix Tree.
We have also experimented with k-means document clustering as well as a state-of-the-art word-based suffix tree clustering algorithm, and the results show that I-TWEC outperforms the state of the art in terms of time with comparable clustering quality. |
Database: | OpenAIRE |
External link: |
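
The description contrasts two lexical similarity measures: Longest Common Subsequence and its cheaper approximation, Longest Common Substring. The sketch below illustrates only the difference between the two measures on a pair of short strings. It uses plain dynamic programming rather than the suffix-tree index described for I-TWEC, and the function names and the normalization by the longer string's length are assumptions made for illustration, not details taken from the paper.

```python
def lcs_subsequence_len(a: str, b: str) -> int:
    """Length of the longest common subsequence (characters need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_substring_len(a: str, b: str) -> int:
    """Length of the longest common substring (characters must be contiguous)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            # A contiguous run is extended on a match and reset on a mismatch.
            run = prev[j - 1] + 1 if ch_a == ch_b else 0
            curr.append(run)
            best = max(best, run)
        prev = curr
    return best


def similarity(a: str, b: str, length_fn) -> float:
    """Hypothetical normalization: common length divided by the longer input's length."""
    longest = max(len(a), len(b))
    return length_fn(a, b) / longest if longest else 1.0


if __name__ == "__main__":
    t1, t2 = "hello world", "yellow"
    print(lcs_subsequence_len(t1, t2), lcs_substring_len(t1, t2))
```

Because every common substring is also a common subsequence, the substring length never exceeds the subsequence length, which is consistent with the description's framing of the former as an approximation of the latter. Both functions here run in quadratic time per pair, so they do not reflect the speed-up the description attributes to the suffix-tree index.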