I-TWEC: Interactive clustering tool for Twitter

Authors: Yucel Saygin, İnanç Arın, Mert Kemal Erpam
Year of publication: 2018
Subject:
Source: Expert Systems with Applications. 96:1-13
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2017.11.055
Description: Social media provides a medium for people to express themselves on different issues. Twitter has gained a lot of popularity in the past decade as a social media platform where people create micro-blogs, providing a valuable data source for understanding trends and public opinion. However, the volume of tweets, even on specific topics, may reach millions, creating a big challenge for the analyst. Clustering is a technique that can be utilized to better understand such large volumes of data. The main idea of clustering is to group similar tweets into batches in order to find patterns, to summarize, and to compress a large dataset. Though clustering is a natural technique for the analysis of tweets, there is no clustering tool specifically designed for Twitter data that utilizes lexical and semantic similarities and that can be readily used by non-technical experts such as social scientists. I-TWEC is a web-based tweet clustering tool where users can upload their data; the resulting clusters are presented with different visualizations, which further enable the user to interactively select and merge clusters based on their semantic similarity. I-TWEC implements its lexical and semantic clustering components as two consecutive phases. For the lexical clustering of tweets, Longest Common Subsequence is a widely accepted similarity metric; however, it is also very costly, and therefore not applicable to large data sets such as the ones collected through Twitter. To overcome that challenge, we have implemented a suffix-tree-based index structure in I-TWEC to efficiently cluster tweets based on the Longest Common Substring similarity, which is an approximation of the Longest Common Subsequence. Experiments we have conducted show that the lexical clustering phase of I-TWEC can produce results with comparable clustering quality in a fraction of the time required by the baseline methods, which use Longest Common Subsequence and Suffix Tree. We have also experimented with k-means document clustering as well as a state-of-the-art word-based suffix tree clustering algorithm, and the results show that I-TWEC outperforms the state of the art in terms of time with comparable clustering quality.
Database: OpenAIRE
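
The description contrasts two lexical measures: Longest Common Subsequence (matching characters may be separated by gaps) and Longest Common Substring (matching characters must be contiguous). The following is a minimal sketch of that difference using plain dynamic programming. It is not the suffix-tree index described for I-TWEC, and the function names and the length normalization in `substring_similarity` are assumptions made for illustration only.

```python
def lcs_subsequence_len(a: str, b: str) -> int:
    """Length of the longest common subsequence (gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_substring_len(a: str, b: str) -> int:
    """Length of the longest common substring (contiguous match only)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            # A run of matches continues only if the previous characters matched too.
            run = prev[j - 1] + 1 if ch_a == ch_b else 0
            curr.append(run)
            best = max(best, run)
        prev = curr
    return best


def substring_similarity(a: str, b: str) -> float:
    """Hypothetical normalization: common substring length over the longer text."""
    longest = max(len(a), len(b))
    return lcs_substring_len(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    t1 = "breaking news about the election results"
    t2 = "news about the election just in"
    print(lcs_subsequence_len(t1, t2), lcs_substring_len(t1, t2))
    print(round(substring_similarity(t1, t2), 3))
```

The common substring can never be longer than the common subsequence, which is why the former can serve as a cheaper approximation of the latter. Both functions above still take time proportional to the product of the two input lengths for a single pair of texts; an index structure such as the suffix tree mentioned in the description is what makes substring matching practical across a large collection.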