Corpus-based Topic Derivation and Timestamp-based Popular Hashtag Prediction in Twitter

Author: Sharath Kumar B R, 尚庫柏
Year of publication: 2016
Document type: thesis
Description: 105
With the spread of the Internet, mobile platforms, online commerce, and social media services, the footprints of human behavior are easily recorded in the digital world, generating data on an extremely large scale. Twitter, as a big-data social network, has become one of the most important sources for capturing up-to-date events happening in the world. Deriving topics from Twitter is important for various applications, such as situation awareness, market analysis, content filtering, and recommendation. However, topic derivation with high purity is hard to achieve in Twitter because tweets are limited to 140 characters, and previous work on topic derivation in Twitter suffers from low purity. In this thesis, we propose a corpus-based topic derivation (CTD) approach that combines a Twitter corpus with Latent Feature LDA (LF-LDA), a text processing model, to identify topics and clusters of similar hashtags. We use asymmetric topic LF-LDA to obtain better topic purity. Compared to intJNMF, a representative related work, the purity (F-measure) of our proposed CTD increases from 5.26% (27.81%) to 11.32% (34.28%) for 20 to 100 topics. We also propose a timestamp-based popular hashtag prediction (TPHP) approach that makes use of the timestamps in tweets to create trending hashtag lists (THLs), which are lists of hashtags used by many users. We use the edit distance to measure the difference between consecutive THLs. This difference is then used to calculate volatility, which indicates how people react to real-world events. Compared to Hybrid+, a representative related work, the mean average precision of our TPHP increases by 19.45% (week-day), 15.08% (week-week), and 16.95% (month-week).
Database: Networked Digital Library of Theses & Dissertations
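
The abstract describes measuring the difference between consecutive trending hashtag lists (THLs) with the edit distance. The following is a minimal sketch of that one step, not the thesis's implementation. It assumes each THL is an ordered list of hashtag strings and that the edit distance is the standard Levenshtein distance with each hashtag treated as a single symbol. The function names and the sample hashtags are hypothetical. The abstract does not give the volatility formula, so it is not reproduced here.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences.

    Assumption: each hashtag counts as one symbol, so inserting,
    deleting, or substituting a hashtag costs 1.
    """
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def thl_differences(thls: List[List[str]]) -> List[int]:
    """Edit distance between each pair of consecutive THLs."""
    return [edit_distance(p, q) for p, q in zip(thls, thls[1:])]


# Hypothetical THLs for three consecutive time windows.
thls = [
    ["#a", "#b", "#c"],
    ["#a", "#c", "#d"],
    ["#e", "#f", "#g"],
]
diffs = thl_differences(thls)
print(diffs)
```

In this sketch a larger value means the trending list changed more between two windows. How the thesis turns these differences into a volatility score is not stated in the abstract.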