Popis: |
We first estimate the number of Italian users active on Twitter in the last year by filtering the Italian flow of Twitter. We show that our filter misses about the 6.86% of the Italian flow, while 86.80% of the selected tweets belongs to the Italian language. Given this accuracy of the Italian Twitter's Firehose filter, we are able to assess the actual number of the Italian active users (AUs) of this platform. We then introduce a massive text document clustering algorithm that is easily applicable and scalable to the Twitter social network. Instead of a topic modeling approach based on features selection and any conventional clustering algorithm, such as LDA, we apply community detection algorithms on the weighted hashtag graph . In order to scale with the graph size, we apply two linear community detection algorithms, CoDA and Louvain. Once the hashtags have been assigned to clusters, both the most numerous clusters and hashtags were associated with topics of general interest, such as sports, politics, health etc. In this way we are able to provide significant statistics of the topics covered on Twitter in the past year. |