Detecting Spam Tweets using Character N-gram Features
Authors: | Mokhtar Ashour, Cherif Salama, M. Watheq El-Kharashi |
---|---|
Year of publication: | 2018 |
Subject: |
n-gram
Information retrieval; Computer science; Character (computing); Information systems; Feature extraction; Electrical engineering, electronic engineering, information engineering; Feature (machine learning); Artificial intelligence & image processing; Engineering and technology; Latency (engineering); Popularity; Word (computer architecture) |
Source: | 2018 13th International Conference on Computer Engineering and Systems (ICCES). |
DOI: | 10.1109/icces.2018.8639297 |
Description: | Twitter's popularity has made it an important and instantaneous source of news and trending events around the world. It has also attracted spammers, who post malicious content embedded in tweets and in their profile pages. Spammers use varied and evolving techniques to evade traditional security mechanisms, which creates the need for robust solutions that adapt to these techniques. In this paper, we propose a low-level character n-gram feature that avoids tokenizers and any language-dependent tools. Using a publicly available dataset, we evaluate the performance of multiple machine learning classifiers with different representations of the proposed feature. Our experiments show that our approach improves on approaches that use word n-grams from tweet tokens. We also show that our technique can detect spam tweets with low latency, which is crucial in a real-time environment like Twitter. |
Database: | OpenAIRE |
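
The abstract contrasts character n-grams, which need no tokenizer, with word n-grams built from tweet tokens. The following is a minimal sketch of that contrast, not the authors' implementation: the function names, the n-gram lengths, and the two example tweets are all assumptions made for illustration, and the record does not specify the feature representations or classifiers used in the paper.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count every contiguous character substring of length n_min..n_max.

    Works on the raw string, so no tokenizer or language-specific tool
    is involved. (Hypothetical helper; the default range is an assumption.)
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text, n=1):
    """Count word n-grams using a plain whitespace split as the tokenizer."""
    tokens = text.split()
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Two made-up tweets: the second obfuscates the first with digit swaps.
tweet_a = "FREE followers!!! click here"
tweet_b = "FR33 f0llowers!!! cl1ck here"

# Character trigrams present in both tweets despite the obfuscation.
shared_chars = set(char_ngrams(tweet_a, 3, 3)) & set(char_ngrams(tweet_b, 3, 3))

# Word unigrams present in both tweets.
shared_words = set(word_ngrams(tweet_a)) & set(word_ngrams(tweet_b))

print(sorted(shared_chars))
print(sorted(shared_words))
```

In this toy case the obfuscated tweet keeps many character trigrams in common with the original while sharing only a single whole-word token, which illustrates why a character-level feature may be harder for spammers to evade. The resulting counts could be fed to any standard classifier as a sparse feature vector.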