An evaluation of statistical spam filtering techniques

Author:	Le Zhang, Tianshun Yao, Jingbo Zhu
Year of publication:	2004
Subject:	General Computer Science; Computer science; Supervised learning; Feature selection; Pattern recognition; Machine learning; Support vector machine; Naive Bayes classifier; Feature Dimension; Bag-of-words model; AdaBoost; Feature hashing; Artificial intelligence
Source:	ACM Transactions on Asian Language Information Processing. 3:243-269
ISSN:	1558-3430 1530-0226
DOI:	10.1145/1039621.1039625
Description:	This paper evaluates five supervised learning methods in the context of statistical spam filtering. We study the impact of different feature pruning methods and feature set sizes on each learner's performance using cost-sensitive measures. We observe that the significance of feature selection varies greatly from classifier to classifier. In particular, we found that support vector machines, AdaBoost, and the maximum entropy model are the top performers in this evaluation and share similar characteristics: they are insensitive to the feature selection strategy, scale easily to very high feature dimensions, and perform well across different datasets. In contrast, naive Bayes, a commonly used classifier in spam filtering, is found to be sensitive to the feature selection method on small feature sets, and fails to function well in scenarios where false positives are penalized heavily. The experiments also suggest that aggressive feature pruning should be avoided when building filters for applications where legitimate mail is assigned a cost much higher than spam (such as λ = 999), so as to maintain better-than-baseline performance. An interesting finding is the effect of mail headers on spam filtering, which has often been ignored in previous studies. Experiments show that classifiers using features from the message header alone can achieve comparable or better performance than filters using body features only. This implies that message headers can be reliable and powerfully discriminative feature sources for spam filtering.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::551e667bcc05603d6e7795be097ee615 https://doi.org/10.1145/1039621.1039625
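
The description mentions cost-sensitive measures, a cost factor λ = 999, and a baseline, but does not define them. The following is a minimal sketch, assuming the total cost ratio (TCR) that is common in the spam filtering literature: TCR = N_S / (λ · n_L→S + n_S→L), where N_S is the number of spam messages, n_L→S the legitimate messages flagged as spam, and n_S→L the spam messages that get through. Under that assumption, TCR > 1 means the filter beats the no-filter baseline. The function name and all counts are hypothetical and are not taken from the paper.

```python
def total_cost_ratio(n_spam, legit_as_spam, spam_as_legit, lam):
    """Assumed TCR definition: N_S / (lam * n_{L->S} + n_{S->L}).

    Values above 1 mean the filter costs less than using no filter at all.
    """
    cost = lam * legit_as_spam + spam_as_legit
    if cost == 0:
        return float("inf")  # a perfect filter incurs no cost
    return n_spam / cost


if __name__ == "__main__":
    # Hypothetical test set: 500 spam messages, 20 of which slip through.
    for lam in (1, 9, 999):
        clean = total_cost_ratio(500, 0, 20, lam)      # no false positives
        one_fp = total_cost_ratio(500, 1, 20, lam)     # one false positive
        print(f"lambda={lam}: TCR without FP={clean:.2f}, with one FP={one_fp:.2f}")
```

With these made-up counts, a single false positive at λ = 999 is enough to push the ratio below 1, which illustrates why a filter that rarely misclassifies legitimate mail matters so much in that setting.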