An evaluation of statistical spam filtering techniques
Author: | Le Zhang, Tianshun Yao, Jingbo Zhu |
---|---|
Year of publication: | 2004 |
Subject: |
General Computer Science
Computer science, Supervised learning, Feature selection, Pattern recognition, Machine learning, Support vector machine, Naive Bayes classifier, Feature dimension, Bag-of-words model, AdaBoost, Feature hashing, Artificial intelligence |
Source: | ACM Transactions on Asian Language Information Processing. 3:243-269 |
ISSN: | 1558-3430 1530-0226 |
DOI: | 10.1145/1039621.1039625 |
Description: | This paper evaluates five supervised learning methods in the context of statistical spam filtering. We study the impact of different feature pruning methods and feature set sizes on each learner's performance using cost-sensitive measures. It is observed that the significance of feature selection varies greatly from classifier to classifier. In particular, we found that support vector machines, AdaBoost, and the maximum entropy model are the top performers in this evaluation, sharing similar characteristics: insensitivity to the feature selection strategy, easy scalability to very high feature dimensions, and good performance across different datasets. In contrast, naive Bayes, a commonly used classifier in spam filtering, is found to be sensitive to feature selection methods on small feature sets and fails to function well in scenarios where false positives are penalized heavily. The experiments also suggest that aggressive feature pruning should be avoided when building filters for applications where misclassified legitimate mail is assigned a much higher cost than misclassified spam (such as λ = 999), so as to maintain better-than-baseline performance. An interesting finding is the effect of mail headers on spam filtering, which was often ignored in previous studies. Experiments show that classifiers using features from the message header alone can achieve comparable or better performance than filters using body features only. This implies that message headers can be reliable and powerfully discriminative feature sources for spam filtering. |
Database: | OpenAIRE |
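
The abstract refers to cost-sensitive measures and to a cost setting of λ = 999, but this record does not say which measure the paper uses. As a minimal sketch, assuming the total cost ratio (TCR) that is common in the spam-filtering literature, the following shows why a very high λ makes it hard to stay above the no-filter baseline. The function name and all message counts are hypothetical and chosen only for illustration.

```python
def total_cost_ratio(n_spam, false_positives, false_negatives, lam):
    """Total cost ratio (TCR), an assumed cost-sensitive measure.

    n_spam          -- number of spam messages in the test set
    false_positives -- legitimate messages wrongly flagged as spam
    false_negatives -- spam messages that slipped through the filter
    lam             -- cost of one false positive relative to one false negative

    Without a filter, every spam message gets through, so the baseline
    cost is n_spam. TCR > 1 means the filter beats that baseline.
    """
    cost = lam * false_positives + false_negatives
    if cost == 0:
        return float("inf")  # a perfect filter has no cost at all
    return n_spam / cost


# Hypothetical test set: 500 spam messages, 50 missed, 1 false positive.
for lam in (1, 9, 999):
    tcr = total_cost_ratio(500, false_positives=1, false_negatives=50, lam=lam)
    verdict = "better than baseline" if tcr > 1 else "worse than baseline"
    print(f"lambda={lam}: TCR={tcr:.3f} ({verdict})")
```

Under these assumed counts, a single false positive is enough to push the filter below the baseline at λ = 999, which is consistent with the abstract's warning about applications that penalize false positives heavily.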