Popis: |
Even as the web 2.0 grows, e-mail continues to be one of the most used forms of communication in the Internet, being responsible for the generation of huge amounts of data. Spam traffic, for example, accounts for terabytes of data daily. It becomes necessary to create tools that are able to process these data efficiently, in large volumes, in order to understand their characteristics. Although mail servers are able to receive and store messages as they arrive, applying complex algorithms to a large set of mailboxes, either for characterization, security reasons or for data mining goals is challenging. Big data processing environments such as Hadoop are useful for the analysis of large data sets, although originally designed to handle text files in general. In this paper we present a Hadoop extension used to process and analyze large sets of e-mail, organized in mailboxes. To evaluate it, we used gigabytes of real spam traffic data collected around the world and we showed that our approach is efficient to process large amounts of mail data. |