Hierarchical Label Propagation and Discovery for Machine Generated Email

Authors: Michael Bendersky, Sujith Ravi, James B. Wendt, Marc-Allen Cartright, Lluis Garcia-Pueyo, Amitabh Saikia, Jie Yang, Balint Miklos, Ivo Krka, Vanja Josifovski
Publication year: 2016
Subject:
Source: WSDM
DOI: 10.1145/2835776.2835780
Description: Machine-generated documents such as email or dynamic web pages are single instantiations of a pre-defined structural template. As such, they can be viewed as a hierarchy of template-specific and document-specific content. This hierarchical template representation has several important advantages for document clustering and classification. First, templates capture common topics among the documents while filtering out potentially noisy variability such as personal information. Second, template representations scale far better than document representations, since a single template captures numerous documents. Finally, since templates group together structurally similar documents, they can propagate properties between all the documents that match the template. In this paper, we use these advantages for document classification by formulating an efficient and effective hierarchical label propagation and discovery algorithm. The labels are propagated first over a template graph (constructed based on either term-based or topic-based similarities), and then to the matching documents. We evaluate the performance of the proposed algorithm using a large donated email corpus. We show that the resulting template graph is significantly more compact than the corresponding document graph, and that hierarchical label propagation is both efficient and effective in increasing the coverage of the baseline document classification algorithm. We demonstrate that the template label propagation achieves more than 91% precision and 93% recall, while increasing the label coverage by more than 11%.
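The two-stage idea in the description (propagate labels over a template graph, then pass them down to matching documents) can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a generic weighted-neighbor label propagation with clamped seed labels, and every name in it (`propagate_template_labels`, `label_documents`, the `threshold` parameter, the toy templates and labels) is hypothetical.

```python
from collections import defaultdict


def propagate_template_labels(edges, seeds, iterations=20):
    """Spread seed labels over an undirected, weighted template graph.

    edges: dict mapping (template_a, template_b) -> similarity weight
    seeds: dict mapping template -> known label (assumed to come from a
           baseline classifier; clamped on every iteration)
    Returns a dict mapping template -> {label: score}.
    """
    neighbors = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    # Seed templates start with a one-hot label distribution.
    scores = {t: {label: 1.0} for t, label in seeds.items()}

    for _ in range(iterations):
        updated = {}
        for t in neighbors:
            if t in seeds:
                updated[t] = {seeds[t]: 1.0}  # keep seed labels fixed
                continue
            acc = defaultdict(float)
            total = 0.0
            for n, weight in neighbors[t].items():
                total += weight
                for label, score in scores.get(n, {}).items():
                    acc[label] += weight * score
            if acc and total > 0:
                # Weighted average of the neighbors' label scores.
                updated[t] = {label: s / total for label, s in acc.items()}
        # Seeds with no edges still keep their label.
        for t, label in seeds.items():
            updated.setdefault(t, {label: 1.0})
        scores = updated
    return scores


def label_documents(doc_to_template, template_scores, threshold=0.5):
    """Second stage: each document inherits its template's top label."""
    labels = {}
    for doc, template in doc_to_template.items():
        dist = template_scores.get(template)
        if not dist:
            continue  # template never received a label
        label, score = max(dist.items(), key=lambda kv: kv[1])
        if score >= threshold:
            labels[doc] = label
    return labels


# Toy data (hypothetical): four templates, two with seed labels.
edges = {("t1", "t2"): 1.0, ("t2", "t3"): 0.2, ("t3", "t4"): 1.0}
seeds = {"t1": "receipt", "t4": "travel"}
template_scores = propagate_template_labels(edges, seeds)
doc_labels = label_documents(
    {"d1": "t2", "d2": "t3", "d3": "t5"}, template_scores
)
```

Because propagation runs over templates rather than individual documents, the graph stays small even when each template matches many documents; the threshold in the second stage is one simple way to trade coverage against precision.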
Database: OpenAIRE