Hierarchical Label Propagation and Discovery for Machine Generated Email
Author: | Michael Bendersky, Sujith Ravi, James B. Wendt, Marc-Allen Cartright, Lluis Garcia-Pueyo, Amitabh Saikia, Jie Yang, Balint Miklos, Ivo Krka, Vanja Josifovski |
---|---|
Year of publication: | 2016 |
Subject: | Information retrieval; Hierarchy (mathematics); Matching (graph theory); Computer science; Document classification; Dynamic web page; Document clustering; Template; Graph (abstract data type); Data mining; Representation (mathematics) |
Source: | WSDM |
DOI: | 10.1145/2835776.2835780 |
Description: | Machine-generated documents such as email or dynamic web pages are single instantiations of a pre-defined structural template. As such, they can be viewed as a hierarchy of template and document specific content. This hierarchical template representation has several important advantages for document clustering and classification. First, templates capture common topics among the documents, while filtering out the potentially noisy variabilities such as personal information. Second, template representations scale far better than document representations since a single template captures numerous documents. Finally, since templates group together structurally similar documents, they can propagate properties between all the documents that match the template. In this paper, we use these advantages for document classification by formulating an efficient and effective hierarchical label propagation and discovery algorithm. The labels are propagated first over a template graph (constructed based on either term-based or topic-based similarities), and then to the matching documents. We evaluate the performance of the proposed algorithm using a large donated email corpus and show that the resulting template graph is significantly more compact than the corresponding document graph and the hierarchical label propagation is both efficient and effective in increasing the coverage of the baseline document classification algorithm. We demonstrate that the template label propagation achieves more than 91% precision and 93% recall, while increasing the label coverage by more than 11%. |
Database: | OpenAIRE |
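
The two-stage idea in the description (labels spread over a template graph, then down to the documents matching each template) can be illustrated with a small sketch. This is not the authors' algorithm, whose details the record does not give; it is a generic weighted-neighbor label propagation. The function names, the `threshold` parameter, the edge weights, and the sample labels are all hypothetical.

```python
from collections import defaultdict


def propagate_template_labels(edges, seeds, iterations=20, threshold=0.5):
    """Spread seed labels over an undirected, weighted template graph.

    edges: {(template_a, template_b): similarity_weight}  (assumed input shape)
    seeds: {template: label} for templates that already have a label
    """
    neighbors = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    # scores[template][label] is a confidence in [0, 1]; seeds stay at 1.0.
    scores = {t: {label: 1.0} for t, label in seeds.items()}
    for _ in range(iterations):
        updated = {t: {label: 1.0} for t, label in seeds.items()}
        for t, nbrs in neighbors.items():
            if t in seeds:
                continue  # seed labels are clamped
            total = sum(nbrs.values())
            if total == 0:
                updated[t] = {}
                continue
            acc = defaultdict(float)
            # Weighted average of the neighbors' current label scores.
            for n, weight in nbrs.items():
                for label, score in scores.get(n, {}).items():
                    acc[label] += weight * score / total
            updated[t] = dict(acc)
        scores = updated

    # Keep only the best label per template, if it is confident enough.
    labels = {}
    for t, dist in scores.items():
        if dist:
            label, score = max(dist.items(), key=lambda kv: kv[1])
            if score >= threshold:
                labels[t] = label
    return labels


def label_documents(doc_to_template, template_labels):
    """Second stage: each document inherits the label of its template."""
    return {
        doc: template_labels[t]
        for doc, t in doc_to_template.items()
        if t in template_labels
    }


# Hypothetical toy data: four templates, two of them labeled in advance.
edges = {("t1", "t2"): 0.9, ("t2", "t3"): 0.1, ("t3", "t4"): 0.8}
seeds = {"t1": "travel", "t4": "finance"}
template_labels = propagate_template_labels(edges, seeds)
doc_labels = label_documents({"d1": "t2", "d2": "t3", "d3": "t9"}, template_labels)
```

Because propagation runs over templates rather than individual documents, the graph in such a setup stays small even when each template matches many documents, which is the scaling argument the description makes.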