Online template induction for machine-generated emails
Author: | James B. Wendt, Sandeep Tata, Marc Najork, Michael Whittaker, Nick Edmonds |
---|---|
Publication year: | 2019 |
Subject: | Set (abstract data type); Information extraction; Database; Computer science; Information systems; Electrical engineering, electronic engineering, information engineering; General Engineering; Key (cryptography); Artificial intelligence & image processing; Engineering and technology; Throughput (business) |
Source: | Proceedings of the VLDB Endowment. 12:1235-1248 |
ISSN: | 2150-8097 |
DOI: | 10.14778/3342263.3342264 |
Description: | In emails, information abounds. Whether it be a bill reminder, a hotel confirmation, or a shipping notification, our emails contain useful bits of information that enable a number of applications. Most of this email traffic is machine-generated, sent from a business to a human. These business-to-consumer emails are typically instantiated from a set of email templates, and discovering these templates is a key step in enabling a variety of intelligent experiences. Existing email information extraction systems typically separate information extraction into two steps: an offline template discovery process (called template induction) that is periodically run on a sample of emails, and an online email annotation process that applies discovered templates to emails as they arrive. Since information extraction requires an email's template to be known, any delay in discovering a newly created template causes missed extractions, lowering the overall extraction coverage. In this paper, we present a novel system called Crusher that discovers templates completely online, reducing template discovery delay from a week (for the existing MapReduce-based batch system) to minutes. Furthermore, Crusher has a resource consumption footprint that is significantly smaller than the existing batch system. We also report on the surprising lesson we learned that conventional stream processing systems do not present a good framework on which to build Crusher. Crusher delivers an order of magnitude more throughput than a prototype built using a stream processing engine. We hope that these lessons help designers of stream processing systems accommodate a broader range of applications like online template induction in the future. |
Database: | OpenAIRE |