Title
Online Template Induction for Machine-Generated Emails.
Abstract
In emails, information abounds. Whether it be a bill reminder, a hotel confirmation, or a shipping notification, our emails contain useful bits of information that enable a number of applications. Most of this email traffic is machine-generated, sent from a business to a human. These business-to-consumer emails are typically instantiated from a set of email templates, and discovering these templates is a key step in enabling a variety of intelligent experiences. Existing email information extraction systems typically separate information extraction into two steps: an offline template discovery process (called template induction) that is periodically run on a sample of emails, and an online email annotation process that applies discovered templates to emails as they arrive. Since information extraction requires an email's template to be known, any delay in discovering a newly created template causes missed extractions, lowering the overall extraction coverage. In this paper, we present a novel system called Crusher that discovers templates completely online, reducing template discovery delay from a week (for the existing MapReduce-based batch system) to minutes. Furthermore, Crusher has a resource consumption footprint that is significantly smaller than the existing batch system. We also report on the surprising lesson we learned that conventional stream processing systems do not present a good framework on which to build Crusher. Crusher delivers an order of magnitude more throughput than a prototype built using a stream processing engine. We hope that these lessons help designers of stream processing systems accommodate a broader range of applications like online template induction in the future.
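The contrast the abstract draws between periodic batch induction and online discovery can be illustrated with a toy sketch. This is not Crusher's algorithm, which the abstract does not describe. It assumes, purely for illustration, that emails from one template share a sender and a subject line once digit runs are masked, and that a template counts as "discovered" after a hypothetical `min_support` number of matching emails. The names `fingerprint`, `OnlineInducer`, and `observe` are invented for this example.

```python
import re
from collections import defaultdict


def fingerprint(sender, subject):
    """Hypothetical grouping key: sender plus subject with digit runs masked."""
    return sender, re.sub(r"\d+", "#", subject)


class OnlineInducer:
    """Toy online template discovery: one pass, state updated per email."""

    def __init__(self, min_support=3):
        # min_support is an assumed threshold, not a value from the paper.
        self.min_support = min_support
        self.counts = defaultdict(int)  # candidate key -> emails seen so far
        self.templates = set()          # keys promoted to known templates

    def observe(self, sender, subject):
        """Process one arriving email.

        Returns True if the email's template is known once this email has
        been processed, False if it is still only a candidate.
        """
        key = fingerprint(sender, subject)
        if key in self.templates:
            return True
        self.counts[key] += 1
        if self.counts[key] >= self.min_support:
            # Promote immediately instead of waiting for a periodic batch job.
            self.templates.add(key)
            del self.counts[key]
            return True
        return False


inducer = OnlineInducer(min_support=3)
for order_id in (101, 202, 303, 404):
    known = inducer.observe("shop@example.com", f"Order {order_id} has shipped")
    print(order_id, known)
```

In this sketch the template becomes usable on the very email that pushes its count to the threshold, whereas a batch design would leave it unknown until the next scheduled run. That gap is the discovery delay the abstract refers to.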
Year: 2019
DOI: 10.14778/3342263.3342264
Venue: PVLDB
Field: Resource consumption, Annotation, Computer science, Information extraction, Batch processing, Throughput, Template, Business process discovery, Stream processing, Database
DocType: Journal
Volume: 12
Issue: 11
ISSN: 2150-8097
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name                  Order  Citations  PageRank
Michael J. Whittaker  1      0          0.68
Nick Edmonds          2      0          0.68
Sandeep Tata          3      478        27.50
James Bradley Wendt   4      169        11.61
Marc A. Najork        5      2538       278.16