Title
Anatomy of a Privacy-Safe Large-Scale Information Extraction System Over Email
Abstract
Extracting structured data from emails can enable several assistive experiences, such as reminding the user when a bill payment is due, answering queries about the departure time of a booked flight, or proactively surfacing an emailed discount coupon while the user is at that store. This paper presents Juicer, a system for extracting information from email that is serving over a billion Gmail users daily. We describe how the design of the system was informed by three key principles: scaling to a planet-wide email service, isolating the complexity to provide a simple experience for the developer, and safeguarding the privacy of users (our team and the developers we support are not allowed to view any single email). We describe the design tradeoffs made in building this system, the challenges faced and the approaches used to tackle them. We present case studies of three extraction tasks implemented on this platform (bill reminders, commercial offers, and hotel reservations) to illustrate the effectiveness of the platform despite challenges unique to each task. Finally, we outline several areas of ongoing research in large-scale machine-learned information extraction from email.
Year: 2018
DOI: 10.1145/3219819.3219901
Venue: KDD
Keywords: Information extraction, wrapper induction, email, document classification
Field: Document classification, Coupon, World Wide Web, Computer science, Information extraction, Artificial intelligence, Safeguarding, Payment, Data model, Machine learning
DocType: Conference
ISBN: 978-1-4503-5552-0
Citations: 0
PageRank: 0.34
References: 32
Authors: 6
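Ying Sheng: Order 1, Citations 2, PageRank 2.08
Sandeep Tata: Order 2, Citations 478, PageRank 27.50
James Wendt: Order 3, Citations 1, PageRank 1.70
Jing Xie: Order 4, Citations 0, PageRank 0.34
Qi Zhao: Order 5, Citations 54, PageRank 20.12
Marc A. Najork: Order 6, Citations 2538, PageRank 278.16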