Title
RiSER: Learning Better Representations for Richly Structured Emails
Abstract
Recent studies show that an overwhelming majority of emails are machine-generated and sent by businesses to consumers. Many large email services are interested in extracting structured data from such emails to enable intelligent assistants. This enables experiences such as answering questions like “What is the address of my hotel in New York?” or “When does my flight leave?”. A high-quality email classifier is a critical piece of such a system. In this paper, we argue that the rich formatting used in business-to-consumer emails contains valuable information that can be used to learn better representations. Most existing methods focus only on textual content and ignore the rich HTML structure of emails. We introduce RiSER (Richly Structured Email Representation), an approach for incorporating both the structure and content of emails. RiSER projects the email into a vector representation by jointly encoding the HTML structure and the words in the email. We then use this representation to train a classifier. To our knowledge, this is the first description of a neural technique for combining formatting information with content to learn improved representations for richly formatted emails. Experimenting with a large corpus of emails received by users of Gmail, we show that RiSER outperforms strong attention-based LSTM baselines. We expect that these benefits will extend to other corpora with richly formatted documents. We also demonstrate, with examples, cases where leveraging HTML structure leads to better predictions.
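To make the idea of pairing HTML structure with words concrete, here is a minimal sketch of one way to turn an HTML email into (tag path, words) pairs. It is not the RiSER implementation: the abstract does not specify how structure is extracted, so the slash-joined tag path, the class name `TagPathExtractor`, and the function `extract_structure_and_words` are all assumptions made for illustration. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathExtractor(HTMLParser):
    """Hypothetical helper: collects (tag_path, words) for each text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.pairs = []   # list of (tag_path, [words])

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        words = data.split()
        if words:  # skip whitespace-only text nodes
            self.pairs.append(("/".join(self.stack), words))


def extract_structure_and_words(html):
    """Return a list of (tag_path, words) pairs for the given HTML string."""
    parser = TagPathExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Toy email body (invented for illustration, not taken from the paper's corpus).
example = ("<html><body><table><tr><td><b>Hotel</b></td>"
           "<td>123 Main St<br>New York</td></tr></table></body></html>")
pairs = extract_structure_and_words(example)
for path, words in pairs:
    print(path, words)
```

Each pair keeps the formatting context (for example, that "Hotel" sits inside a bold table cell) next to the words themselves, which is the kind of input a model that jointly encodes structure and content could consume. How such pairs would be embedded and combined is left open here.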
Year: 2019
DOI: 10.1145/3308558.3313720
Venue: WWW '19: The Web Conference on The World Wide Web Conference WWW 2019
Keywords: Email Classification, Email Representation, HTML Structure Encoding
Field: World Wide Web, Computer science, Email classification, Disk formatting, Classifier (linguistics), Data model, Encoding (memory)
DocType: Conference
ISBN: 978-1-4503-6674-8
Citations: 1
PageRank: 0.35
References: 0
Authors: 7
Name                   Order  Citations  PageRank
Furkan Kocayusufoglu   1      1          1.03
Ying Sheng             2      2          2.08
Nguyen Vo              3      18         1.93
James Wendt            4      1          0.69
Qi Zhao                5      2          0.70
Sandeep Tata           6      478        27.50
Marc A. Najork         7      2538       278.16