Title
Long-tail Vocabulary Dictionary Extraction from the Web.
Abstract
A dictionary, a set of instances belonging to the same conceptual class, is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (i.e., rare) items with high accuracy and recall. In this paper, we develop a novel method, which we call Lyretail, to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail items by building and executing high-quality webpage-specific extractors. We use webpage-specific structural and textual information to build more accurate per-page extractors in order to detect the long-tail items from a single webpage. These webpage-specific extractors are obtained via a co-training procedure using distantly-supervised training data. By aggregating the page-specific dictionaries of many webpages, Lyretail is able to output a high-quality comprehensive dictionary. Our experiments demonstrate that in long-tail vocabulary settings, we obtain a 17.3% improvement in mean average precision for the dictionary generation process, and a 30.7% improvement in F1 for the page-specific extraction, when compared to previous state-of-the-art methods.
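The aggregation step mentioned in the abstract (merging many page-specific dictionaries into one ranked dictionary) can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `aggregate_dictionaries`, the seed-overlap page weight, and the toy data are all assumptions made for illustration. The idea shown is only that a rare item can still rank well when it comes from a page that also contains several known seeds.

```python
from collections import defaultdict


def aggregate_dictionaries(page_dictionaries, seeds):
    """Merge per-page item lists into one ranked dictionary.

    Assumption (not from the paper): a page is weighted by how many
    seeds it contains, and an item's score is the sum of the weights
    of the pages it was extracted from.
    """
    seed_set = {s.strip().lower() for s in seeds}
    scores = defaultdict(float)
    for items in page_dictionaries:
        # Normalize and de-duplicate so each item counts once per page.
        page_items = {i.strip().lower() for i in items if i.strip()}
        weight = len(page_items & seed_set)
        if weight == 0:
            continue  # a page with no seeds gives no evidence for the topic
        for item in page_items:
            scores[item] += weight
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical per-page extraction results for a "car makers" topic.
pages = [
    ["Toyota", "Honda", "Koenigsegg"],  # contains 2 seeds
    ["toyota", "Ford"],                 # contains 1 seed
    ["Apple", "Banana"],                # contains 0 seeds, ignored
]
ranked = aggregate_dictionaries(pages, seeds=["Toyota", "Honda"])
for item, score in ranked:
    print(item, score)
```

In this toy run the long-tail item "koenigsegg" appears on only one page but inherits that page's weight, so it ranks above "ford". A real system would presumably use extractor confidence rather than raw seed overlap.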
Year
2016
DOI
10.1145/2835776.2835778
Venue
WSDM
Keywords
Set expansion, information extraction, long-tail dictionary
Field
Data mining, Web page, Computer science, Textual information, Natural language processing, Artificial intelligence, Training set, Categorization, World Wide Web, Information retrieval, Information extraction, Set expansion, Vocabulary
DocType
Conference
Citations
17
PageRank
0.64
References
24
Authors
3
Name
Order
Citations
PageRank
Zhe Chen1833.28
Michael J. Cafarella22246144.15
H. V. Jagadish3344.65