Long-tail Vocabulary Dictionary Extraction from the Web. - Citegraph

Paper Info

Title
Long-tail Vocabulary Dictionary Extraction from the Web.

Abstract
A dictionary, a set of instances belonging to the same conceptual class, is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (i.e., rare) items with high accuracy and recall. In this paper, we develop a novel method, Lyretail, that constructs high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail items by building and executing high-quality webpage-specific extractors. We use webpage-specific structural and textual information to build more accurate per-page extractors that detect the long-tail items on a single webpage. These extractors are obtained via a co-training procedure using distantly supervised training data. By aggregating the page-specific dictionaries of many webpages, Lyretail outputs a high-quality, comprehensive dictionary. Our experiments demonstrate that in long-tail vocabulary settings, we obtain a 17.3% improvement in mean average precision for the dictionary generation process and a 30.7% improvement in F1 for page-specific extraction, compared to previous state-of-the-art methods.
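
The two metrics quoted in the abstract, mean average precision (for ranked dictionary output) and F1 (for the set of items extracted from one page), have standard textbook definitions. The sketch below illustrates those definitions only; it is not the paper's evaluation code. The function names and toy data are hypothetical, and it assumes average precision is normalized by the size of the gold set, which is one of several common variants.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Average precision of one ranked list against a gold set.

    Assumption: the sum of precision-at-hit values is divided by the
    size of the gold set (other variants divide by the number of hits).
    """
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(gold) if gold else 0.0


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of average precision over several (ranked list, gold set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ranked, gold) for ranked, gold in runs) / len(runs)


def f1_score(extracted: Set[str], gold: Set[str]) -> float:
    """Harmonic mean of precision and recall for an unordered extraction."""
    true_positives = len(extracted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
ranked_dictionary = ["a", "x", "b"]
gold_dictionary = {"a", "b"}
page_items = {"a", "b", "x"}
page_gold = {"a", "b", "c"}

ap = average_precision(ranked_dictionary, gold_dictionary)
f1 = f1_score(page_items, page_gold)
```

A ranked metric suits the dictionary-generation step because its output is an ordered candidate list, while a set metric suits per-page extraction, where each page yields an unordered set of items.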

Year	DOI	Venue
2016	10.1145/2835776.2835778	WSDM
Keywords	Field	DocType
Set expansion, information extraction, long-tail dictionary	Data mining, Web page, Computer science, Textual information, Natural language processing, Artificial intelligence, Training set, Categorization, World Wide Web, Information retrieval, Information extraction, Set expansion, Vocabulary	Conference
Citations	PageRank	References
17	0.64	24
Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Zhe Chen	1	83	3.28
Michael J. Cafarella	2	2246	144.15
H. V. Jagadish	3	34	4.65
