Title | ||
---|---|---|
WebSets: extracting sets of entities from the web using unsupervised information extraction |
Abstract | ||
---|---|---|
We describe an open-domain information extraction method for extracting concept-instance pairs from an HTML corpus. Most earlier approaches to this problem rely on combining clusters of distributionally similar terms and concept-instance pairs obtained with Hearst patterns. In contrast, our method relies on a novel approach for clustering terms found in HTML tables, and then assigning concept names to these clusters using Hearst patterns. The method can be efficiently applied to a large corpus, and experimental results on several datasets show that our method can accurately extract large numbers of concept-instance pairs. |
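
The second step named in the abstract, assigning a concept name to a cluster of terms using Hearst patterns, can be illustrated with a minimal sketch. This is not the authors' implementation: the single "such as" pattern, the function names `hearst_pairs` and `name_cluster`, and the majority-vote rule are all assumptions made for illustration, and the term clusters are taken as given rather than derived from HTML tables.

```python
import re
from collections import Counter

# Assumption: one simplified Hearst pattern, "<concept> such as <list of instances>".
# The instance list is crudely taken to run up to the next period or semicolon.
SUCH_AS = re.compile(r"(\w+)\s+such\s+as\s+([^.;]+)", re.IGNORECASE)

# Separators inside the instance list: ", and ", " and ", " or ", or a plain comma.
LIST_SEP = re.compile(r",?\s+(?:and|or)\s+|,\s*")


def hearst_pairs(text):
    """Return lowercase (concept, instance) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        concept = match.group(1).lower()
        for instance in LIST_SEP.split(match.group(2)):
            instance = instance.strip().lower()
            if instance:
                pairs.append((concept, instance))
    return pairs


def name_cluster(cluster, pairs):
    """Pick the concept that co-occurs with the most terms in the cluster.

    `cluster` is a set of lowercase terms (hypothetically, terms that share
    a table column). Returns None when no pair mentions any cluster term.
    """
    votes = Counter(concept for concept, instance in pairs if instance in cluster)
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    text = (
        "countries such as India, China and Japan. "
        "Cities such as Pittsburgh, Boston and Seattle."
    )
    pairs = hearst_pairs(text)
    print(name_cluster({"india", "japan"}, pairs))
    print(name_cluster({"boston", "seattle"}, pairs))
```

The sketch separates pattern matching from cluster naming so that either half could be swapped out; a realistic system would need many more patterns, noun-phrase chunking, and plural normalization for concept names.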
Year | DOI | Venue |
---|---|---|
2012 | 10.1145/2124295.2124327 | Proceedings of the fifth ACM international conference on Web search and data mining |
Keywords | DocType | Volume |
assigning concept name,open-domain information extraction method,distributionally similar term,large number,html table,large corpus,hearst pattern,concept-instance pair,unsupervised information extraction,clustering term,html corpus,information extraction,clustering,web mining | Conference | abs/1307.0261 |
Citations | PageRank | References |
47 | 1.24 | 23 |
Authors | ||
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Bhavana Bharat Dalvi | 1 | 201 | 17.31 |
William W. Cohen | 2 | 10178 | 1243.74 |
James P. Callan | 3 | 6237 | 833.28 |