Abstract | ||
---|---|---|
Disambiguating entity references by annotating them with unique ids from a catalog is a critical step in the enrichment of unstructured content. In this paper, we show that topic models, such as Latent Dirichlet Allocation (LDA) and its hierarchical variants, form a natural class of models for learning accurate entity disambiguation models from crowd-sourced knowledge bases such as Wikipedia. Our main contribution is a semi-supervised hierarchical model called Wikipedia-based Pachinko Allocation Model (WPAM) that exploits: (1) All words in the Wikipedia corpus to learn word-entity associations (unlike existing approaches that only use words in a small fixed window around annotated entity references in Wikipedia pages), (2) Wikipedia annotations to appropriately bias the assignment of entity labels to annotated (and co-occurring unannotated) words during model learning, and (3) Wikipedia's category hierarchy to capture co-occurrence patterns among entities. We also propose a scheme for pruning spurious nodes from Wikipedia's crowd-sourced category hierarchy. In our experiments with multiple real-life datasets, we show that WPAM outperforms state-of-the-art baselines by as much as 16% in terms of disambiguation accuracy. |
Year | DOI | Venue |
---|---|---|
2011 | 10.1145/2020408.2020574 | KDD |
Keywords | Field | DocType |
hierarchical topic model, disambiguating entity reference, wikipedia corpus, latent dirichlet allocation, wikipedia page, entity label, category hierarchy, wikipedia annotation, annotated entity reference, accurate entity disambiguation model, allocation model, hierarchical model, entity resolution, topic models, knowledge base | Data mining, Latent Dirichlet allocation, Computer science, Natural class, Pachinko allocation, Artificial intelligence, Natural language processing, Hierarchy, Hierarchical database model, Information retrieval, Exploit, Topic model, Model learning | Conference |
Citations | PageRank | References |
36 | 1.23 | 17 |
Authors | ||
5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Saurabh S. Kataria | 1 | 36 | 1.23 |
Krishnan S. Kumar | 2 | 48 | 1.85 |
Rajeev Rastogi | 3 | 6151 | 827.22 |
Prithviraj Sen | 4 | 837 | 38.24 |
Srinivasan H. Sengamedu | 5 | 180 | 9.21 |