Title
Entity disambiguation with hierarchical topic models
Abstract
Disambiguating entity references by annotating them with unique ids from a catalog is a critical step in the enrichment of unstructured content. In this paper, we show that topic models, such as Latent Dirichlet Allocation (LDA) and its hierarchical variants, form a natural class of models for learning accurate entity disambiguation models from crowd-sourced knowledge bases such as Wikipedia. Our main contribution is a semi-supervised hierarchical model called Wikipedia-based Pachinko Allocation Model (WPAM) that exploits: (1) all words in the Wikipedia corpus to learn word-entity associations (unlike existing approaches that only use words in a small fixed window around annotated entity references in Wikipedia pages), (2) Wikipedia annotations to appropriately bias the assignment of entity labels to annotated (and co-occurring unannotated) words during model learning, and (3) Wikipedia's category hierarchy to capture co-occurrence patterns among entities. We also propose a scheme for pruning spurious nodes from Wikipedia's crowd-sourced category hierarchy. In our experiments with multiple real-life datasets, we show that WPAM outperforms state-of-the-art baselines by as much as 16% in terms of disambiguation accuracy.
Year
2011
DOI
10.1145/2020408.2020574
Venue
KDD
Keywords
hierarchical topic model,disambiguating entity reference,wikipedia corpus,latent dirichlet allocation,wikipedia page,entity label,category hierarchy,wikipedia annotation,annotated entity reference,accurate entity disambiguation model,allocation model,hierarchical model,entity resolution,topic models,knowledge base
Field
Data mining,Latent Dirichlet allocation,Computer science,Natural class,Pachinko allocation,Artificial intelligence,Natural language processing,Hierarchy,Hierarchical database model,Information retrieval,Exploit,Topic model,Model learning
DocType
Conference
Citations
36
PageRank
1.23
References
17
Authors
5
Name                      Order  Citations  PageRank
Saurabh S. Kataria        1      36         1.23
Krishnan S. Kumar         2      48         1.85
Rajeev Rastogi            3      6151       827.22
Prithviraj Sen            4      837        38.24
Srinivasan H. Sengamedu   5      180        9.21