Title
Scalable knowledge harvesting with high precision and high recall
Abstract
Harvesting relational facts from Web sources has received great attention for automatically constructing large knowledge bases. State-of-the-art approaches combine pattern-based gathering of fact candidates with constraint-based reasoning. However, they still face major challenges regarding the trade-offs between precision, recall, and scalability. Techniques that scale well are susceptible to noisy patterns that degrade precision, while techniques that employ deep reasoning for high precision cannot cope with Web-scale data. This paper presents a scalable system, called PROSPERA, for high-quality knowledge harvesting. We propose a new notion of ngram-itemsets for richer patterns, and use MaxSat-based constraint reasoning on both the quality of patterns and the validity of fact candidates. We compute pattern-occurrence statistics for two benefits: they serve to prune the hypotheses space and to derive informative weights of clauses for the reasoner. The paper shows how to incorporate these building blocks into a scalable architecture that can parallelize all phases on a Hadoop-based distributed platform. Our experiments with the ClueWeb09 corpus include comparisons to the recent ReadTheWeb experiment. We substantially outperform these prior results in terms of recall, with the same precision, while having low run-times.
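The abstract does not spell out how ngram-itemsets or pattern-occurrence statistics are computed, so the following is only a minimal sketch under stated assumptions, not the PROSPERA implementation. It assumes a pattern is the set of word n-grams in the phrase between two entities, and that a pattern's confidence is the fraction of its occurrences that connect a known seed pair. The function names, the threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict


def ngram_itemset(phrase, n=2):
    """Represent a phrase as the set of its word n-grams (assumed definition)."""
    words = phrase.lower().split()
    if len(words) < n:
        # Phrase shorter than n words: keep it as a single item.
        return frozenset([tuple(words)])
    return frozenset(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def pattern_confidence(occurrences, seeds):
    """Map each itemset to the share of its occurrences that link a seed pair.

    occurrences: iterable of (subject, phrase, object) triples.
    seeds: set of (subject, object) pairs assumed to be true facts.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for subj, phrase, obj in occurrences:
        items = ngram_itemset(phrase)
        total[items] += 1
        if (subj, obj) in seeds:
            hits[items] += 1
    return {items: hits[items] / total[items] for items in total}


def prune(confidences, threshold=0.5):
    """Drop patterns whose confidence falls below a (hypothetical) threshold."""
    return {p: c for p, c in confidences.items() if c >= threshold}


# Hypothetical toy data, invented for illustration only.
occurrences = [
    ("Alice", "graduated from", "MIT"),
    ("Bob", "graduated from", "Stanford"),
    ("Carol", "visited", "MIT"),
    ("Dave", "visited", "Stanford"),
]
seeds = {("Alice", "MIT"), ("Bob", "Stanford")}

kept = prune(pattern_confidence(occurrences, seeds))
```

In a sketch like this, the surviving confidence values could double as clause weights for a weighted MaxSat solver, which is one plausible reading of the two benefits the abstract names.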
Year
2011
DOI
10.1145/1935826.1935869
Venue
WSDM
Keywords
scalable knowledge, scalable architecture, degrade precision, constraint-based reasoning, relational fact, high recall, deep reasoning, large knowledge base, maxsat-based constraint reasoning, high precision, fact candidate, high-quality knowledge harvesting, scalability, knowledge base, information extraction
Field
Data mining, Computer science, Artificial intelligence, Constraint reasoning, Maximum satisfiability problem, Semantic reasoner, Scalable architecture, Information retrieval, Information extraction, Scalable system, Recall, Machine learning, Scalability
DocType
Conference
Citations
101
PageRank
3.57
References
29
Authors
3
Name                  Order  Citations  PageRank
Ndapandula Nakashole  1      394        19.48
Martin Theobald       2      1474       72.06
Gerhard Weikum        3      12710      2146.01