Title
Distilling and exploring nuggets from a corpus
Abstract
This paper describes a live and scalable system that automatically extracts information nuggets for entities/topics from a continuously updated corpus for effective exploration and analysis. A nugget is a piece of semantic information that (1) must be mapped semantically to the transitive closure of a pre-defined ontology, (2) is explicitly supported by text, and (3) has a natural language description that completely conveys its semantics to a user. Fig. 1 shows a type of nugget, "involvement in events", for a person entity (Leon Panetta): each nugget has a short description ("meeting", "news conference") with a list of supporting passages. Our key contributions are: (1) we extract nuggets and remove redundancy to produce a summary of salient information with supporting clusters of passages; (2) we present an entity/topic-centric exploration interface that also allows users to navigate to other entities involved in a nugget; (3) we use the statistical NLP technologies developed over the years in the ACE, GALE, and TAC-KBP programs, including parsing, mention detection, within- and cross-document coreference resolution, relation detection, and slot filler extraction; (4) our system is flexible and easily adaptable across domains, as demonstrated on two corpora: generic news and scientific papers. Search engines such as Google News and Scholar do not retrieve nuggets, and only remove redundancy at the document level. News aggregation applications such as Evri categorize news articles based on entities or topics but do not extract nuggets. Other systems extract richer information, but not all of it has clear semantics; e.g., Silobreaker presents results as "the relationship between X and Y in the context of [keyphrase]", leaving users to interpret the semantics because the result is not tied to a clear ontology. In contrast, we remove redundancy, summarize results, and present nuggets that have clear semantics.
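The nugget structure described in the abstract (an entity, a nugget type, a short description, and a list of supporting passages) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Nugget` class, the `group_nuggets` function, and the exact-match grouping on a lowercased description are hypothetical and are not the authors' method, which relies on statistical NLP components such as coreference resolution and relation detection.

```python
from dataclasses import dataclass, field


@dataclass
class Nugget:
    """Hypothetical container: one nugget with its supporting passages."""
    entity: str
    nugget_type: str
    description: str
    passages: list[str] = field(default_factory=list)


def group_nuggets(extractions):
    """Merge redundant extractions into nuggets with passage clusters.

    `extractions` is an iterable of (entity, nugget_type, description,
    passage) tuples. Extractions are treated as redundant when entity,
    type, and normalized description match exactly; this is a deliberately
    naive stand-in for real redundancy removal.
    """
    grouped = {}
    for entity, nugget_type, description, passage in extractions:
        normalized = description.strip().lower()
        key = (entity, nugget_type, normalized)
        if key not in grouped:
            grouped[key] = Nugget(entity, nugget_type, normalized)
        # Skip verbatim duplicate passages within the same nugget.
        if passage not in grouped[key].passages:
            grouped[key].passages.append(passage)
    # Nuggets with more supporting passages are listed first.
    return sorted(grouped.values(), key=lambda n: len(n.passages), reverse=True)


# Made-up example passages, for illustration only.
example = [
    ("Leon Panetta", "involvement in events", "meeting", "passage A"),
    ("Leon Panetta", "involvement in events", "Meeting ", "passage B"),
    ("Leon Panetta", "involvement in events", "meeting", "passage A"),
    ("Leon Panetta", "involvement in events", "news conference", "passage C"),
]
nuggets = group_nuggets(example)
```

With the made-up input above, the two "meeting" extractions collapse into one nugget backed by two distinct passages, and the "news conference" nugget keeps its single passage.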
Year
2012
DOI
10.1145/2348283.2348431
Venue
SIGIR
Keywords
generic news, richer information, news conference, evri categorize news article, semantic information, news aggregation application, clear semantics, salient information, present nugget, extracts information nugget, natural language processing, search engine, transitive closure, natural language, summarization, user interfaces, user interface
Field
Data mining, Ontology, News aggregator, Computer science, Automatic Content Extraction, Natural language processing, Artificial intelligence, Automatic summarization, Coreference, Information retrieval, Natural language, Parsing, Semantics
DocType
Conference
Citations
6
PageRank
0.56
References
0
Authors
6
Name, Order, Citations/PageRank
Vittorio Castelli, 1, 928129.71
Hema Raghavan, 2, 41421.18
Radu Florian, 3, 92491.44
Ding-Jung Han, 4, 172.46
Xiaoqiang Luo, 5, 71152.14
Salim Roukos, 6, 6248845.50