Distilling and exploring nuggets from a corpus - Citegraph

Paper Info

Title
Distilling and exploring nuggets from a corpus

Abstract
This paper describes a live and scalable system that automatically extracts information nuggets for entities/topics from a continuously updated corpus for effective exploration and analysis. A nugget is a piece of semantic information that (1) maps semantically to the transitive closure of a pre-defined ontology, (2) is explicitly supported by text, and (3) has a natural language description that completely conveys its semantics to a user. Fig. 1 shows a type of nugget, "involvement in events", for a person entity (Leon Panetta): each nugget has a short description ("meeting", "news conference") with a list of supporting passages. Our key contributions are: (1) we extract nuggets and remove redundancy to produce a summary of salient information with supporting clusters of passages; (2) we present an entity/topic-centric exploration interface that also allows users to navigate to other entities involved in a nugget; (3) we use the statistical NLP technologies developed over the years in the ACE, GALE, and TAC-KBP programs, including parsing, mention detection, within- and cross-document coreference resolution, relation detection, and slot filler extraction; (4) our system is flexible and easily adaptable across domains, as demonstrated on two corpora: generic news and scientific papers. Search engines such as Google News and Scholar do not retrieve nuggets, and only remove redundancy at the document level. News aggregation applications such as Evri categorize news articles based on entities or topics but do not extract nuggets. Other systems extract richer information, but not all of it has clear semantics; e.g., Silobreaker presents results as "the relationship between X and Y in the context of [keyphrase]", which leaves users to interpret the semantics because the result is not tied to a clear ontology. In contrast, we remove redundancy, summarize results, and present nuggets that have clear semantics.
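
The nugget structure described in the abstract (a short description plus a list of supporting passages, with redundant mentions merged) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the `Nugget` class, the `group_into_nuggets` function, the exact-match merging on a normalized description, and the sample passages are all assumptions made for the example. The actual system relies on statistical NLP components such as coreference resolution and relation detection, which are not modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Nugget:
    """Hypothetical container for one nugget (not the paper's data model)."""
    entity: str        # e.g. a person entity
    nugget_type: str   # assumed label standing in for an ontology type
    description: str   # short natural language description
    passages: List[str] = field(default_factory=list)  # supporting text


def group_into_nuggets(
    extractions: Iterable[Tuple[str, str, str, str]]
) -> List[Nugget]:
    """Merge extractions that share entity, type, and normalized description.

    Assumption: redundancy is approximated by exact match on the lowercased
    description; a real system would need semantic matching.
    """
    grouped: Dict[Tuple[str, str, str], Nugget] = {}
    for entity, nugget_type, description, passage in extractions:
        normalized = description.strip().lower()
        key = (entity, nugget_type, normalized)
        if key not in grouped:
            grouped[key] = Nugget(entity, nugget_type, normalized)
        if passage not in grouped[key].passages:
            grouped[key].passages.append(passage)
    # Nuggets with more supporting passages come first.
    return sorted(grouped.values(), key=lambda n: len(n.passages), reverse=True)


# Made-up passages, used only to exercise the sketch.
sample_extractions = [
    ("Leon Panetta", "involvement in events", "meeting",
     "Panetta attended a meeting on Monday."),
    ("Leon Panetta", "involvement in events", "Meeting",
     "Panetta held a meeting with officials."),
    ("Leon Panetta", "involvement in events", "news conference",
     "Panetta spoke at a news conference."),
]
nuggets = group_into_nuggets(sample_extractions)
```

Here the two "meeting" extractions collapse into a single nugget with two supporting passages, while "news conference" remains a separate nugget with one.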

Year	DOI	Venue
2012	10.1145/2348283.2348431	SIGIR
Keywords	Field	DocType
generic news,richer information,news conference,evri categorize news article,semantic information,news aggregation application,clear semantics,salient information,present nugget,extracts information nugget,natural language processing,search engine,transitive closure,natural language,summarization,user interfaces,user interface	Data mining,Ontology,News aggregator,Computer science,Automatic Content Extraction,Natural language processing,Artificial intelligence,Automatic summarization,Coreference,Information retrieval,Natural language,Parsing,Semantics	Conference
Citations	PageRank	References
6	0.56	0
Authors
6

Authors (6 rows)

Name	Order	Citations	PageRank
Vittorio Castelli	1	928	129.71
Hema Raghavan	2	414	21.18
Radu Florian	3	924	91.44
Ding-Jung Han	4	17	2.46
Xiaoqiang Luo	5	711	52.14
Salim Roukos	6	6248	845.50
