Title
Efficient keyword extraction for meaningful document perception
Abstract
Keyword extraction is a common technique in information retrieval. Keywords serve as a minimal summary of a single document or a document collection, enabling the reader to quickly grasp the main contents of a text. However, they are often not readily available for the documents of interest. Common keyword extraction techniques demand a large data collection, a learning process, or access to extensive reference data. Because they rely on additional linguistic features (e.g., stop word removal), most approaches are language-restricted. Moreover, the extracted keywords usually pertain to the entire document rather than only to the portion that is of interest to the reader. In this paper, we present an efficient and flexible approach to summarizing selections of text within a document. Our solution is based on a keyword extraction algorithm that is applicable to a variety of documents, regardless of language or context. This algorithm relies on the Helmholtz principle and extends a recently presented approach. Our extension covers the features of a weighting algorithm while providing a self-regulation capability that allows for more meaningful results. Furthermore, our approach takes the document structure into account in order to improve on purely statistical summarization. We evaluate the efficiency of our approach and present results with meaningful examples. In addition, we outline further applications of our approach that allow for enhanced document perception as well as for meaningful document indexing and retrieval.
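The abstract does not state the formula behind the Helmholtz principle, so the following is only a minimal sketch under stated assumptions. It assumes the number-of-false-alarms (NFA) formulation often used in Helmholtz-principle keyword extraction: for a word that occurs m times in one container and K times across all N equally sized containers, NFA = C(K, m) / N^(m-1), and the word counts as meaningful when NFA < epsilon. The function name `helmholtz_keywords`, the `epsilon` parameter, and the pre-tokenized input are hypothetical choices for illustration. The authors' extension (weighting, self-regulation, use of document structure) is not reproduced here.

```python
import math
from collections import Counter


def helmholtz_keywords(containers, epsilon=1.0):
    """Return, for each container, the words whose NFA is below epsilon.

    `containers` is a list of token lists (e.g. paragraphs of one document).
    Assumption: containers are of roughly equal size, as the NFA formula expects.
    """
    n = len(containers)
    per_container = [Counter(tokens) for tokens in containers]

    # K: how often each word occurs across all containers.
    total = Counter()
    for counts in per_container:
        total.update(counts)

    result = []
    for counts in per_container:
        scores = {}
        for word, m in counts.items():
            k = total[word]
            # log NFA = log C(K, m) - (m - 1) * log N
            log_nfa = math.log(math.comb(k, m)) - (m - 1) * math.log(n)
            if log_nfa < math.log(epsilon):
                # Meaning score: -(1/m) * log NFA; larger means more unexpected.
                scores[word] = -log_nfa / m
        result.append(sorted(scores, key=scores.get, reverse=True))
    return result


paragraphs = [
    ["helmholtz"] * 4 + ["the"] * 2,
    ["the", "the", "text"],
    ["the", "the", "word"],
    ["the", "the"],
]
keywords = helmholtz_keywords(paragraphs)
```

In this toy input, "the" is spread evenly over all containers and is therefore not flagged, without any stop word list. That illustrates, under the assumptions above, why such a measure can work independently of language.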
Year
2011
DOI
10.1145/2034691.2034732
Venue
ACM Symposium on Document Engineering
Keywords
single document, entire document, efficient keyword extraction, flexible approach, keyword extraction algorithm, keyword extraction, enhanced document perception, document collection, meaningful document indexing, meaningful document perception, common keyword extraction technique, document structure, heuristic algorithm, reference data, information retrieval
Field
Data collection, tf–idf, Information retrieval, Heuristic (computer science), Document clustering, Computer science, Keyword extraction, Document Structure Description, Search engine indexing, Database, Stop words
DocType
Conference
Citations
6
PageRank
0.48
References
22
Authors
3
Name
Order
Citations
PageRank
Thomas Bohne171.18
Sebastian Rönnau2786.28
Uwe M. Borghoff3412175.51