Title
Learning document aboutness from implicit user feedback and document structure
Abstract
Capturing the "aboutness" of documents has been a key research focus throughout the history of automated textual information processing. In this work, we represent aboutness using the words and phrases that best reflect the central topics of a document. We present a machine learning approach that learns to score and rank the words and phrases in a document according to their relevance to that document. We use implicit user feedback available in search engine click logs to characterize the user-perceived notion of term relevance. Using a small set of manually generated training data, we show that the surrogate training data derived from click logs correlates well with the manual data, thus eliminating the need to create training data manually, which is both expensive and fundamentally difficult for such a task. Further, our learning model uses a diverse set of features that capitalize heavily on the structural and visual properties of web documents. In our extensive experimentation, we pay particular attention to tail web pages and show that our approach, trained mainly on head web pages, generalizes and performs well on all kinds of documents. In several evaluations using manually generated summaries and term relevance judgments, our system shows a 25% improvement over other aboutness solutions.
Year
2009
DOI
10.1145/1645953.1646002
Venue
CIKM
Keywords
click log,term relevance judgment,tail web page,training data,implicit user feedback,aboutness solution,web pages generalizes,document aboutness,surrogate training data,web document,term relevance,document structure,diverse set,information processing,machine learning,search engine,web pages,ranking
Field
Training set,Data mining,Web page,Ranking,Information retrieval,Computer science,Textual information,Document Structure Description,Aboutness,Artificial intelligence,Natural language processing,Small set
DocType
Conference
Citations
31
PageRank
1.74
References
27
Authors
1
Name
Deepa Paranjpe
Order
1
Citations
160
PageRank
9.39