Title
Genre identification for office document search and browsing
Abstract
When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve the performance of genre identification. Experiments were conducted on the open-set identification of four coarse office document genres: technical paper, photo, slide, and table. Our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to identification of coarse office document genres. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections.
Year: 2012
DOI: 10.1007/s10032-011-0163-7
Venue: IJDAR
Keywords: office documents, image features, text features, genre identification, classification
Field: Information retrieval, Faceted search, Computer science, Feature (computer vision), Support vector machine, Classifier (linguistics)
DocType: Journal
Volume: 15
Issue: 3
ISSN: 1433-2825
Citations: 3
PageRank: 0.44
References: 25
Authors: 5
Name | Order | Citations | PageRank
Francine Chen | 1 | 1218 | 153.96
Andreas Girgensohn | 2 | 1724 | 185.73
Matthew Cooper | 3 | 798 | 76.01
Yijuan Lu | 4 | 732 | 46.24
Gerry Filby | 5 | 31 | 2.84