Abstract | ||
---|---|---|
When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve the performance of genre identification. Experiments were conducted on the open-set identification of four coarse office document genres: technical paper, photo, slide, and table. Our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to identification of coarse office document genres. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections. |
Year | DOI | Venue |
---|---|---|
2012 | 10.1007/s10032-011-0163-7 | IJDAR |
Keywords | Field | DocType |
office documents,image features,text features,genre identification,classification | Information retrieval,Faceted search,Computer science,Feature (computer vision),Support vector machine,Classifier (linguistics) | Journal |
Volume | Issue | ISSN |
15 | 3 | 1433-2825 |
Citations | PageRank | References |
3 | 0.44 | 25 |
Authors | ||
5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Francine Chen | 1 | 1218 | 153.96 |
Andreas Girgensohn | 2 | 1724 | 185.73 |
Matthew Cooper | 3 | 798 | 76.01 |
Yijuan Lu | 4 | 732 | 46.24 |
Gerry Filby | 5 | 31 | 2.84 |