Abstract | ||
---|---|---|
Managing large document databases has become an important task. Automatically comparing document layouts, and classifying and searching documents by their visual appearance, is desirable in many applications. We propose a new algorithm that approximates a metric function between documents based on their visual similarity. The comparison is based only on the visual appearance of the document, without taking its text content into consideration. We measure the similarity of single-page documents with respect to distance functions between three document components: background, text, and saliency. Each document component is represented as a Gaussian mixture distribution, and distances between the components of different documents are calculated as an approximation of the Hellinger distance between the corresponding distributions. Since the Hellinger distance obeys the triangle inequality, it is well suited to nearest neighbor search in a document database: candidates can be ruled out without computing their distance to the query, so the computation required to find similar documents can be significantly reduced. |
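
The role of the triangle inequality in the search step can be illustrated with a minimal sketch. It is not the authors' method: the abstract describes an approximation of the Hellinger distance between Gaussian mixtures, whereas the code below assumes each item is a single univariate Gaussian `(mu, sigma)`, for which the Hellinger distance has a closed form. The function names `hellinger_gauss` and `nearest_with_pivot`, and the single-pivot pruning scheme, are hypothetical choices made for illustration only.

```python
import math


def hellinger_gauss(mu1, sigma1, mu2, sigma2):
    """Closed-form Hellinger distance between two univariate Gaussians.

    Assumption: single Gaussians stand in for the paper's Gaussian mixtures.
    """
    var_sum = sigma1 ** 2 + sigma2 ** 2
    # Bhattacharyya coefficient of the two densities.
    bc = math.sqrt(2.0 * sigma1 * sigma2 / var_sum) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * var_sum)
    )
    # Clamp guards against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - bc))


def nearest_with_pivot(query, database, pivot, dist):
    """Nearest neighbor search that skips candidates via the triangle inequality.

    Returns (best_item, best_distance, evaluated), where `evaluated` counts
    the query-to-candidate distances that were actually computed.
    """
    # In practice these pivot distances would be precomputed once, offline.
    pivot_dists = [dist(pivot, item) for item in database]
    d_query_pivot = dist(query, pivot)

    best, best_d, evaluated = None, math.inf, 0
    for item, d_item_pivot in zip(database, pivot_dists):
        # Triangle inequality: |d(q, p) - d(p, x)| <= d(q, x).
        # If this lower bound cannot beat the current best, skip the candidate.
        if abs(d_query_pivot - d_item_pivot) >= best_d:
            continue
        d = dist(query, item)
        evaluated += 1
        if d < best_d:
            best, best_d = item, d
    return best, best_d, evaluated


def gauss_dist(a, b):
    """Distance between two (mu, sigma) tuples."""
    return hellinger_gauss(a[0], a[1], b[0], b[1])


if __name__ == "__main__":
    database = [(0.0, 1.0), (5.0, 1.0), (0.2, 1.1), (10.0, 2.0)]
    query = (0.1, 1.0)
    best, best_d, evaluated = nearest_with_pivot(
        query, database, pivot=database[0], dist=gauss_dist
    )
    print(best, best_d, evaluated, "of", len(database))
```

The pruning test is valid only because the distance is a true metric; with a similarity score that violates the triangle inequality, skipped candidates could include the real nearest neighbor.
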
Year | DOI | Venue |
---|---|---|
2011 | 10.1145/2034691.2034722 | ACM Symposium on Document Engineering |
Keywords | Field | DocType |
document layout,visual appearance,hellinger distance,document search,search document,document component,document visual similarity measure,large document databases,similar document,single page document,document database,different document,distance function,document retrieval,nearest neighbor search,triangle inequality | Hellinger distance,Similarity measure,Information retrieval,Pattern recognition,Computer science,Document clustering,Document layout analysis,Metric (mathematics),Artificial intelligence,Document retrieval,Nearest neighbor search,Visual appearance | Conference |
Citations | PageRank | References |
3 | 0.46 | 6 |
Authors | ||
8 |
Name | Order | Citations | PageRank |
---|---|---|---|
Ildus Ahmadullin | 1 | 6 | 1.21 |
Jan P. Allebach | 2 | 1230 | 170.88 |
Niranjan Damera-Venkata | 3 | 99 | 11.99 |
Jian Fan | 4 | 88 | 10.74 |
Seungyon Lee | 5 | 170 | 11.01 |
Qian Lin | 6 | 88 | 10.97 |
Jerry Liu | 7 | 82 | 8.65 |
Eamonn O'Brien-Strain | 8 | 264 | 18.47 |