Title
The Detection of Duplicates in Document Image Databases
Abstract
In this paper, we propose and implement a method for detecting duplicate documents in very large image databases. The method is based on a robust "signature" extracted from each document image, which is used to index into a table of previously processed documents. The approach has a number of advantages over OCR or other recognition-based methods, including speed and robustness to imaging distortions. To justify the approach and test its scalability, we have developed a simulator that allows us to change parameters of the system and examine performance for millions of document signatures. A complete system is implemented and tested on a collection of technical articles and memos.
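The signature-and-table idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the signature itself, so everything below is an assumption made for illustration: `coarse_signature` (quantized ink density on a coarse grid), the `DuplicateIndex` class, and the `grid` and `levels` parameters are hypothetical names and are not taken from the paper.

```python
from collections import defaultdict


def coarse_signature(image, grid=4, levels=4):
    """Toy signature: quantized ink density per cell of a grid x grid partition.

    `image` is a list of equal-length rows of 0/1 pixels (1 = ink).
    This is a stand-in feature, not the signature used in the paper.
    """
    height, width = len(image), len(image[0])
    signature = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            area = max(1, (y1 - y0) * (x1 - x0))
            ink = sum(image[y][x] for y in range(y0, y1) for x in range(x0, x1))
            # Quantize the density so that small pixel-level noise
            # tends to leave the bin unchanged.
            signature.append(min(levels - 1, ink * levels // area))
    return tuple(signature)


class DuplicateIndex:
    """Table of previously processed documents, keyed by signature."""

    def __init__(self):
        self._table = defaultdict(list)

    def add(self, doc_id, image):
        """Insert a document; return ids of earlier documents with the same signature."""
        key = coarse_signature(image)
        earlier = list(self._table[key])
        self._table[key].append(doc_id)
        return earlier
```

In this sketch, a copy of a page with a single dropped ink pixel still maps to the same key, so `add` reports the earlier document as a candidate duplicate. Exact-match lookup is brittle when a density falls near a bin boundary, so a practical system would likely need some tolerance in the comparison.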
Year
1998
DOI
10.1016/S0262-8856(98)00054-7
Venue
Image and Vision Computing
Keywords
robustness, database systems, system testing, indexation, coding, data bases, systems approach, queueing theory, distortion, feature extraction, distributed environment, indexes, collection, image retrieval, image recognition, simulators
Field
Data mining, Document imaging, Distributed Computing Environment, Computer science, Search engine indexing, Image quality, Image processing, Robustness (computer science), Shape coding, Database, Scalability
DocType
Journal
Volume
16
Issue
12-13
ISSN
Image and Vision Computing
ISBN
0-8186-7898-4
Citations
34
PageRank
4.23
References
4
Authors
3
Name | Order | Citations | PageRank
David Doermann | 1 | 4313 | 312.70
Huiping Li | 2 | 98 | 12.58
Omid E. Kia | 3 | 66 | 11.12