Abstract |
---|
We describe our efforts to provide automated search of handwritten content in digitized document archives. To carry out the search we use a computer vision technique called word spotting. A form of content-based image retrieval, it avoids the still-difficult task of directly recognizing text: a user searches with a query image containing handwritten text, and the images in a database are ranked by how closely their content resembles the query. Making this search capability available on an archive requires three computationally expensive pre-processing steps. We augment this automated portion of the process with a passive crowdsourcing element that mines queries from the system's users to improve the results of future queries. We benchmark the proposed framework on 1930s Census data, a collection of roughly 3.6 million forms and 7 billion individual units of information. |
Year | DOI | Venue |
---|---|---|
2012 | 10.1109/SC.Companion.2012.259 | High Performance Computing, Networking, Storage and Analysis |
Keywords | Field | DocType |
search capability,query image,billion individual unit,handwritten text,non-traditional use,census data,image retrieval,automated portion,million form,handwritten content,automated search,information retrieval systems,computer vision,big data,parallel processing | Data mining,Automatic image annotation,Query expansion,Information retrieval,Computer science,Full text search,Image retrieval,Document retrieval,Concept search,Content-based image retrieval,Visual Word | Conference |
ISBN | Citations | PageRank |
978-1-4673-6218-4 | 0 | 0.34 |
References | Authors |
2 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Liana Diesendruck | 1 | 12 | 3.60 |
Luigi Marini | 2 | 85 | 14.61 |
Rob Kooper | 3 | 1234 | 235.10 |
Mayank Kejriwal | 4 | 39 | 11.73 |
Kenton McHenry | 5 | 54 | 11.15 |