Title
Abstract: Digitization and Search: A Non-Traditional Use of HPC
Abstract
We describe our efforts to provide automated search of handwritten content in digitized document archives. The search relies on a computer vision technique called word spotting. A form of content-based image retrieval, word spotting avoids the still-difficult task of directly recognizing text: a user searches with a query image containing handwritten text, and the system ranks a database of images by how similar their content looks to the query. Making this search capability available on an archive requires three computationally expensive pre-processing steps. We augment this automated portion of the process with a passive crowdsourcing element that mines queries from the system's users to improve the results of future queries. We benchmark the proposed framework on 1930s Census data, a collection of roughly 3.6 million forms and 7 billion individual units of information.
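The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which image features or distance measure the authors use, so the fixed-length feature vectors, the Euclidean distance, and every name below (`rank_by_similarity`, the `cell_…` identifiers) are assumptions made purely for illustration.

```python
import math


def rank_by_similarity(query, database):
    """Return (image_id, distance) pairs, most similar image first.

    `query` and each database entry are hypothetical fixed-length feature
    vectors; the actual features used for word spotting are not given in
    the abstract.
    """
    scored = [
        (image_id, math.dist(query, features))  # Euclidean distance (assumed)
        for image_id, features in database.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Toy data: three pre-processed image regions and one query image.
database = {
    "cell_001": [0.90, 0.10, 0.40],
    "cell_002": [0.20, 0.80, 0.50],
    "cell_003": [0.85, 0.15, 0.35],
}
query = [0.88, 0.12, 0.38]

ranking = rank_by_similarity(query, database)
for image_id, distance in ranking:
    print(image_id, round(distance, 4))
```

Here the search returns an ordered list of candidate images rather than recognized text, which is the property that lets word spotting sidestep handwriting recognition.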
Year: 2012
DOI: 10.1109/SC.Companion.2012.259
Venue: High Performance Computing, Networking, Storage and Analysis
Keywords: search capability, query image, billion individual unit, handwritten text, non-traditional use, census data, image retrieval, automated portion, million form, handwritten content, automated search, information retrieval systems, computer vision, big data, parallel processing
Field: Data mining, Automatic image annotation, Query expansion, Information retrieval, Computer science, Full text search, Image retrieval, Document retrieval, Concept search, Content-based image retrieval, Visual Word
DocType: Conference
ISBN: 978-1-4673-6218-4
Citations: 0
PageRank: 0.34
References: 2
Authors: 5
Name
Order
Citations
PageRank
Liana Diesendruck1123.60
Luigi Marini28514.61
Rob Kooper31234235.10
Mayank Kejriwal43911.73
Kenton McHenry55411.15