Paper Info

Title
Using Lucene to index and search the digitized 1940 US Census

Abstract
An improved approach toward enabling search capabilities over large digitized document archives is described, in which Lucene indices were incorporated in a framework developed to provide automatic searchable access to the 1940 US Census, a collection composed of digitized handwritten forms. As an alternative to trying to recognize the handwritten text in the images, Word Spotting feature vectors are used to describe each cell's content. Instead of querying the system using regular ASCII text, any query is rendered as an image, and a ranked list of matching results is presented to the user. Among other preprocessing steps required by the framework, an index must be compiled to provide fast access to the feature vectors. The advantages and drawbacks of using Lucene to index these vectors instead of other indexing methods are discussed in light of the challenges confronted when dealing with digitized document collections of considerable size. Copyright © 2014 John Wiley & Sons, Ltd.
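
The retrieval step the abstract describes (a query turned into a feature vector, then a ranked list of matching cells) can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `rank_cells`, the toy three-dimensional vectors, and the use of Euclidean distance are all assumptions made for illustration. The exhaustive scan shown here is the baseline that an index, such as the Lucene indices the abstract mentions, is meant to speed up.

```python
import math


def rank_cells(query_vec, cell_vecs, top_k=3):
    """Return the top_k (cell_id, distance) pairs closest to query_vec.

    Hypothetical helper: compares the query against every stored vector,
    which is the brute-force scan an index is meant to avoid.
    """
    scored = [
        (cell_id, math.dist(query_vec, vec))  # Euclidean distance (an assumption)
        for cell_id, vec in cell_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:top_k]


# Toy stand-ins for feature vectors of form cells (made-up values).
cells = {
    "cell_a": [0.0, 0.0, 1.0],
    "cell_b": [1.0, 1.0, 1.0],
    "cell_c": [0.0, 0.1, 0.9],
}

# Stand-in for the feature vector of a query rendered as an image.
query = [0.0, 0.0, 1.0]

results = rank_cells(query, cells, top_k=2)
for cell_id, distance in results:
    print(cell_id, round(distance, 3))
```

The cost of this scan grows linearly with the number of stored cells, which is why the abstract treats index construction as a required preprocessing step for a collection of this size.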

Year	2013
DOI	10.1002/cpe.3250
Venue	Concurrency and Computation: Practice & Experience
Keywords	lucene, feature vector, regular ascii text, word spotting feature vector, digitized handwritten form, automatic searchable access, content based retrieval, approximate similarity search, searchable access, lucene index, digitized document collection, large digitized document archives, fast access, handwritten text, us census
DocType	Conference
Volume	26
Issue	13
ISSN	1532-0626
Citations	4
PageRank	0.50
References	8
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Liana Diesendruck	1	12	3.60
Rob Kooper	2	1234	235.10
Luigi Marini	3	85	14.61
Kenton McHenry	4	54	11.15
