Paper Info

Title
Automatic extraction of catalog data from digital images of historical manuscripts.

Abstract
The Cairo Genizah, discovered in the late 19th century, is a collection of handwritten historical documents containing approximately 350,000 fragments of mainly Jewish texts. The fragments are today spread out in more than seventy libraries and private collections worldwide, and there is an ongoing effort to document and catalog all extant fragments. We explore three levels of extraction of catalog data from digital images of the fragments. First, images should be captured in a way that permits standardized automatic processing. Second, the images can be processed to detect elements such as image foreground, regions of written text, and lines of the text, thereby allowing for the automatic assignment of conventional catalog measurements. Third, modern computer-vision tools and statistical inference techniques may be used to identify fragments that might originate from the same original codex. Such matched fragments, commonly referred to as 'joins', were heretofore identified manually by experts, and presumably only a small fraction of existing joins have been discovered to date. Overall, we present what might be the first effort to address all three levels successfully within a large-scale project, detailing the various design choices and describing the techniques and algorithms used for the Cairo Genizah digitization project.

Field	Value
Year	2013
DOI	10.1093/llc/fqt007
Venue	Literary and Linguistic Computing
DocType	Journal
Volume	28
Issue	SP2
ISSN	0268-1145
Citations	1
PageRank	0.37
References	6
Authors	4

Authors

Name	Order	Citations	PageRank
Roni Shweka	1	30	3.96
Yaacov Choueka	2	241	202.83
Lior Wolf	3	5501	352.38
Nachum Dershowitz	4	2818	473.00