Title
Making Two Vast Historical Manuscript Collections Searchable and Extracting Meaningful Textual Features Through Large-Scale Probabilistic Indexing
Abstract
Textual access to large collections of digitized images remains infeasible because they usually lack transcripts, and transcribing such collections is typically prohibitively expensive. Probabilistic indices, however, can provide textual access with only moderate resource demands. Besides enabling effortless information retrieval, we show that probabilistic indices can also be used to estimate textual features of indexed but otherwise untranscribed collections, such as the number of running words and Zipf's curves. Complete probabilistic indices have recently been produced for two iconic large collections: "Bentham" (90K images) and "Spanish Golden Age Theater" (40K images). To show the impact of making these collections searchable, we report access statistics gathered through their search interfaces. To the best of our knowledge, these are the first large collections of untranscribed manuscripts to be made publicly available for effective and efficient textual access.
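The feature estimation mentioned in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that a probabilistic index can be flattened into (word, relevance probability) entries, that the expected frequency of a word is the sum of its probabilities, and that the number of running words is the sum over all entries. The function names, the toy data, and the case folding are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def expected_word_counts(index_entries):
    """Sum relevance probabilities per word to get expected frequencies.

    index_entries: iterable of (word, probability) pairs. This flat layout
    is an assumption made for the sketch, not the paper's index format.
    """
    counts = defaultdict(float)
    for word, prob in index_entries:
        # Case folding is an illustrative choice, not a documented step.
        counts[word.lower()] += prob
    return dict(counts)


def zipf_curve(counts):
    """Return (rank, expected_frequency) pairs in decreasing frequency order."""
    ranked = sorted(counts.values(), reverse=True)
    return list(enumerate(ranked, start=1))


# Hypothetical toy index: each entry is one word hypothesis spotted in an image.
toy_index = [
    ("the", 0.75), ("the", 0.5), ("law", 0.5),
    ("low", 0.25), ("The", 0.5), ("law", 0.25),
]

counts = expected_word_counts(toy_index)
running_words = sum(counts.values())  # estimated number of running words
curve = zipf_curve(counts)            # points for a rank-frequency plot
```

Plotting `curve` on log-log axes would give an estimated Zipf curve. Under these assumptions, competing hypotheses for the same image region (such as "law" and "low") each contribute only their share of probability mass, so the running-word estimate is not inflated by alternative readings.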
Year: 2019
DOI: 10.1109/ICDAR.2019.00026
Venue: 2019 International Conference on Document Analysis and Recognition (ICDAR)
Keywords: search on large historical manuscript collections, probabilistic indexing and search, Zipf's law, keyword spotting, handwritten text
Field: Transcription (linguistics), Zipf's law, Information retrieval, Pattern recognition, Computer science, Search engine indexing, Keyword spotting, Artificial intelligence, Probabilistic logic
DocType: Conference
ISSN: 1520-5363
ISBN: 978-1-7281-3015-6
Citations: 0
PageRank: 0.34
References: 7
Authors: 4
Name | Order | Citations | PageRank
Alejandro Héctor Toselli | 1 | 0 | 0.34
Verónica Romero-Gomez | 2 | 0 | 0.34
Joan-Andreu Sánchez | 3 | 198 | 29.00
Enrique Vidal | 4 | 1096 | 85.46