Title
The ENP image and ground truth dataset of historical newspapers
Abstract
This paper presents a research dataset of historical newspapers comprising over 500 page images, uniquely representative of European cultural heritage, drawn from the digitisation projects of 12 national and major European libraries and created within the Europeana Newspapers Project (ENP), a large-scale digitisation project. Every image is accompanied by comprehensive ground truth (Unicode-encoded full text, layout information with precise region outlines, type labels, and reading order) in PAGE format, together with searchable metadata about document characteristics and artefacts. The first part of the paper describes the nature of the dataset, how it was built, and the challenges encountered. The second part gives a baseline for two state-of-the-art OCR systems (ABBYY FineReader Engine 11 and Tesseract 3.03) with regard to both text recognition and segmentation/layout analysis performance.
Year
2015
DOI
10.1109/ICDAR.2015.7333898
Venue
International Conference on Document Analysis and Recognition
Keywords
image dataset, document analysis, ground truth, historical documents
Field
Metadata, World Wide Web, Digitization, Cultural heritage, Information retrieval, Computer science, Segmentation, Newspaper, Ground truth, Tesseract, Unicode
DocType
Conference
ISSN
1520-5363
Citations
5
PageRank
0.54
References
5
Authors
4
Name, Order, Citations, PageRank
Christian Clausner, 1, 44, 8.49
Christos Papadopoulos, 2, 58, 4.06
Stefan Pletschacher, 3, 216, 20.78
Apostolos Antonacopoulos, 4, 378, 36.45