Title
Automated processing of digitized historical newspapers beyond the article level: sections and regular features
Abstract
Millions of pages of historical newspapers have been digitized, but in most cases access to them is supported by only basic search services. We are exploring interactive services that would better support access to these collections, including automatic categorization of articles. Such categorization is difficult because of the uneven quality of the OCR text, but many clues can help improve its accuracy. Here, we describe observations of several historical newspapers to determine the characteristics of their sections. We then explore how to automatically identify those sections and how to detect serialized feature articles that recur across days and weeks. The goal is not to introduce new algorithms but to develop practical and robust techniques. For both analyses we find substantial success for some categories and articles, but others prove very difficult.
Year
2010
DOI
10.1007/978-3-642-13654-2_11
Venue
ICADL
Keywords
cases access, interactive service, new algorithm, article level, automated processing, basic search service, robust technique, substantial success, regular feature, ocr text, serialized feature article, digitized historical newspaper, historical newspaper, automatic categorization, newspapers, classification, digital humanities
Field
Categorization, Data mining, World Wide Web, Information retrieval, Computer science, Newspaper, Text processing
DocType
Conference
Volume
6102
ISSN
0302-9743
ISBN
3-642-13653-2
Citations
1
PageRank
0.56
References
5
Authors
2
Name
Order
Citations
PageRank
Robert B. Allen 12030338.48
Catherine Hall 2977.62