Title
Corpus Assembly as Text Data Integration from Digital Libraries and the Web.
Abstract
We explore a new corpus construction workflow that exploits the growing number of Digital Libraries worldwide and the ever-expanding Internet Archive. Rather than building corpora from scratch (which typically consumes a huge amount of resources), we search the Web for fragments of relevant digitized content scattered across the world, check their digitization quality, select the digital versions of highest quality, and finally assemble from these pieces an integrated corpus with maximum coverage of the targeted resource. As a use case within the framework of Digital Humanities, we illustrate this approach for the Allgemeine Literatur-Zeitung (General Literature Gazette, ALZ), published from 1785 to 1849 and considered one of the most important text collections from the Romantic Age in Germany. Starting from many incomplete and overlapping fragments physically scattered over many Web sites, we assembled these fragments and bound them together in a homogeneous format, thus constructing the first (almost) complete corpus of the ALZ, now accessible as a whole (in XML format conforming to TEI standards) for in-depth scientific investigations.
Year: 2019
DOI: 10.1109/JCDL.2019.00014
Venue: JCDL
Keywords: Digital Humanities, Digital Libraries, Internet Archive, Document Management, German Romanticism, Allgemeine Literatur-Zeitung
Field: Data integration, World Wide Web, Digitization, Information retrieval, XML, Document management system, Computer science, Exploit, Digital library, Workflow, The Internet
DocType: Conference
ISSN: 2575-7865
ISBN: 978-1-7281-1547-4
Citations: 0
PageRank: 0.34
References: 0
Authors: 2
Name | Order | Citations | PageRank
Udo Hahn | 1 | 937 | 88.14
Tinghui Duan | 2 | 0 | 1.01