Abstract |
---|
We explore a new corpus construction workflow that exploits the growing number of Digital Libraries worldwide and the ever-expanding Internet Archive. Rather than building corpora from scratch (which typically consumes a huge amount of resources), we search the Web for fragments of relevant digitized content scattered across the world, check their digitization quality, select the digital versions with the highest quality, and finally assemble from those pieces an integrated corpus with maximum coverage of the targeted resource. As a use case within the framework of Digital Humanities, we illustrate this approach for the Allgemeine Literatur-Zeitung (General Literature Gazette, ALZ), published from 1785 to 1849, which is considered one of the most important text collections from the Romantic Age in Germany. Starting from many incomplete and overlapping fragments physically scattered over many Web sites, we assembled these fragments, bound them together in a homogeneous format, and thus constructed the first (almost) complete corpus of the ALZ, now accessible as a whole (in XML format conforming to TEI standards) for in-depth scientific investigation.

Year | DOI | Venue |
---|---|---|
2019 | 10.1109/JCDL.2019.00014 | JCDL |

Keywords | Field | DocType |
---|---|---|
Digital Humanities, Digital Libraries, Internet Archive, Document Management, German Romanticism, Allgemeine Literatur-Zeitung | Data integration, World Wide Web, Digitization, Information retrieval, XML, Document management system, Computer science, Exploit, Digital library, Workflow, The Internet | Conference |

ISSN | ISBN | Citations |
---|---|---|
2575-7865 | 978-1-7281-1547-4 | 0 |

PageRank | References | Authors |
---|---|---|
0.34 | 0 | 2 |

Name | Order | Citations | PageRank |
---|---|---|---|
Udo Hahn | 1 | 937 | 88.14 |
Tinghui Duan | 2 | 0 | 1.01 |