Corpus Assembly as Text Data Integration from Digital Libraries and the Web.

Paper Info

Title
Corpus Assembly as Text Data Integration from Digital Libraries and the Web.

Abstract
We explore a new corpus construction workflow that exploits the potential of the growing number of Digital Libraries worldwide and the ever-expanding Internet Archive. Rather than building corpora from scratch (which typically consumes a huge amount of resources), we search the Web for fragments of relevant digitized content scattered across the world, check their digitization quality, select the digital versions with the highest quality, and finally assemble from those pieces an integrated corpus with maximum coverage of the targeted resource. As a use case within the framework of Digital Humanities, we illustrate this approach for the Allgemeine Literatur-Zeitung (General Literature Gazette, ALZ), published from 1785 to 1849, which is considered one of the most important text collections from the Romantic Age in Germany. Starting from many incomplete and overlapping fragments physically scattered over many Web sites, we assembled these fragments, bound them together in a homogeneous format, and thus constructed the first (almost) complete corpus of the ALZ, now accessible as a whole (in XML format conforming to TEI standards) for in-depth scientific investigation.
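
The select-and-assemble steps named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Fragment` record, the numeric `quality` score, and the `assemble` function are hypothetical names introduced here, and the abstract does not say how digitization quality is actually measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One digitized piece found on the Web (hypothetical record layout)."""
    source: str     # site that hosts the fragment (assumed label)
    unit: str       # part of the targeted resource it covers, e.g. a year
    quality: float  # assumed digitization-quality score, higher is better


def assemble(fragments, target_units):
    """Keep the highest-quality fragment per unit and report coverage."""
    best = {}
    for frag in fragments:
        if frag.unit not in target_units:
            continue  # not part of the targeted resource
        current = best.get(frag.unit)
        if current is None or frag.quality > current.quality:
            best[frag.unit] = frag
    # Coverage is the share of targeted units with at least one fragment.
    coverage = len(best) / len(target_units) if target_units else 0.0
    return best, coverage


# Made-up example data: two overlapping fragments for 1785, none for 1787.
fragments = [
    Fragment("site_a", "1785", 0.71),
    Fragment("site_b", "1785", 0.93),
    Fragment("site_a", "1786", 0.80),
]
corpus, coverage = assemble(fragments, {"1785", "1786", "1787"})
```

Overlapping fragments are resolved by keeping one winner per unit, and units with no fragment lower the coverage figure, which mirrors the "(almost) complete" outcome described above.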

Year	2019
DOI	10.1109/JCDL.2019.00014
Venue	JCDL
Keywords	Digital Humanities, Digital Libraries, Internet Archive, Document Management, German Romanticism, Allgemeine Literatur-Zeitung
Field	Data integration, World Wide Web, Digitization, Information retrieval, XML, Document management system, Computer science, Exploit, Digital library, Workflow, The Internet
DocType	Conference
ISSN	2575-7865
ISBN	978-1-7281-1547-4
Citations	0
PageRank	0.34
References	0
Authors	2

Authors (2 rows)

Name	Order	Citations	PageRank
Udo Hahn	1	937	88.14
Tinghui Duan	2	0	1.01