Title
Exploring Archives With Probabilistic Models: Topic Modelling For The Valorisation Of Digitised Archives Of The European Commission
Abstract
Topic Modelling (TM) has gained momentum over the last few years within the humanities to analyze topics represented in large volumes of full text. This paper proposes an experiment with the usage of TM based on a large subset of digitized archival holdings of the European Commission (EC). Currently, millions of scanned and OCRed files are available and hold the potential to significantly change the way historians of the construction and evolution of the European Union can perform their research. However, due to a lack of resources, only minimal metadata are available on a file and document level, seriously undermining the accessibility of this archival collection. The article explores in an empirical manner the possibilities and limits of TM to automatically extract key concepts from a large body of documents spanning multiple decades. By mapping the topics to headings of the EUROVOC thesaurus, the proof of concept described in this paper offers the future possibility to represent the identified topics with the help of a hierarchical search interface for end-users.
Year: 2016
Venue: 2016 IEEE International Conference on Big Data (Big Data)
Keywords: LDA, topic modelling, archives, topic modeling
Field: Hierarchical search, Data science, Data mining, Metadata, Commission, Computer science, Proof of concept, Topic model, Probabilistic logic, Portable document format, European union
DocType: Conference
Citations: 0
PageRank: 0.34
References: 7
Authors: 5
Name
Order
Citations
PageRank
Simon Hengchen101.69
Mathias Coeckelbergs200.34
Seth van Hooland3518.71
Ruben Verborgh4630105.49
Thomas Steiner5747.84