Title
Exploring Web Archives Through Temporal Anchor Texts
Abstract
Web archives have been instrumental in the digital preservation of the Web and provide a great opportunity to study the societal past and its evolution. These archives are massive collections, typically on the order of terabytes to petabytes. As a result, search and exploration of archives have been limited, because full-text indexing is both resource-intensive and computationally expensive. We observe that typical access methods to archives, which are navigational and temporal in nature, do not always require a full-text index. Instead, text surrogates such as anchor texts already go a long way toward useful solutions and can act as reasonable entry points for exploring Web archives. In this paper, we present a new approach to searching Web archives based on temporal link graphs and their corresponding anchor texts. Departing from traditional informational intents, we show how temporal anchor texts can be effective in answering queries beyond purely navigational intents, such as finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated into a data processing workflow to scale up to thousands of pages. In this analysis, we were able to replicate results reported by an offline study, showing that restaurant prices in Germany indeed increased when the Euro was introduced as Europe's currency.
Year: 2017
DOI: 10.1145/3091478.3091500
Venue: WebSci
Keywords: Web Archives, Temporal Information Retrieval, Big Data Analysis
Field: Digital preservation, World Wide Web, Web page, Information retrieval, Terabyte, Computer science, Petabyte, Search engine indexing, Anchor text, Workflow, Big data
DocType: Conference
Citations: 3
PageRank: 0.38
References: 16
Authors: 3
Name: Helge Holzmann; Order: 1; Citations: 70; PageRank: 11.16
Name: Wolfgang Nejdl; Order: 2; Citations: 6633; PageRank: 556.13
Name: Avishek Anand; Order: 3; Citations: 102; PageRank: 11.61