Title
Mirror, mirror on the Web: a study of host pairs with replicated content
Abstract
Two previous studies, one done at Stanford in 1997 based on data collected by the Google search engine, and one done at Digital in 1996 based on AltaVista data, revealed that almost a third of the Web consists of duplicate pages. Both studies identified mirroring, that is, the systematic replication of content over a pair of hosts, as the principal cause of duplication, but did not further investigate this phenomenon. The main aim of this paper is to present a clearer picture of mirroring on the Web. As input we used a set of 179 million URLs found during a Web crawl done in the summer of 1998. We looked at all hosts with more than 100 URLs in our input (about 238,000), and discovered that about 10% were mirrored to varying degrees. The paper presents data about the prevalence of mirroring based on a mirroring classification scheme that we define. There are numerous reasons for mirroring: technical (e.g., to improve access time), commercial (e.g., different intermediaries offering the same products), cultural (e.g., same content in two languages), social (e.g., sharing of research data), and so forth. Although we have not done an exhaustive study of the causes of replication, we discuss and provide examples for several representative cases. Our technique for detecting mirrored hosts from large sets of collected URLs depends mostly on the syntactic analysis of URL strings, and requires retrieval and content analysis only for a small number of pages. We are able to detect both partial and total mirroring, and handle cases where the content is not byte-wise identical. Furthermore, our technique is computationally very efficient and does not assume that the initial set of URLs gathered from each host is comprehensive. Hence, this approach has practical uses beyond our study, and can be applied in other settings.
For instance, for web crawlers and caching proxies, detecting mirrors can be valuable to avoid redundant fetching, and knowledge of mirroring can be used to compensate for broken links.
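As a rough illustration of what a purely syntactic comparison of URL strings might look like, the following minimal sketch groups URLs by host and scores host pairs by the overlap of their path sets. It is not the authors' method: the function names, the Jaccard overlap measure, and the `min_paths` and `threshold` parameters are all assumptions made for this example. The all-pairs loop is only suitable for small inputs, and the content-checking step mentioned in the abstract is omitted.

```python
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit


def paths_by_host(urls):
    """Group URL paths by host name (hypothetical helper)."""
    hosts = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        if parts.hostname:
            # An empty path is treated as the root document.
            hosts[parts.hostname].add(parts.path or "/")
    return hosts


def path_overlap(paths_a, paths_b):
    """Jaccard similarity of two path sets (an assumed measure)."""
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 0.0


def candidate_mirrors(urls, min_paths=2, threshold=0.5):
    """Return (host_a, host_b, score) for host pairs with similar paths.

    min_paths and threshold are illustrative values, not taken from the paper.
    """
    hosts = {h: p for h, p in paths_by_host(urls).items() if len(p) >= min_paths}
    pairs = []
    # Naive all-pairs comparison: fine for a demo, too slow for a real crawl.
    for a, b in combinations(sorted(hosts), 2):
        score = path_overlap(hosts[a], hosts[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs
```

Pairs flagged this way would only be candidates. A real pipeline would still fetch a small sample of pages from each host to confirm that the content matches.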
Year: 1999
DOI: 10.1016/S1389-1286(99)00021-3
Venue: Computer Networks
Keywords: Mirroring, Content duplication, Smart proxies, Smart crawlers, Web statistics
DocType: Journal
Volume: 31
Issue: 11-16
ISSN: Computer Networks
Citations: 59
PageRank: 8.51
References: 10
Authors: 2
Name                Order   Citations   PageRank
Krishna A. Bharat   1       1211        252.86
Andrei Broder       2       7357        920.20