Title
Graph structure in the web: aggregated by pay-level domain
Abstract
Previous research on the overall graph structure of the World Wide Web has mostly focused on the page level, analyzing the graph that results directly from hyperlinks between individual web pages. This paper aims to provide additional insights into the macroscopic structure of the World Wide Web by analyzing an aggregated version of a recent web graph. The graph covers over 3.5 billion web pages and 128 billion hyperlinks between pages, and was crawled in the first half of 2012. We aggregate this graph by pay-level domain (PLD): all pages that belong to the same pay-level domain are represented by a single node, and an arc exists between two nodes if there is at least one hyperlink between pages of the corresponding pay-level domains. The resulting PLD graph covers 43 million PLDs and contains 623 million arcs between PLDs. Analyzing this aggregated graph allows us to present findings about linkage patterns between complete websites rather than only individual HTML pages. In this paper, we present basic statistics about the PLD graph, such as degree distributions, top-ranked PLDs, distances, and diameter. We analyze whether the bow-tie structure introduced by Broder et al. can also be identified in our PLD graph and reveal a backbone of highly interlinked websites within the graph. We group the websites by top-level domain and report findings about the overall linkage within and between different top-level domains. In a final experiment, we use data from the Open Directory Project (DMOZ) to categorize websites by topic and report findings about linkage patterns between websites belonging to different topical categories.
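The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper `naive_pld` is hypothetical and approximates a pay-level domain by the last two host labels, whereas a faithful extraction would consult the Public Suffix List (for example, to handle `co.uk`). Dropping arcs from a PLD to itself is also an assumption made here, not something the abstract states.

```python
from urllib.parse import urlsplit


def naive_pld(url):
    """Hypothetical helper: approximate the PLD as the last two host labels.

    Assumption: real PLD extraction needs the Public Suffix List; this
    shortcut is wrong for suffixes such as 'co.uk'.
    """
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:])


def aggregate_by_pld(page_links):
    """Collapse page-level hyperlinks into a set of PLD-level arcs.

    An arc (s, d) is kept if at least one page of PLD s links to a page
    of PLD d. Self-arcs are dropped (an assumption of this sketch).
    """
    arcs = set()
    for src_url, dst_url in page_links:
        s, d = naive_pld(src_url), naive_pld(dst_url)
        if s and d and s != d:
            arcs.add((s, d))
    return arcs


# Toy page-level graph with made-up URLs.
page_links = [
    ("http://a.example.com/x", "http://www.example.org/y"),
    ("http://b.example.com/z", "http://example.org/"),
    ("http://a.example.com/x", "http://b.example.com/z"),
]
pld_arcs = aggregate_by_pld(page_links)
```

Using a set means that many hyperlinks between the same pair of PLDs collapse into a single arc, which matches the "at least one hyperlink" condition in the abstract.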
Year
2014
DOI
10.1145/2615569.2615674
Venue
WebSci
Keywords
web graph, systems and software, world wide web, network analysis, web mining, web science, graph analysis
Field
Web science, World Wide Web, Graph database, Web mining, Information retrieval, Web page, Computer science, Directory, Power graph analysis, Hyperlink, Network analysis
DocType
Conference
Citations
13
PageRank
0.76
References
15
Authors
3
Name
Order
Citations
PageRank
Oliver Lehmberg11799.59
Robert Meusel223416.62
Christian Bizer38448524.93