Title
Fast crawling methods of exploring content distributed over large graphs
Abstract
Despite recent effort to estimate topology characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to develop a formal crawling methodology to characterize the vast amount of content distributed over these networks. Due to the large-scale nature of these networks and a limited query rate imposed by network service providers, exhaustively crawling and enumerating content maintained by each vertex is computationally prohibitive. In this paper, we show how one can obtain content properties by crawling only a small fraction of vertices and collecting their content. We first show that when sampling is naively applied, this can produce a huge bias in content statistics (i.e., average number of content replicas). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that this straightforward method requires to sample most vertices to obtain accurate estimates. To address this challenge, we propose two efficient estimators: special copy estimator (SCE) and weighted copy estimator (WCE) to estimate content characteristics using available information in sampled content. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate based on meta-information in sampled vertices. We conduct experiments on a variety of real-word and synthetic datasets, and the results show that WCE and SCE are cost effective and also “asymptotically unbiased”. Our methodology provides a new tool for researchers to efficiently query content distributed in large-scale networks.
Year
DOI
Venue
2019
10.1007/s10115-018-1178-x
Knowledge and Information Systems
Keywords
Field
DocType
Crawling,Online social networks,Sampling,Random walks
Graph,Data mining,Social network,Crawling,Vertex (geometry),Computer science,Random walk,Maximum likelihood,Sampling (statistics),Estimator
Journal
Volume
Issue
ISSN
59.0
1.0
0219-3116
Citations 
PageRank 
References 
0
0.34
41
Authors
5
Name
Order
Citations
PageRank
Ping-Hui Wang123633.39
Junzhou Zhao210418.36
John C.S. Lui33680279.85
Don Towsley4186931951.05
X. Guan51169137.97