Title
Set Cover at Web Scale
Abstract
The classic Set Cover problem requires selecting a minimum-size subset A ⊆ F from a family F of finite subsets of a universe U, such that the elements covered by A are exactly the elements covered by F. It arises naturally in many settings in web search, web mining, and web advertising. The greedy algorithm, which iteratively selects the set in F that covers the most uncovered elements, yields a (1 + ln |U|)-approximation but is inherently sequential. In this work we give the first MapReduce Set Cover algorithm that scales to problem sizes of ∼1 trillion elements and runs in log_p Δ iterations for a nearly optimal approximation ratio of p ln Δ, where Δ is the cardinality of the largest set in F.

A web crawler is a system for bulk downloading of web pages. Given a set of seed URLs, the crawler downloads them, extracts the hyperlinks embedded in them, and schedules the crawling of the pages addressed by those hyperlinks for a subsequent iteration. While the average page out-degree is ∼50, the crawled corpus grows at a much smaller rate, implying significant outlink overlap. Using our MapReduce Set Cover heuristic as a building block, we present the first large-scale seed generation algorithm that scales to ∼20 billion nodes and discovers new pages at a rate ∼4x faster than prior-art heuristics.
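Since the abstract centers on the greedy heuristic, a minimal sketch may help fix ideas. This is the classical sequential greedy algorithm the abstract contrasts against, not the paper's MapReduce variant; the function and variable names (greedy_set_cover, universe, family) and the toy seed-selection data are illustrative assumptions, not from the paper.

```python
def greedy_set_cover(universe, family):
    """Sequential greedy Set Cover: repeatedly pick the set that
    covers the most still-uncovered elements.

    universe: set of elements U
    family:   dict mapping a set id to its set of elements (the family F)
    Returns the ids of the chosen sets A. Greedy guarantees a
    (1 + ln |U|)-approximation of the optimal cover size.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Select the set with maximum marginal coverage.
        best = max(family, key=lambda s: len(family[s] & uncovered))
        if not family[best] & uncovered:
            break  # the family cannot cover the rest of U
        chosen.append(best)
        uncovered -= family[best]
    return chosen

# Toy seed-selection view: pages as sets of the URLs they link to;
# a cover is a small seed set whose outlinks reach all target URLs.
pages = {
    "p1": {"a", "b", "c"},
    "p2": {"b", "d"},
    "p3": {"c", "d", "e"},
}
print(greedy_set_cover({"a", "b", "c", "d", "e"}, pages))  # -> ['p1', 'p3']
```

Each iteration depends on the previous selection, which is why the abstract calls greedy inherently sequential and why the paper's MapReduce algorithm instead trades a factor-p loss in the approximation ratio for log_p Δ parallel rounds.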
Year
2015
DOI
10.1145/2783258.2783315
Venue
ACM Knowledge Discovery and Data Mining
Field
Data mining, Set cover problem, Web mining, Web page, Computer science, Cardinality, Hyperlink, Artificial intelligence, Heuristic, Algorithm, Greedy algorithm, Web crawler, Machine learning
DocType
Conference
Citations
6
PageRank
0.44
References
22
Authors
2
Name                      Order  Citations  PageRank
Stergios Stergiou         1      339        15.84
Kostas Tsioutsiouliklis   2      752        35.84