Title
Learning URL patterns for webpage de-duplication
Abstract
The presence of duplicate documents on the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and use these rules for de-duplication using only URL strings, without fetching the content explicitly. Our technique consists of mining the crawl logs and using clusters of similar pages to extract transformation rules, which are used to normalize URLs belonging to each cluster. Preserving each mined rule for de-duplication is not efficient due to the large number of such rules. We present a machine learning technique to generalize the set of rules, which reduces the resource footprint enough to be usable at web scale. The rule extraction techniques are robust against web-site specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication. Experimental results demonstrate that our approach achieves 2 times more reduction in duplicates with only half the rules compared to the most recent previous approach. The scalability of the framework is demonstrated by a large-scale evaluation on a set of 3 billion URLs, implemented using the MapReduce framework.
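As a rough illustration of the idea in the abstract (mine rules from clusters of duplicate URLs, then normalize unseen URLs with those rules), the following minimal sketch may help. It is not the paper's algorithm: it only learns one very simple kind of rule, "this query parameter does not affect content on this host", and it skips the rule-generalization step entirely. The function names `mine_ignorable_params` and `normalize`, and the `example.com` URLs, are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def mine_ignorable_params(clusters):
    """Learn, per host, the query parameters whose value differs between
    URLs that are known to be duplicates of each other.

    `clusters` is an iterable of lists of URLs; each list is assumed to be
    a group of pages with the same content (e.g. derived from crawl logs).
    """
    ignorable = defaultdict(set)
    for cluster in clusters:
        parts = [urlsplit(u) for u in cluster]
        # Simplification: only use clusters whose URLs share host and path.
        if len({(p.netloc, p.path) for p in parts}) != 1:
            continue
        host = parts[0].netloc
        queries = [dict(parse_qsl(p.query)) for p in parts]
        for key in set().union(*queries):
            # A parameter that varies (or is sometimes missing) inside a
            # duplicate cluster is treated as irrelevant to the content.
            if len({q.get(key) for q in queries}) > 1:
                ignorable[host].add(key)
    return ignorable


def normalize(url, ignorable):
    """Rewrite a URL by dropping mined parameters and sorting the rest."""
    p = urlsplit(url)
    drop = ignorable.get(p.netloc, set())
    kept = sorted((k, v) for k, v in parse_qsl(p.query) if k not in drop)
    # The fragment is discarded because it is never sent to the server.
    return urlunsplit((p.scheme, p.netloc, p.path, urlencode(kept), ""))


# Hypothetical duplicate cluster: same page, different session ids.
clusters = [[
    "http://example.com/item?id=7&sid=abc",
    "http://example.com/item?id=7&sid=xyz",
]]
rules = mine_ignorable_params(clusters)
canonical = normalize("http://example.com/item?sid=q1&id=9", rules)
```

Two URLs that normalize to the same string can then be treated as duplicates without fetching either page. Keeping one such rule per host and parameter is what becomes costly at scale, which is the motivation the abstract gives for generalizing the mined rules.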
Year: 2010
DOI: 10.1145/1718487.1718535
Venue: WSDM
Keywords: recent previous approach, large scale evaluation, url string, transformation rule, rule extraction technique, mined rule, recent effort, webpage de-duplication, mapreduce framework, large number, learning url pattern, billion urls, search engine, decision trees, search engines, world wide web, decision tree, generalization, machine learning
Field: Data deduplication, Data mining, Decision tree, Information retrieval, Web page, Computer science, URL normalization, Search engine indexing, Rewrite engine, Semantic URL, Scalability
DocType: Conference
Citations: 20
PageRank: 0.75
References: 15
Authors: 6
Name                        Order  Citations  PageRank
Hema Swetha Koppula         1      1067       41.30
Krishna P. Leela            2      33         2.79
Amit Agarwal                3      28         1.35
Krishna Prasad Chitrapura   4      103        5.71
Sachin Garg                 5      808        75.97
Amit Sasturkar              6      174        8.17