Abstract | ||
---|---|---|
The presence of duplicate documents in the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and use these rules for de-duplication using just URL strings, without fetching the content explicitly. Our technique consists of mining the crawl logs and using clusters of similar pages to extract transformation rules, which are used to normalize URLs belonging to each cluster. Preserving each mined rule for de-duplication is not efficient due to the large number of such rules. We present a machine learning technique to generalize the set of rules, which reduces the resource footprint enough to be usable at web scale. The rule extraction techniques are robust against website-specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication. Experimental results demonstrate that our approach achieves 2 times more reduction in duplicates with only half the rules compared to the most recent previous approach. Scalability of the framework is demonstrated by performing a large-scale evaluation on a set of 3 billion URLs, implemented using the MapReduce framework. |
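
The idea of extracting a transformation rule from a cluster of duplicate pages and then normalizing URLs by string alone can be illustrated with a toy sketch. This is not the paper's algorithm: the only rule type modeled here is "drop a query parameter for a given host", the function names `mine_irrelevant_params` and `normalize` are hypothetical, and the example URLs are invented. It assumes the duplicate clusters have already been found by some content-based method.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def mine_irrelevant_params(clusters):
    """Toy rule miner (an assumption, not the paper's method).

    For each host, collect query parameters whose values differ between
    URLs that are already known to be duplicates of one another. Such a
    parameter cannot affect the content, so it is a candidate to drop.
    """
    rules = defaultdict(set)  # host -> parameter names to drop
    for cluster in clusters:
        seen = defaultdict(set)  # (host, param) -> values observed
        for url in cluster:
            parts = urlsplit(url)
            for key, value in parse_qsl(parts.query, keep_blank_values=True):
                seen[(parts.netloc, key)].add(value)
        for (host, key), values in seen.items():
            if len(values) > 1:
                rules[host].add(key)
    return rules


def normalize(url, rules):
    """Apply the mined rules to a URL string without fetching the page."""
    parts = urlsplit(url)
    drop = rules.get(parts.netloc, set())
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in drop
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Hypothetical cluster: two URLs known to serve the same content.
clusters = [[
    "http://example.com/item?id=7&sid=abc",
    "http://example.com/item?id=7&sid=xyz",
]]
rules = mine_irrelevant_params(clusters)

# Unseen URLs on the same host collapse to one normalized form.
a = normalize("http://example.com/item?id=9&sid=q1", rules)
b = normalize("http://example.com/item?id=9&sid=q2", rules)
print(a == b)
```

Keeping one rule per host and parameter is what makes the rule set grow with the crawl, which is the cost the abstract says the generalization step is meant to reduce.
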
Year | DOI | Venue |
---|---|---|
2010 | 10.1145/1718487.1718535 | WSDM |
Keywords | Field | DocType |
recent previous approach,large scale evaluation,url string,transformation rule,rule extraction technique,mined rule,recent effort,webpage de-duplication,mapreduce framework,large number,learning url pattern,billion urls,search engine,decision trees,search engines,world wide web,decision tree,generalization,machine learning | Data deduplication,Data mining,Decision tree,Information retrieval,Web page,Computer science,URL normalization,Search engine indexing,Rewrite engine,Semantic URL,Scalability | Conference |
Citations | PageRank | References |
20 | 0.75 | 15 |
Authors | ||
6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Hema Swetha Koppula | 1 | 1067 | 41.30 |
Krishna P. Leela | 2 | 33 | 2.79 |
Amit Agarwal | 3 | 28 | 1.35 |
Krishna Prasad Chitrapura | 4 | 103 | 5.71 |
Sachin Garg | 5 | 808 | 75.97 |
Amit Sasturkar | 6 | 174 | 8.17 |