Title
Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages
Abstract
The large number of duplicate and near-duplicate web pages on the Internet creates many problems for search engines. Currently, no single duplicate or near-duplicate web document detection algorithm achieves both good performance and good accuracy. In addition, most of them are designed to process English documents and cannot be used for Chinese documents. This paper presents an integrated algorithm, KMatch, for near-duplicate document detection on large-scale Chinese web pages. First, KMatch employs a Chinese segmentation algorithm to turn Chinese words into meaningful features and thereby compress documents. A keyword matching technique is then used to improve detection accuracy. To improve accuracy further, KMatch also incorporates the IMatch algorithm to filter out the noise content of a web document and retain the body text. To improve detection performance, we integrate the Shingling algorithm to compress huge datasets into smaller ones. Finally, to further improve detection performance on large-scale Chinese web pages, we design and implement a parallel version of the KMatch algorithm with MapReduce. The experimental results show that our approach achieves both high precision and high recall, and that the MapReduce-parallelized algorithm achieves good performance and scalability on large-scale datasets.
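The shingling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the KMatch implementation: the abstract does not give its details, so the sketch assumes overlapping character k-grams as a stand-in for segmented Chinese words, plain Jaccard similarity as the comparison measure, and a hypothetical threshold of 0.5. The function names `shingles`, `jaccard`, and `near_duplicate_pairs` are illustrative only.

```python
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-character shingles of text.

    Assumption: character k-grams stand in for the word-level features
    that a Chinese segmentation algorithm would produce.
    """
    cleaned = "".join(text.split())  # drop all whitespace
    if len(cleaned) < k:
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + k] for i in range(len(cleaned) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs: dict[str, str], threshold: float = 0.5, k: int = 3):
    """Return pairs of document ids whose shingle sets meet the threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    sigs = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(sigs), 2)
        if jaccard(sigs[x], sigs[y]) >= threshold
    ]
```

The all-pairs comparison here is quadratic in the number of documents, which is the kind of cost that a parallel MapReduce formulation is meant to spread across machines; how KMatch partitions that work is not described in this record.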
Year
2012
DOI
10.1109/PDCAT.2012.108
Venue
PDCAT
Keywords
imatch algorithm,detection performance,imatch algorithms,mapreduce,large scale,large scale chinese web pages,chinese documents,web document detection algorithms,chinese document,chinese segmentation algorithm,kmatch algorithm,near-duplicate document detection,data compression,large scale web documents,shingling algorithm,parallelized near-duplicate document detection,web sites,internet,shingling,good performance,large scale chinese web,natural language processing,kmatch,keywords matching,english documents,document handling,chinese word,document compression,search engines,imatch,distributed processing,chinese web pages,parallelized near-duplicate document detection algorithm,chinese web page
Field
Web page,Information retrieval,Shingling,Computer science,Segmentation,Precision and recall,Algorithm,Data compression,Body text,The Internet,Scalability
DocType
Conference
ISBN
978-0-7695-4879-1
Citations
0
PageRank
0.34
References
17
Authors
4
Name              Order    Citations    PageRank
Yongzhuang Wei    1        69           16.94
Shuai Wang        2        20           12.04
Chunfeng Yuan     3        418          30.84
Yihua Huang       4        0            0.34