Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages - Citegraph

Paper Info

Title
Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages

Abstract
The large number of duplicate and near-duplicate web pages on the Internet creates many problems for search engines. No single existing duplicate or near-duplicate document detection algorithm achieves both good performance and good accuracy, and most are designed for English documents and cannot be applied to Chinese documents. This paper presents KMatch, an integrated algorithm for near-duplicate document detection on large-scale Chinese web pages. First, KMatch applies a Chinese word segmentation algorithm to turn Chinese text into meaningful word features, which compresses the documents. A keyword matching technique is then used to improve detection accuracy. To improve accuracy further, KMatch incorporates the IMatch algorithm to filter out the noise content of a web document and retain the body text. To improve detection performance, we integrate the Shingling algorithm to compress huge datasets into smaller ones. Finally, to further improve detection performance on large-scale Chinese web pages, we design and implement a parallel version of KMatch with MapReduce. Experimental results show that our approach achieves both high precision and high recall, and that the MapReduce parallelization achieves good performance and scalability on large-scale datasets.
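
The shingling step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the KMatch implementation: it assumes documents have already been segmented into word tokens, and the function names, the shingle width `w`, and the similarity `threshold` are hypothetical choices for illustration. It shows only the generic idea of comparing documents by the Jaccard overlap of their w-token shingle sets.

```python
from itertools import combinations


def shingles(tokens, w=3):
    """Return the set of contiguous w-token shingles of a token list."""
    if len(tokens) < w:
        # Short documents yield a single shingle holding all of their tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, w=3, threshold=0.5):
    """Return id pairs whose shingle sets overlap at or above the threshold.

    `docs` maps a document id to its list of word tokens (assumed to be
    produced by a word segmenter that is outside this sketch).
    """
    sets = {doc_id: shingles(tokens, w) for doc_id, tokens in docs.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(sets), 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]
```

The all-pairs comparison here is quadratic in the number of documents, which is the kind of cost that motivates compressing the data and distributing the work, as the abstract describes for the MapReduce version.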

Year	2012
DOI	10.1109/PDCAT.2012.108
Venue	PDCAT
Keywords	imatch algorithm,detection performance,imatch algorithms,mapreduce,large scale,large scale chinese web pages,chinese documents,web document detection algorithms,chinese document,chinese segmentation algorithm,kmatch algorithm,near-duplicate document detection,data compression,large scale web documents,shingling algorithm,parallelized near-duplicate document detection,web sites,internet,shingling,good performance,large scale chinese web,natural language processing,kmatch,keywords matching,english documents,document handling,chinese word,document compression,search engines,imatch,distributed processing,chinese web pages,parallelized near-duplicate document detection algorithm,chinese web page
Field	Web page,Information retrieval,Shingling,Computer science,Segmentation,Precision and recall,Algorithm,Data compression,Body text,The Internet,Scalability
DocType	Conference
ISBN	978-0-7695-4879-1
Citations	0
PageRank	0.34
References	17
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Yongzhuang Wei	1	69	16.94
Shuai Wang	2	20	12.04
Chunfeng Yuan	3	418	30.84
Yihua Huang	4	0	0.34
