Title
A webpage deletion algorithm based on hierarchical filtering
Abstract
Duplicate webpages can degrade the user experience of a search engine. This paper proposes a webpage deletion algorithm based on hierarchical filtering, designed around the features of duplicate webpages. Webpage feature extraction is divided into three layers: paragraphs, sentences, and words. The webpage features are formed by filtering out redundant information layer by layer. In the sentence layer, sentences are extracted from each paragraph according to sentence semantics; in the word layer, the sentences are denoised by filtering based on statistics of the parts of speech they contain. The algorithm improves the noise immunity of feature extraction and its coverage of the original content. Experiments show that the proposed method can accurately filter out duplicate webpages.
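The three layers described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract gives no detail on the semantic sentence selection or the part-of-speech statistics, so the sketch substitutes two simple stand-ins, both assumptions. The longest sentence of each paragraph stands in for the semantic selection, and a small hypothetical function-word list (`FUNCTION_WORDS`) stands in for the part-of-speech filter. All function names are illustrative, and the Jaccard comparison at the end is likewise an assumed way of comparing the resulting features.

```python
import re

# Hypothetical stand-in for the paper's part-of-speech statistics:
# a small set of function words that are treated as noise.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "and", "or",
    "to", "is", "are", "this", "that", "can",
}


def split_paragraphs(text):
    """Paragraph layer: split on blank lines and drop empty blocks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def pick_sentence(paragraph):
    """Sentence layer: keep one sentence per paragraph.

    Assumption: the longest sentence stands in for semantic selection.
    """
    sentences = [s.strip() for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    return max(sentences, key=len) if sentences else ""


def denoise(sentence):
    """Word layer: lowercase, tokenize, and drop function words."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [w for w in words if w not in FUNCTION_WORDS]


def page_features(text):
    """Build the page feature set, one denoised sentence per paragraph."""
    features = set()
    for paragraph in split_paragraphs(text):
        words = denoise(pick_sentence(paragraph))
        if words:
            features.add(" ".join(words))
    return features


def similarity(page_a, page_b):
    """Jaccard similarity of the two pages' feature sets (assumed metric)."""
    fa, fb = page_features(page_a), page_features(page_b)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0
```

Two pages whose similarity exceeds a chosen threshold would then be treated as duplicates; the threshold is a tuning parameter that the abstract does not specify.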
Year
2012
DOI
10.1007/978-3-642-33469-6_68
Venue
WISM
Keywords
feature extraction, sentence semantics, webpage feature, sentence layer paragraph sentence, webpage deletion algorithm, webpage feature extraction, duplicate webpage, duplicate webpages, word layer
Field
Search engine, Web page, Information retrieval, Computer science, Algorithm, Filter (signal processing), Feature extraction, Part of speech, Paragraph, Sentence, Semantics
DocType
Conference
Citations
0
PageRank
0.34
References
6
Authors
4
Name          Order  Citations  PageRank
Xunxun Chen   1      5          5.49
Wei Wang      2      4          2.78
Dapeng Man    3      29         10.54
Sichang Xuan  4      0          0.34