Title
Statistical cross-language Web content quality assessment
Abstract
Cross-language Web content quality assessment plays an important role in many Web content processing applications. In previous research, statistical systems based on natural language processing, heuristic content, and term frequency-inverse document frequency (TF-IDF) features have proven effective for Web content quality assessment. However, these features are language-dependent and therefore unsuitable for cross-language ranking. This paper proposes a cross-language Web content quality assessment method. First, multi-modal language-independent features are extracted. The extracted features include character features, domain registration features, two-layer hyperlink analysis features, and third-party Web service features. All the extracted features are then fused. Feature selection is carried out on the fused features to obtain a new eigenspace. Finally, a cross-language Web content quality model is learned on this eigenspace. Experiments on the ECML/PKDD 2010 Discovery Challenge cross-language datasets demonstrate that features at every scale are discriminative, that the different feature modalities complement each other, and that feature selection is effective for cross-language Web content quality assessment based on statistical learning.
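The pipeline described in the abstract (extract per-modality features, fuse them, select features, learn a model) can be illustrated with a small sketch. The abstract does not say which fusion rule, selection criterion, or learner the authors use, so everything below is an assumption made for illustration: fusion is plain concatenation, selection uses a Fisher-style score, the learner is a nearest-centroid classifier, and the feature values and the names `fuse`, `select_top_k`, `fit_centroids`, and `predict` are hypothetical.

```python
from statistics import mean, pstdev


def fuse(*modalities):
    """Early fusion (assumed): concatenate each page's vectors across modalities."""
    return [[x for vec in rows for x in vec] for rows in zip(*modalities)]


def fisher_scores(X, y):
    """Score each column: squared class-mean gap over summed class variances."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = pstdev(pos) ** 2 + pstdev(neg) ** 2
        scores.append((mean(pos) - mean(neg)) ** 2 / (spread + 1e-9))
    return scores


def select_top_k(X, y, k):
    """Return the (sorted) indices of the k highest-scoring columns."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return sorted(ranked[:k])


def project(X, keep):
    """Keep only the selected columns of every row."""
    return [[row[j] for j in keep] for row in X]


def fit_centroids(X, y):
    """Stand-in learner (assumed): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [r for r, l in zip(X, y) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, row):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )


# Toy, made-up data: two modalities for four pages (1 = high quality, 0 = low).
char_feats = [[0.9, 5.0], [0.8, 5.1], [0.2, 5.0], [0.1, 4.9]]
link_feats = [[10.0], [12.0], [1.0], [2.0]]
labels = [1, 1, 0, 0]

X = fuse(char_feats, link_feats)
keep = select_top_k(X, labels, k=2)
model = fit_centroids(project(X, keep), labels)

new_page = fuse([[0.7, 5.0]], [[9.0]])
prediction = predict(model, project(new_page, keep)[0])
```

On this toy data the second character feature barely differs between classes, so it receives the lowest score and is the one dropped by selection. None of the features depends on the page's language, which is the property the abstract emphasizes.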
Year: 2012
DOI: 10.1016/j.knosys.2012.05.018
Venue: Knowl.-Based Syst.
Keywords: discovery challenge cross-language datasets, third-party web service feature, feature selection, web content processing application, cross-language web content quality, web content quality assessment, heuristic content, cross-language ranking, character feature, assessment method, feature extraction, web spam
Field: Data mining, Heuristic, Feature selection, Information retrieval, Ranking, Computer science, Feature extraction, Hyperlink, Web service, Web content, Spamdexing
DocType: Journal
Volume: 35
ISSN: 0950-7051
Citations: 2
PageRank: 0.38
References: 24
Authors: 5
Name | Order | Citations | PageRank
Guanggang Geng | 1 | 141 | 20.78
Liming Wang | 2 | 13 | 8.75
Wei Wang | 3 | 271 | 25.20
An-Lei Hu | 4 | 9 | 2.18
Shuo Shen | 5 | 38 | 3.72