Abstract | ||
---|---|---|
This paper shows that the accuracy of webpage classifiers can be improved by extracting meaningful strings with an unsupervised clustering method. Webpage classification differs from traditional document classification in that its word and phrase features are irregular, massive in number, and unlabeled. To cope with these features, we introduce two scenarios for extracting meaningful strings, one based on document clustering and one on term clustering, each using multiple strategies to optimize a vector space model (VSM). First, candidate strings are used to build the VSM; then, each of the two scenarios is performed on the VSM; finally, meaningful strings are extracted from each cluster. These meaningful strings can therefore represent documents comprehensively. The proposed method has been applied to webpage document classification, and the results show that document clustering works better than term clustering. They also demonstrate that spectral clustering outperforms k-means in the document clustering case but reduces performance in the term clustering case. Overall, some experiments indicate that the best overall performance is obtained by combining spectral clustering with document clustering. |
Year | DOI | Venue |
---|---|---|
2011 | 10.1109/NLPKE.2011.6138182 | NLPKE |
Keywords | DocType | Volume |
term clustering,pattern clustering,vector space model,pattern classification,web page classification,document classification,spectral clustering method,webpage classification,k-means,internet,spectral clustering,document clustering,unsupervised clustering method,document handling,string extraction algorithm,computer model,dictionaries,computational modeling,k means | Conference | null |
Issue | ISBN | Citations |
null | 978-1-61284-729-0 | 0 |
PageRank | References | Authors |
0.34 | 8 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Jie Chen | 1 | 2487 | 353.65 |
Jian Li | 2 | 811 | 52.97 |
Hao Liao | 3 | 51 | 5.37 |
Qingsheng Yuan | 4 | 2 | 2.12 |
Xiuguo Bao | 5 | 14 | 4.99 |
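
The three steps named in the abstract (build a VSM from candidate strings, cluster it, extract strings from each cluster) can be illustrated with a minimal sketch of the document clustering scenario. This is not the authors' implementation: the toy documents, the smoothed TF-IDF weighting, the farthest-point initialisation, and the function names `build_vsm`, `kmeans`, and `meaningful_strings` are all assumptions chosen for illustration, and only plain k-means is shown (the spectral clustering variant is omitted).

```python
import math
from collections import Counter


def build_vsm(docs, vocabulary):
    """Return one L2-normalised TF-IDF vector per document (assumed weighting)."""
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + sum(t in d for d in docs))) + 1
           for t in vocabulary}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vectors


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every document to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def meaningful_strings(centroids, vocabulary, top_n=2):
    """Pick the highest-weighted candidate strings of each cluster centroid."""
    result = []
    for centroid in centroids:
        ranked = sorted(zip(vocabulary, centroid), key=lambda p: -p[1])
        result.append([t for t, w in ranked[:top_n] if w > 0])
    return result


# Hypothetical candidate strings already extracted from four webpages.
docs = [
    ["football", "league", "goal"],
    ["football", "goal", "match"],
    ["stock", "market", "shares"],
    ["market", "shares", "bank"],
]
vocabulary = sorted({t for d in docs for t in d})

vectors = build_vsm(docs, vocabulary)
labels, centroids = kmeans(vectors, k=2)
strings = meaningful_strings(centroids, vocabulary)
```

In the term clustering scenario the same routine would presumably be run on the transposed matrix, so that each row describes a candidate string by the documents it occurs in rather than a document by its strings.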