Title
An Improved Genetic Algorithm for Document Clustering with Semantic Similarity Measure
Abstract
This paper proposes a self-organized genetic algorithm for document clustering based on a semantic similarity measure. Traditionally, a document is represented as a string of words, and the conceptual similarity between words is ignored. We use a thesaurus-based ontology to overcome this problem. To investigate how the ontology can be used effectively in document clustering, we implement a hybrid strategy that combines the thesaurus-based semantic similarity measure with the vector space model (VSM) measure to assess similarity between documents more accurately. Considering the interplay between population diversity and selective pressure, we also put forward dynamic evolution operators. In our experiments, two data sets of 200 and 600 documents were drawn from the Reuters-21578 corpus. The results show that the genetic algorithm with the hybrid semantic strategy (the thesaurus-based measure combined with the VSM-based measure) outperforms the same algorithm with the VSM measure alone. Our clustering algorithm also improves precision and recall compared with k-means under the same similarity settings.
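The abstract does not state how the thesaurus-based and VSM-based measures are combined, so the following is only a minimal sketch under stated assumptions: a linear weighting with a hypothetical parameter `alpha`, a hypothetical toy thesaurus `TOY_THESAURUS` standing in for a real resource such as WordNet, and cosine similarity used for both components. None of these names or choices come from the paper.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus (assumption): maps a word to a concept label.
# A real system would consult a resource such as WordNet instead.
TOY_THESAURUS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "bank": "finance", "loan": "finance", "credit": "finance",
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def term_vector(tokens):
    """Plain VSM representation: raw term counts."""
    return Counter(tokens)


def concept_vector(tokens, thesaurus=TOY_THESAURUS):
    """Concept-level representation: counts of thesaurus concepts."""
    return Counter(thesaurus[t] for t in tokens if t in thesaurus)


def hybrid_similarity(tokens_a, tokens_b, alpha=0.5):
    """Assumed linear blend: alpha * VSM cosine + (1 - alpha) * concept cosine."""
    vsm = cosine(term_vector(tokens_a), term_vector(tokens_b))
    sem = cosine(concept_vector(tokens_a), concept_vector(tokens_b))
    return alpha * vsm + (1 - alpha) * sem


doc_a = ["car", "loan"]
doc_b = ["automobile", "credit"]
print(hybrid_similarity(doc_a, doc_b))             # blended score
print(hybrid_similarity(doc_a, doc_b, alpha=1.0))  # VSM only
```

In this toy case the two documents share no words, so the VSM component alone scores them as unrelated, while the concept-level component recognizes that they cover the same concepts. This illustrates the kind of gap the hybrid strategy is meant to close, without claiming to reproduce the paper's actual formula.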
Year: 2008
DOI: 10.1109/ICNC.2008.374
Venue: ICNC
Keywords: document clustering, genetic algorithm, semantic similarity measure, conceptual similarity, sole vsm measure, thesaurus-based measure, similarity environment, improved genetic algorithm, thesaurus-based semantic similarity measure, vsm-based measure, clustering algorithm, clustering algorithms, algorithm design and analysis, self organization, wordnet, clustering, gallium, genetic algorithms, k means, vector space model, semantic similarity
Field: Semantic similarity, Population, Fuzzy clustering, Data mining, Computer science, Document clustering, Precision and recall, Vector space model, Cluster analysis, WordNet
DocType: Conference
Citations: 0
PageRank: 0.34
References: 9
Authors: 2
Name | Order | Citations | PageRank
Wei Song | 1 | 113 | 15.51
Soon Cheol Park | 2 | 197 | 14.78