Title
Genetic algorithm for text clustering based on latent semantic indexing
Abstract
In this paper, we develop a genetic algorithm method based on a latent semantic model (GAL) for text clustering. The main difficulty in applying genetic algorithms (GAs) to document clustering is the thousands or even tens of thousands of dimensions in the feature space, which is typical of textual data. This dimensionality arises because the most straightforward and popular approach represents texts with the vector space model (VSM), in which each unique term in the vocabulary is one dimension. Latent semantic indexing (LSI) is a successful technology in information retrieval that attempts to explore the latent semantics implied by a query or a document by representing them in a dimension-reduced space. LSI also takes into account the effects of synonymy and polysemy, and thereby constructs a semantic structure in textual data. GAs are search techniques that can efficiently evolve the optimal solution in the reduced space. We propose a variable-string-length genetic algorithm that automatically evolves the proper number of clusters and provides near-optimal clustering of the data set. The GA can be used in conjunction with the reduced latent semantic structure to improve clustering efficiency and accuracy. The superiority of the GAL approach over a conventional GA applied in the VSM is demonstrated by good clustering results on Reuters documents.
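The dimension-reduction step described in the abstract can be illustrated with a minimal sketch. It assumes LSI is computed as a truncated singular value decomposition of a term-document matrix, and it assumes a GA chromosome is decoded as a variable-length list of cluster centers in the reduced space. The function names `lsi_project` and `assign_clusters` and the toy matrix are hypothetical and are not taken from the paper; the GA operators themselves are not shown.

```python
import numpy as np

def lsi_project(term_doc, k):
    """Map documents into a k-dimensional latent space via truncated SVD."""
    # term_doc has shape (n_terms, n_docs): the VSM representation.
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; each returned row is one document.
    return (np.diag(s[:k]) @ vt[:k, :]).T

def assign_clusters(docs, centers):
    """Label each document with the index of its nearest center."""
    # centers may hold any number of rows, mirroring a variable-length
    # chromosome (an assumption about the encoding, not the paper's spec).
    dists = np.linalg.norm(docs[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Toy term-document matrix with made-up counts: 5 terms x 4 documents.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [1, 1, 1, 1],
], dtype=float)

docs_k = lsi_project(A, k=2)        # 4 documents in a 2-dimensional space
centers = docs_k[[0, 2]]            # one candidate solution with 2 clusters
labels = assign_clusters(docs_k, centers)
print(docs_k.shape, labels)
```

A GA working in this setting would score each candidate list of centers with a cluster-validity measure and evolve both the centers and their number. Operating on the k-dimensional rows instead of the full vocabulary is what keeps each fitness evaluation cheap.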
Year: 2009
DOI: 10.1016/j.camwa.2008.10.010
Venue: Computers & Mathematics with Applications
Keywords: document clustering, genetic algorithm, document representation model, feature space, textual data, latent semantic model, text clustering, latent semantics, latent semantic indexing, clustering efficiency, dimension-reduced space, information retrieval, semantic model, vector space model
Field: Data mining, Canopy clustering algorithm, Clustering high-dimensional data, CURE data clustering algorithm, Correlation clustering, Document clustering, Computer science, Probabilistic latent semantic analysis, Constrained clustering, Cluster analysis
DocType: Journal
Volume: 57
Issue: 11-12
ISSN: Computers and Mathematics with Applications
Citations: 27
PageRank: 1.07
References: 13
Authors: 2
Name | Order | Citations | PageRank
Wei Song | 1 | 113 | 15.51
Soon Cheol Park | 2 | 197 | 14.78