Title
Exploiting Wikipedia as external knowledge for document clustering
Abstract
Traditional text clustering methods represent documents as "bags of words" and ignore the semantic information in each document. For instance, if two documents use different sets of core words to describe the same topic, they may be wrongly assigned to different clusters because they share no core words, even though those words are probably synonyms or otherwise semantically related. The most common solution is to enrich the document representation with background knowledge from an ontology. This approach has two major issues: (1) the coverage of the ontology is limited, even for WordNet or MeSH, and (2) using ontology terms as replacement or additional features may cause information loss or introduce noise. In this paper, we present a novel text clustering method that addresses these two issues by enriching the document representation with Wikipedia concept and category information. We develop two approaches, exact match and relatedness match, to map text documents to Wikipedia concepts and, in turn, to Wikipedia categories. The documents are then clustered using a similarity metric that combines document content information, concept information, and category information. Experimental results with the proposed clustering framework on three datasets (20-newsgroup, TDT2, and LA Times) show that enriching the document representation with Wikipedia concepts and categories significantly improves clustering performance.
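The idea of combining content, concept, and category information can be illustrated with a minimal sketch. The abstract does not give the exact form of the similarity metric, so the weighted sum of cosine similarities below is an assumption, as are the weights `alpha` and `beta`, the function names, and the two toy lookup tables that stand in for Wikipedia. The word-level lookup is only a rough stand-in for the exact-match mapping.

```python
import math
from collections import Counter

# Hypothetical toy lookup tables standing in for Wikipedia; not real data.
PHRASE_TO_CONCEPT = {
    "car": "Car",
    "automobile": "Car",
    "engine": "Engine",
    "motor": "Engine",
}
CONCEPT_TO_CATEGORIES = {
    "Car": ["Vehicles"],
    "Engine": ["Machines"],
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def represent(text):
    """Build word, concept, and category count vectors for one document."""
    words = Counter(text.lower().split())
    # Word-level dictionary lookup: a simplified stand-in for exact match.
    concepts = Counter()
    for word, count in words.items():
        if word in PHRASE_TO_CONCEPT:
            concepts[PHRASE_TO_CONCEPT[word]] += count
    # Each concept contributes its count to every category it belongs to.
    categories = Counter()
    for concept, count in concepts.items():
        for category in CONCEPT_TO_CATEGORIES.get(concept, []):
            categories[category] += count
    return words, concepts, categories


def combined_similarity(doc_a, doc_b, alpha=0.5, beta=0.5):
    """Assumed form: word similarity plus weighted concept and category terms."""
    words_a, concepts_a, cats_a = represent(doc_a)
    words_b, concepts_b, cats_b = represent(doc_b)
    return (
        cosine(words_a, words_b)
        + alpha * cosine(concepts_a, concepts_b)
        + beta * cosine(cats_a, cats_b)
    )
```

With these toy tables, `combined_similarity("car engine", "automobile motor")` is nonzero even though the two strings share no words, which is the failure case of the plain bag-of-words representation described in the abstract.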
Year
2009
DOI
10.1145/1557019.1557066
Venue
KDD
Keywords
document clustering, core word, text document, wikipedia concept, category information, document content information, concept information, wikipedia category, document representation, information loss, external knowledge, semantic information, exploiting wikipedia, wikipedia, clustering algorithms, bag of words, text clustering
Field
Ontology, Data mining, Infobox, Document clustering, Computer science, Explicit semantic analysis, Artificial intelligence, Natural language processing, WordNet, Cluster analysis, Information retrieval, Synonym, Document representation
DocType
Conference
Citations
155
PageRank
4.23
References
11
Authors
5
Name            Order  Citations  PageRank
Xiaohua Hu      1      2819       314.15
Xiaodan Zhang   2      429        22.61
Caimei Lu       3      288        13.01
E. K. Park      4      233        9.92
Xiaohua Zhou    5      438        25.82