Title
Word Cloud Model for Text Categorization
Abstract
In centroid-based classification, each class is represented by a prototype, or centroid, document vector formed by averaging all member vectors during the training phase. In the prediction phase, a test document vector is assigned the label of its nearest class prototype. Recently there has been renewed interest in reformulating the prototype/centroid to improve classification performance. In this paper, we study the theoretical properties of the recently proposed Class Feature Centroid (CFC) classifier by considering the rate of change of each prototype vector with respect to individual dimensions (terms). Our theoretical finding implies that CFC is inherently biased towards large (dominant majority) classes, which means it is bound to perform poorly on highly class-imbalanced data. Another practical concern about CFC is its overly aggressive design, which weeds out terms that appear in all classes. To overcome these limitations while retaining the worthwhile design goals of CFC, we propose an improved and robust centroid-based classifier that uses precise term-class distribution properties instead of the simple presence or absence of terms in classes. Specifically, terms are weighted based on the Kullback-Leibler (KL) divergence measure between pairs of class-conditional term probabilities; we call this the CFC-KL centroid classifier. We then generalize CFC-KL to handle multi-class data by summing pairwise class-conditioned word probability ratios. Our proposed approach has been evaluated on 5 datasets, on which it consistently outperforms CFC and the baseline Support Vector Machine classifier. We also devise a word cloud visualization approach to highlight the important class-specific words picked out by CFC-KL, and visually compare it with other popular term weighting approaches.
Our encouraging results show that the centroid-based generalized CFC-KL classifier is both robust and efficient in dealing with real-world text classification.
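The two ideas in the abstract, averaging member vectors into class centroids and weighting terms by how differently they are distributed across classes, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the toy data, cosine similarity as the "nearest" measure, and the smoothing constant are hypothetical, and kl_term_weights is a generic pairwise KL-style weight, not the exact CFC-KL formula, which the abstract does not state.

```python
import math
from collections import defaultdict

def train_centroids(docs, labels):
    """Average the sparse term vectors of each class into one centroid."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, label in zip(docs, labels):
        counts[label] += 1
        for term, weight in vec.items():
            sums[label][term] += weight
    return {
        label: {t: total / counts[label] for t, total in terms.items()}
        for label, terms in sums.items()
    }

def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(centroids, vec):
    """Assign the label of the most similar class centroid."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

def kl_term_weights(centroids, eps=1e-9):
    """Illustrative KL-style weights (an assumption, not the paper's formula):
    for class c and term t, sum over the other classes o of
    p(t|c) * log(p(t|c) / (p(t|o) + eps))."""
    probs = {}
    for label, centroid in centroids.items():
        total = sum(centroid.values())
        probs[label] = {t: w / total for t, w in centroid.items()}
    return {
        c: {
            t: sum(p * math.log(p / (probs[o].get(t, 0.0) + eps))
                   for o in probs if o != c)
            for t, p in p_c.items()
        }
        for c, p_c in probs.items()
    }

# Toy term-frequency vectors (hypothetical data).
docs = [
    {"ball": 2, "goal": 1},
    {"ball": 1, "team": 1},
    {"vote": 2, "law": 1},
    {"vote": 1, "party": 1, "team": 1},
]
labels = ["sports", "sports", "politics", "politics"]

centroids = train_centroids(docs, labels)
weights = kl_term_weights(centroids)
```

In this toy data, "team" occurs in both classes while "ball" occurs only in sports documents, so the sketch gives "ball" a larger sports weight than "team". A shared term is down-weighted rather than discarded outright, which is the behaviour the abstract contrasts with CFC's removal of terms that appear in all classes.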
Year
2011
DOI
10.1109/ICDM.2011.156
Venue
ICDM
Keywords
classification,data mining,data visualisation,probability,text analysis,CFC classifier,CFC-KL centroid classifier,Kullback-Leibler divergence measure,centroid-based classification,class feature centroid,class-conditional term probability,pair wise class-conditioned word probability ratio,term-class distribution,test document vector,text categorization,text classification,word cloud model,word cloud visualization approach,Centroid-based Classification,Text Categorization
Field
Data mining,Divergence,Computer science,Support vector machine classifier,Artificial intelligence,Classifier (linguistics),Text categorization,Data visualization,Pattern recognition,Visualization,Tag cloud,Machine learning,Centroid
DocType
Conference
Citations
3
PageRank
0.41
References
13
Authors
3
Name, Order, Citations, PageRank
Tam T. Nguyen, 1, 78, 6.79
Kuiyu Chang, 2, 917, 60.50
Siu Cheung Hui, 3, 1106, 86.71