Title
Classifying High-Dimensional Text and Web Data Using Very Short Patterns
Abstract
In this paper, we propose the "Democratic Classifier", a simple pattern-based classification algorithm that uses very short patterns and does not rely on a minimum support threshold. Borrowing ideas from democracy, our training phase allows each training instance to vote for an equal number of candidate size-2 patterns. The training instances select patterns by balancing the local, class, and global significance of each pattern. The selected patterns are simultaneously added to the model for all applicable classes, and a novel power-law-based weighting scheme adjusts their weights with respect to each class. Results of experiments performed on 121 common text and web datasets show that our algorithm almost always outperforms state-of-the-art classification algorithms, without any parameter tuning. On 100 real-life web datasets, the average absolute classification accuracy improvement was as great as 9.4% over SVM, Harmony, C4.5 and KNN. Our algorithm also ran about 3.5 times faster than the fastest existing pattern-based classification algorithm.
Year
2008
DOI
10.1109/ICDM.2008.139
Venue
ICDM
Keywords
fastest existing pattern-based classification,art classification algorithm,training instances select pattern,training phase,simple pattern-based classification algorithm,applicable class,real-life web datasets,training instance,web datasets,web data,classifying high-dimensional text,average absolute classification accuracy,short patterns,accuracy,power law,text analysis,internet,support vector machines,classification algorithms,data mining,feature selection,learning artificial intelligence,classification,computer science
Field
Data mining,Text mining,Feature selection,Pattern recognition,Computer science,Support vector machine,Artificial intelligence,Almost surely,Statistical classification,Classifier (linguistics),Machine learning,The Internet
DocType
Conference
ISSN
1550-4786
Citations
5
PageRank
0.41
References
21
Authors
2
Name
Order
Citations
PageRank
Hassan H. Malik1775.10
John R. Kender2627138.04