Title
Free-gram phrase identification for modeling Chinese text
Abstract
The vector space model with a bag-of-phrases representation plays an important role in modeling Chinese text. However, the conventional way of identifying free-length phrases, scanning with fixed-length grams, is costly. To address this problem, we propose a novel approach to key phrase identification that is capable of identifying phrases of all lengths and thus improves the coding efficiency and discrimination of the data representation. In the proposed method, we first convert each document into a context graph, a directed graph that encapsulates the statistical and positional information of all the 2-word strings in the document. We treat every transmission path in the graph as a hypothesis for a phrase, and select the corresponding phrase as a candidate phrase if the hypothesis is valid in the original document. Finally, we selectively divide some of the complex candidate phrases into sub-phrases to improve the coding efficiency, resulting in a set of phrases for codebook construction. Experiments on both balanced and unbalanced datasets show that the codebooks generated by our approach are more efficient than those generated by conventional methods (one syntactical method and three statistical methods are investigated). Furthermore, the data representation created by our approach demonstrates higher discrimination than those created by conventional methods in a classification task.
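The first two steps described in the abstract (building a graph from 2-word strings, then checking each path against the original document) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `min_count` threshold, and the `max_len` guard are all assumptions made for illustration, and the final step of dividing complex candidates into sub-phrases is not shown. The input is assumed to be an already segmented list of words.

```python
from collections import defaultdict


def build_context_graph(tokens):
    """Map each 2-word string (a, b) to the set of positions where it starts.

    The keys act as directed edges a -> b; the position sets carry the
    statistical (count) and positional information for each edge.
    """
    edges = defaultdict(set)
    for i in range(len(tokens) - 1):
        edges[(tokens[i], tokens[i + 1])].add(i)
    return edges


def candidate_phrases(tokens, min_count=2, max_len=5):
    """Return {phrase tuple: occurrence count} for paths valid in the document.

    min_count and max_len are hypothetical parameters for this sketch only.
    """
    edges = build_context_graph(tokens)

    # Adjacency view of the graph: word -> set of words that follow it.
    successors = defaultdict(set)
    for a, b in edges:
        successors[a].add(b)

    found = {}
    # Start from every sufficiently frequent edge and extend paths depth-first.
    stack = [(pair, starts) for pair, starts in edges.items()
             if len(starts) >= min_count]
    while stack:
        path, starts = stack.pop()
        found[path] = len(starts)
        if len(path) == max_len:
            continue
        offset = len(path) - 1  # distance from phrase start to its last word
        for nxt in successors.get(path[-1], ()):
            # The hypothesis holds only where the next edge really continues
            # an occurrence of the current path in the document.
            kept = {s for s in starts if s + offset in edges[(path[-1], nxt)]}
            if len(kept) >= min_count:
                stack.append((path + (nxt,), kept))
    return found


if __name__ == "__main__":
    # Single characters stand in for segmented words.
    print(candidate_phrases(list("abcxabcyabc")))
```

Tracking start positions along each path is what separates a path that merely exists in the graph from a word sequence that actually occurs in the text, so phrases of any length up to the guard come out of a single traversal rather than one scan per gram size.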
Year: 2013
DOI: 10.1016/j.ipl.2012.11.005
Venue: Inf. Process. Lett.
Keywords: novel approach, coding efficiency, key phrase identification, conventional method, complex candidate phrase, Chinese text, free-gram phrase identification, free-length phrase, context graph, candidate phrase, corresponding phrase, data representation, information retrieval, sparse coding
Field: Algorithmic efficiency, External Data Representation, Phrase search, Pattern recognition, Computer science, Neural coding, Directed graph, Phrase, Natural language processing, Artificial intelligence, Vector space model, Codebook
DocType: Journal
Volume: 113
Issue: 4
ISSN: 0020-0190
Citations: 2
PageRank: 0.36
References: 23
Authors: 5
Name            Order  Citations  PageRank
Xi Peng         1      447        23.84
Zhang Yi        2      1765       194.41
Xiao-Yong Wei   3      2          0.36
Dezhong Peng    4      285        27.92
Yongsheng Sang  5      5          1.09