Title
Key Information Expansion Applied in Spoken Document Classification based on Lattice.
Abstract
Traditionally, query words or key words in spoken document classification are generated manually. In this paper, a new hybrid feature for key information extraction is proposed, based on CHI-square, TFIDF and maximum posterior probability (MPP) features. It combines the advantages of these three features, and the weight of each word in the hybrid feature can be further integrated into the classification system. Here, the weights of key words reveal, to some extent, the relationship between words and topic. Furthermore, when the query words or key words are insufficient, a key information expansion step based on focus score can be added to uncover latent information about the topic. In this expansion step, not only the documents in which key words occur but also the documents containing no key words participate in the expansion procedure. Additionally, the classification system adopts document length as prior information when no query is found. The whole classification system is based on lattices, which carry more information than the 1-best result of a speech recognition system. Among CHI-square, TFIDF and MPP, MPP performs slightly worse than the other two, and CHI-square performs slightly better than TFIDF as the number of key words increases. Among these features, the hybrid feature obtains the best performance in almost all cases under the same conditions. Combining document length information further enhances classification performance, especially when little key information is available. Experiments show that when the system combines weight and document length information, the hybrid feature obtains the best performance, with a MAP of 0.7817 using 50 key words. When key information is insufficient, key information expansion improves system performance in the tested settings of only 1, 5, or 10 key words.
In the proposed key information expansion approach, a focus factor is introduced to adjust the effect of documents with no key words, so some empty words can be avoided to some extent and the number of expansion words can be kept under control. © 2011 ACADEMY PUBLISHER.
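The abstract names CHI-square and TFIDF as two of the three inputs to the hybrid feature but does not give the combination rule. The following is a minimal sketch, assuming plain-text documents already split into words, of how the two standard scores can be computed per word for one topic. The toy corpus, the function names, and the `hybrid_scores` combination (an equal-weight average of min-max normalized scores) are hypothetical illustrations, not the authors' method; the MPP feature is omitted because it needs lattice posteriors, which are not described here.

```python
import math

def chi_square(term, docs, labels, topic):
    """Chi-square statistic between a term and a topic (2x2 contingency table)."""
    a = b = c = d = 0
    for words, label in zip(docs, labels):
        has_term = term in words
        in_topic = label == topic
        if has_term and in_topic:
            a += 1  # term present, document on topic
        elif has_term:
            b += 1  # term present, document off topic
        elif in_topic:
            c += 1  # term absent, document on topic
        else:
            d += 1  # term absent, document off topic
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def tfidf(term, docs, labels, topic):
    """Term frequency inside the topic's documents times inverse document frequency."""
    tf = sum(words.count(term) for words, label in zip(docs, labels) if label == topic)
    df = sum(1 for words in docs if term in words)
    return tf * math.log(len(docs) / df) if df else 0.0

def min_max(scores):
    """Rescale a {term: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {t: (s - lo) / (hi - lo) if hi > lo else 0.0 for t, s in scores.items()}

def hybrid_scores(docs, labels, topic):
    """Hypothetical hybrid: equal-weight average of normalized scores (an assumption)."""
    vocab = {w for words in docs for w in words}
    chi = min_max({t: chi_square(t, docs, labels, topic) for t in vocab})
    tfi = min_max({t: tfidf(t, docs, labels, topic) for t in vocab})
    return {t: 0.5 * (chi[t] + tfi[t]) for t in vocab}

# Toy corpus (hypothetical): each document is a list of recognized words.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "score"],
    ["stock", "market", "price"],
    ["market", "price", "team"],
]
labels = ["sports", "sports", "finance", "finance"]

scores = hybrid_scores(docs, labels, "sports")
for term, weight in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{weight:.3f}")
```

Note that chi-square is symmetric: a word that is strongly associated with the other topic also receives a high value, which is one reason a combination with a frequency-based score such as TFIDF can be useful when ranking candidate key words.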
Year
2011
DOI
10.4304/jcp.6.5.923-930
Venue
JCP
Keywords
document length, hybrid feature, key information extraction, lattice, spoken document classification
Field
Document classification, Lattice (order), Information retrieval, tf–idf, Pattern recognition, Computer science, Natural key, A little better, Information extraction, Artificial intelligence
DocType
Journal
Volume
6
Issue
5
Citations
1
PageRank
0.36
References
11
Authors
3
Name            Order   Citations   PageRank
Lei Zhang       1       271         22.04
Zhuo Zhang      2       57          8.77
Xue-Zhi Xiang   3       1           0.69