Title
A new method to compose long unknown Chinese keywords
Abstract
A huge number of electronic documents are now stored on the internet. To retrieve information from this data, each document is commonly represented as a set of keywords, and all documents are then analysed on the basis of these discriminative words. Recognizing the words in an article is therefore an essential step in information retrieval; unlike English words, however, Chinese words are not separated by spaces, so many approaches have been devised to parse Chinese words. Most current systems use a dictionary-based approach to text segmentation. However, general-purpose dictionaries cannot always provide the references needed to parse domain-specific words accurately, especially unknown words. This paper proposes a new method for classifying longer keywords from Chinese documents by incorporating previously unknown keywords into a keyword list, without the effort of building domain-specific dictionaries. Our method first takes the parsed words produced by existing parsers and filters the keywords using term frequency-inverse document frequency (TF-IDF) values. Then, based on the parsed words and keywords, a T tree is used to store the candidates for composing unknown words. The candidates are evaluated against an unknown word (UW) coefficient threshold: a newly composed word is deemed a newly discovered unknown word if its UW coefficient is higher than a pre-defined threshold. Finally, the parsed words and newly composed words are re-filtered to form long keywords. Experiments comparing our results with those of Google and Yahoo show that the proposed method significantly outperforms the other methods in recall, precision and F-measure alike.
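The pipeline described in the abstract can be illustrated with a minimal sketch. It assumes the documents have already been segmented by an existing parser, and it covers only the TF-IDF filter and the composition of adjacent words. The abstract does not define the UW coefficient or the T tree layout, so the score used here (a simple adjacency ratio) is a hypothetical stand-in, a plain dictionary replaces the tree, and the function names and thresholds are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter


def tfidf_keywords(docs, threshold):
    """Return, for each document, the parsed words whose TF-IDF exceeds threshold.

    docs is a list of token lists produced by an existing segmenter.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document

    keywords_per_doc = []
    for tokens in docs:
        term_freq = Counter(tokens)
        total = len(tokens)
        keywords_per_doc.append({
            word for word, count in term_freq.items()
            if (count / total) * math.log(n_docs / doc_freq[word]) > threshold
        })
    return keywords_per_doc


def compose_candidates(docs, keywords, min_ratio, min_count=2):
    """Join adjacent parsed words into longer candidate words.

    ASSUMPTION: the score below is NOT the paper's UW coefficient. It is a
    stand-in: how often the pair occurs, relative to its rarer member.
    A pair is considered only if at least one member is a keyword.
    """
    word_count = Counter()
    pair_count = Counter()
    for tokens in docs:
        word_count.update(tokens)
        pair_count.update(zip(tokens, tokens[1:]))  # adjacent word pairs

    composed = set()
    for (left, right), count in pair_count.items():
        if left not in keywords and right not in keywords:
            continue
        if count < min_count:
            continue
        score = count / min(word_count[left], word_count[right])
        if score >= min_ratio:
            composed.add(left + right)
    return composed


# Toy corpus: three already-segmented documents.
docs = [
    ["資訊", "檢索", "系統", "資訊", "檢索"],
    ["系統", "設計", "方法"],
    ["方法", "系統", "分析"],
]
per_doc = tfidf_keywords(docs, threshold=0.3)
all_keywords = set().union(*per_doc)
long_words = compose_candidates(docs, all_keywords, min_ratio=0.8)
print(per_doc[0], long_words)
```

In this sketch, a word that appears in every document gets an IDF of zero and is never selected as a keyword, and the `min_count` guard prevents a pair seen only once from being accepted as a new word.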
Year
2012
DOI
10.1177/0165551512442481
Venue
J. Information Science
Keywords
Chinese word, unknown Chinese keyword, UW coefficient, parsed word, coefficient threshold, new method, Chinese document, unknown keyword, keywords utilizing term frequency-inverse, proposed method, unknown word
DocType
Journal
Volume
38
Issue
4
ISSN
0165-5515
Citations
2
PageRank
0.37
References
25
Authors
2
Name | Order | Citations | PageRank
Yu-Chin Liu | 1 | 12 | 3.96
Chun-Wei Lin | 2 | 2 | 0.37