Finding the better indexing units for Chinese information retrieval

Paper Info

Title
Finding the better indexing units for Chinese information retrieval

Abstract
In Chinese information retrieval (IR), processing documents and queries requires identifying the units that are used as indexes. Words and n-grams have been used as indexes in several previous studies, which showed that both kinds of indexes lead to comparable IR performance. In this study, we carried out further experiments to find the better way to index Chinese texts. First, we investigated the impact of word segmentation accuracy on IR performance. Second, we examined in detail fifteen different groups of indexing units, formed from the possible combinations of words and character n-grams. Experiments showed that better segmentation results in better IR performance, and that a combination of words with uni-grams is the better choice for indexing Chinese texts for IR.
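
To make the idea of combining words with character n-grams concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the text has already been split into words by some segmenter; the function names `char_ngrams` and `index_units`, the `ngram_sizes` parameter, and the sample words are all hypothetical and chosen only for illustration.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of length n from text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_units(segmented_words, ngram_sizes=(1,)):
    """Combine word units with character n-grams of the given sizes.

    segmented_words: list of words from a word segmenter (assumed given).
    ngram_sizes: which n-gram lengths to add; (1,) means words + uni-grams.
    """
    text = "".join(segmented_words)
    units = list(segmented_words)  # word-level indexing units
    for n in ngram_sizes:
        units.extend(char_ngrams(text, n))  # character-level indexing units
    return units


# Hypothetical example: "信息检索" segmented into two words.
words = ["信息", "检索"]
units = index_units(words, ngram_sizes=(1,))
```

Passing a different `ngram_sizes` value, such as `(1, 2)`, yields another combination of unit types; the specific fifteen groups examined in the paper are not listed in the abstract, so they are not reproduced here.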

Year	DOI	Venue
2002	10.3115/1118824.1118828	SIGHAN@COLING
Keywords	Field	DocType
ir performance,better indexing unit,character n-grams,index chinese text,word segmentation,better ir performance,better choice,chinese document,better segmentation result,chinese information retrieval,comparable ir performance,possible combination	Information retrieval,Segmentation,Computer science,Search engine indexing,Text segmentation,Artificial intelligence,Natural language processing	Conference
Citations	PageRank	References
3	0.41	5
Authors
4

Authors (4 rows)

Name	Order	Citations	PageRank
Hongzhao He	1	74	4.92
Pilian He	2	29	7.46
Jianfeng Gao	3	5729	296.43
Huang Changning	4	628	48.12