Title
N-gram IDF: A Global Term Weighting Scheme Based on Information Distance
Abstract
This paper first reveals the relationship between Inverse Document Frequency (IDF), a global term weighting scheme, and information distance, a universal metric defined by Kolmogorov complexity. We concretely give a theoretical explanation that the IDF of a term is equal to the distance between the term and the empty string in the space of information distance in which the Kolmogorov complexity is approximated using Web documents and the Shannon-Fano coding. Based on our findings, we propose N-gram IDF, a theoretical extension of IDF for handling words and phrases of any length. By comparing weights among N-grams of any N, N-gram IDF enables us to determine dominant N-grams among overlapping ones and extract key terms of any length from texts without using any NLP techniques. To efficiently compute the weight for all possible N-grams, we adopt two string processing techniques, i.e., maximal substring extraction using enhanced suffix array and document listing using wavelet tree. We conducted experiments on key term extraction and Web search query segmentation, and found that N-gram IDF was competitive with state-of-the-art methods that were designed for each application using additional resources and efforts. The results exemplified the potential of N-gram IDF.
Year
DOI
Venue
2015
10.1145/2736277.2741628
WWW
Keywords
Field
DocType
Term Weighting, IDF, Multiword Expression, MED, Information Distance, Kolmogorov Complexity
Data mining,Substring,Weighting,Empty string,Computer science,Artificial intelligence,Web search query,Pattern recognition,tf–idf,Kolmogorov complexity,Information distance,Suffix array,Machine learning
Conference
Citations 
PageRank 
References 
9
1.01
35
Authors
3
Name
Order
Citations
PageRank
Masumi Shirakawa1517.69
Takahiro Hara21819193.85
Shojiro Nishio31853374.68