Abstract | ||
---|---|---|
Stemmers attempt to reduce a word to its stem or root form and are widely used in information retrieval tasks to increase recall. Most popular stemmers encode a large number of language-specific rules developed over a long period. Such stemmers with comprehensive rules are available only for a few languages. In the absence of extensive linguistic resources for certain languages, statistical language processing tools have been successfully used to improve the performance of IR systems. In this article, we describe a clustering-based approach to discover equivalence classes of root words and their morphological variants. A set of string distance measures is defined, and the lexicon for a given text collection is clustered using the distance measures to identify these equivalence classes. The proposed approach is compared with Porter's and Lovins' stemmers on the AP and WSJ subcollections of the Tipster dataset using 200 queries. Its performance is comparable to that of Porter's and Lovins' stemmers, both in terms of average precision and the total number of relevant documents retrieved. The proposed stemming algorithm also provides consistent improvements in retrieval performance for French and Bengali, which are currently resource-poor. |
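
The clustering idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the article: the `prefix_distance` measure (unmatched length divided by the length of the common prefix), the use of complete-linkage merging, the function names, and the threshold value are all hypothetical stand-ins for the string distance measures and clustering procedure the authors define.

```python
from itertools import combinations


def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_distance(a, b):
    """Hypothetical distance: small when two words share a long prefix.

    Not one of the article's measures; chosen only to illustrate the idea.
    """
    m = common_prefix_len(a, b)
    if m == 0:
        return float("inf")  # no shared prefix: never merge
    return (max(len(a), len(b)) - m) / m


def complete_linkage(c1, c2):
    """Distance between clusters = largest pairwise word distance."""
    return max(prefix_distance(a, b) for a in c1 for b in c2)


def cluster_lexicon(words, threshold):
    """Greedy agglomerative clustering; stops once the closest pair of
    clusters is farther apart than `threshold` (an assumed parameter)."""
    clusters = [[w] for w in sorted(set(words))]
    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), complete_linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


lexicon = ["connect", "connected", "connecting", "connection",
           "retrieve", "retrieval", "retrieved"]
equivalence_classes = cluster_lexicon(lexicon, threshold=0.6)
```

Each resulting cluster plays the role of an equivalence class: at indexing and query time, every word in a cluster would be mapped to one representative (for example, the shortest member), which is how a corpus-derived stemmer can work without language-specific rules.
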
Year | DOI | Venue |
---|---|---|
2007 | 10.1145/1281485.1281489 | ACM Trans. Inf. Syst. |
Keywords | Field | DocType |
indian languages,distance measure,suffix stripper,french,root word,corpus,equivalence class,information retrieval task,retrieval performance,large number,clustering-based approach,popular stemmers,root form,clustering,string similarity,document retrieval,stemming,bengali | ENCODE,Suffix,Information retrieval,Computer science,Bengali,Lexicon,Natural language processing,Artificial intelligence,Equivalence class,String metric,Cluster analysis,Distance measures | Journal |
Volume | Issue | ISSN |
25 | 4 | 1046-8188 |
Citations | PageRank | References |
40 | 2.30 | 13 |
Authors | ||
6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Prasenjit Majumder | 1 | 173 | 25.15 |
Mandar Mitra | 2 | 3092 | 338.20 |
Swapan K. Parui | 3 | 549 | 59.24 |
Gobinda Kole | 4 | 40 | 2.30 |
Pabitra Mitra | 5 | 1729 | 126.79 |
Kalyankumar Datta | 6 | 50 | 3.60 |