Title
Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages
Abstract
The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages are ranked among the world’s 20 most spoken languages and they share similar syntax, morphology, and writing systems. In this article we examine these languages from an Information Retrieval (IR) perspective by describing the key elements of their inflectional and derivational morphologies, and suggest a light and a more aggressive stemming approach based on them. In our evaluation of these stemming strategies we make use of the FIRE 2008 test collections, and then to broaden our comparisons we implement and evaluate two language-independent indexing methods: the n-gram and trunc-n (truncation of the first n letters). We evaluate these solutions by applying various IR models, including the Okapi, Divergence from Randomness (DFR) and statistical language models (LM) together with two classical vector-space approaches: tf-idf and Lnu-ltc. Experiments performed with all three languages demonstrate that the I(ne)C2 model derived from the Divergence from Randomness paradigm tends to provide the best mean average precision (MAP). Our own tests suggest that improved retrieval effectiveness would be obtained by applying more aggressive stemmers, especially those accounting for certain derivational suffixes, compared to those involving a light stemmer or ignoring this type of word normalization procedure. Comparisons between no-stemming and stemming indexing schemes show that performance differences are almost always statistically significant. When, for example, an aggressive stemmer is applied, the relative improvements obtained are ~28% for the Hindi language, ~42% for Marathi, and ~18% for Bengali, as compared to a no-stemming approach.
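As a rough illustration of how a mean average precision and a relative improvement of this kind can be computed, here is a minimal Python sketch. It assumes the usual non-interpolated definition of average precision. The function names, document identifiers, and MAP values are hypothetical and are not taken from the article or from the FIRE 2008 collections.

```python
def average_precision(ranked, relevant):
    """Non-interpolated average precision for one query.

    ranked:   list of document ids in retrieval order
    relevant: set of document ids judged relevant for the query
    """
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs, qrels):
    """Mean of the per-query average precision over all judged queries."""
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)


def relative_improvement(baseline, improved):
    """Percentage change of `improved` relative to `baseline`."""
    return 100.0 * (improved - baseline) / baseline


# Toy data (hypothetical): two queries with made-up rankings and judgments.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d4"}}
map_score = mean_average_precision(runs, qrels)

# Made-up MAP values for a no-stemming and a stemming run.
gain = relative_improvement(0.20, 0.25)
```

With such a helper, a statement like "~28% for the Hindi language" would correspond to the stemmed run's MAP being about 1.28 times the no-stemming MAP.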
Based on a comparison of word-based and language-independent approaches we find that the trunc-4 indexing scheme tends to result in performance levels statistically similar to those of an aggressive stemmer, yet better than the 4-gram indexing scheme. A query-by-query analysis reveals the reasons for this, and also demonstrates the advantage of applying a stemming or a trunc-4 indexing scheme.
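The two language-independent schemes compared here can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: it assumes whitespace tokenization, overlapping character n-grams generated within each word, and that a "letter" is a single Unicode code point. For Devanagari or Bengali script, that last assumption may not match the article's definition, since combining signs are separate code points. All function names are hypothetical.

```python
def trunc_n(word, n=4):
    """trunc-n: keep only the first n characters of a word."""
    return word[:n]


def char_ngrams(word, n=4):
    """Overlapping character n-grams of a word.

    Words with at most n characters are kept whole (an assumption).
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text, scheme, n=4):
    """Turn a text into indexing terms under the chosen scheme."""
    tokens = text.split()  # whitespace tokenization (an assumption)
    if scheme == "trunc":
        return [trunc_n(t, n) for t in tokens]
    if scheme == "ngram":
        return [gram for t in tokens for gram in char_ngrams(t, n)]
    raise ValueError("unknown scheme: " + scheme)


trunc_terms = index_terms("indexing strategies", "trunc")
ngram_terms = index_terms("indexing strategies", "ngram")
```

Under these assumptions, trunc-4 emits one term per word, while the 4-gram scheme emits several overlapping terms per word, which gives a larger index and more chances for unrelated words to share a term.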
Year
2010
DOI
10.1145/1838745.1838748
Venue
ACM Trans. Asian Lang. Inf. Process.
Keywords
various indexing,marathi language,aggressive stemmers,light stemmer,bengali languages,search engines for asian languages,trunc-4 indexing scheme,4-gram indexing scheme,search strategies,stemmer,natural language processing with indo-european languages,comparative study,hindi language,bengali language,aggressive stemmer,indic languages,indexing scheme,language independent indexing method,measurement,search engine,performance,indexation,vector space,natural language processing,algorithms,statistical significance,information retrieval,mean average precision
Field
Ranking,tf–idf,Information retrieval,Computer science,Hindi,Search engine indexing,Bengali,Natural language processing,Artificial intelligence,Marathi,Syntax,Language model
DocType
Journal
Volume
9
Issue
3
Citations
14
PageRank
0.72
References
28
Authors
2
Name, Order, Citations, PageRank
Ljiljana Dolamic, 1, 125, 10.84
Jacques Savoy, 2, 1601, 169.85