Abstract |
---|
We present the results of the first large-scale Turkish information retrieval experiments performed on a TREC-like test collection. The test bed, which has been created for this study, contains 95.5 million words, 408,305 documents, 72 ad hoc queries and has a size of about 800MB. All documents come from the Turkish newspaper Milliyet. We implement and apply simple to sophisticated stemmers and various query-document matching functions and show that truncating words at a prefix length of 5 creates an effective retrieval environment in Turkish. However, a lemmatizer-based stemmer provides significantly better effectiveness over a variety of matching functions. |
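
The prefix-truncation approach mentioned in the abstract (keeping the first 5 characters of each word) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `truncate_stem` and `stem_text`, the whitespace tokenization, and the Turkish-aware lowercasing step are all assumptions made for illustration.

```python
from typing import List

# Assumption: map Turkish dotted/dotless capital I explicitly, because
# Python's default str.lower() turns "I" into "i" rather than "ı".
_TURKISH_UPPER = str.maketrans({"I": "ı", "İ": "i"})


def truncate_stem(word: str, prefix_len: int = 5) -> str:
    """Lowercase a word and keep at most its first prefix_len characters."""
    lowered = word.translate(_TURKISH_UPPER).lower()
    return lowered[:prefix_len]


def stem_text(text: str, prefix_len: int = 5) -> List[str]:
    """Split on whitespace (a simplification) and truncate every token."""
    return [truncate_stem(token, prefix_len) for token in text.split()]


if __name__ == "__main__":
    # Inflected forms of "kitap" (book) collapse to the same index term.
    print(stem_text("kitap kitaplar kitaplarımız"))
```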

Year | DOI | Venue |
---|---|---|
2006 | 10.1145/1148170.1148288 | SIGIR |

Keywords | Field | DocType |
---|---|---|
million word,lemmatizer-based stemmer,turkish text,trec-like test collection,sophisticated stemmers,better effectiveness,effective retrieval environment,test bed,large-scale information retrieval experiment,prefix length,large-scale turkish information retrieval,turkish newspaper,information retrieval,stemming,lemmatizer | Lemmatisation,Data mining,Turkish,Query language,Computer science,Prefix,Newspaper,Natural language processing,Artificial intelligence,Wireless ad hoc network,Text processing,Information retrieval,Information technology | Conference |

ISBN | Citations | PageRank |
---|---|---|
1-59593-369-7 | 5 | 0.50 |

References | Authors |
---|---|
5 | 6 |

Name | Order | Citations | PageRank |
---|---|---|---|
Fazli Can | 1 | 581 | 94.63 |
Seyit Kocberber | 2 | 64 | 4.58 |
Erman Balcik | 3 | 26 | 1.54 |
Cihan Kaynak | 4 | 26 | 1.54 |
H. Cagdas Ocalan | 5 | 5 | 0.50 |
Onur M. Vursavas | 6 | 26 | 1.54 |