Title
Reproduce. Generalize. Extend. On Information Retrieval Evaluation without Relevance Judgments
Abstract
The evaluation of retrieval effectiveness by means of test collections is a commonly used methodology in the information retrieval field. Some researchers have addressed the quite fascinating research question of whether it is possible to evaluate effectiveness completely automatically, without human relevance assessments. Since human relevance assessment is one of the main costs of building a test collection, in terms of both human time and money, this rather ambitious goal would have a practical impact. In this article, we reproduce the main results on evaluating information retrieval systems without relevance judgments; furthermore, we generalize such previous work to analyze the effect of test collections, evaluation metrics, and pool depth. We also expand the idea to semi-automatic evaluation and estimation of topic difficulty. Our results show that (i) previous work is overall reproducible, although some specific results are not; (ii) collection, metric, and pool depth impact the automatic evaluation of systems, which is nonetheless accurate in several cases; (iii) semi-automatic evaluation is an effective methodology; and (iv) automatic evaluation can (to some extent) be used to predict topic difficulty.
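The abstract refers to evaluating retrieval systems automatically, i.e., without human relevance judgments, and comparing the resulting system ranking with the one obtained from real judgments. Below is a minimal, hypothetical Python sketch of that general idea (not the authors' implementation or data): pseudo-relevance judgments are sampled at random from each topic's document pool, systems are scored with Mean Average Precision against them, and agreement with the human-based ranking is measured with Kendall's tau. All run and qrel structures are toy placeholders.

```python
import random
from scipy.stats import kendalltau

# Toy data (hypothetical): runs[system][topic] is a ranked list of doc ids.
runs = {
    "sysA": {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6", "d7", "d8"]},
    "sysB": {"t1": ["d3", "d1", "d4", "d2"], "t2": ["d7", "d8", "d5", "d6"]},
    "sysC": {"t1": ["d4", "d2", "d1", "d3"], "t2": ["d8", "d5", "d6", "d7"]},
}
# Human qrels, used only to compute the reference ("true") system ranking.
human_qrels = {"t1": {"d1", "d3"}, "t2": {"d6", "d7"}}

def average_precision(ranking, relevant):
    """Uninterpolated average precision of a single ranked list."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / max(len(relevant), 1)

def mean_ap(system_runs, qrels):
    """Mean Average Precision over all topics in qrels."""
    return sum(average_precision(system_runs[t], qrels[t]) for t in qrels) / len(qrels)

def pseudo_qrels(runs, rate=0.5, seed=42):
    """Sample a fraction of each topic's pooled documents as pseudo-relevant."""
    rng = random.Random(seed)
    topics = {t for r in runs.values() for t in r}
    qrels = {}
    for t in sorted(topics):
        pool = sorted({d for r in runs.values() for d in r[t]})
        qrels[t] = set(rng.sample(pool, max(1, int(rate * len(pool)))))
    return qrels

systems = sorted(runs)
human_scores = [mean_ap(runs[s], human_qrels) for s in systems]
auto_scores = [mean_ap(runs[s], pseudo_qrels(runs)) for s in systems]
tau, _ = kendalltau(human_scores, auto_scores)
print(f"Kendall's tau between human-based and automatic system rankings: {tau:.2f}")
```

A high tau would indicate that the automatic (judgment-free) evaluation ranks systems similarly to the human-based evaluation; the article studies how far this holds across collections, metrics, and pool depths.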
Year
2018
DOI
10.1145/3241064
Venue
Journal of Data and Information Quality
Keywords
Test collections, automatic retrieval evaluation, few topics, relevance judgments, reproducibility, topic difficulty
Field
Research question, Information retrieval, Computer science
DocType
Journal
Volume
10
Issue
3
ISSN
1936-1955
Citations
1
PageRank
0.35
References
36
Authors
4
Name               Order  Citations  PageRank
Kevin Roitero      1      30         13.74
Marco Passon       2      1          0.35
Giuseppe Serra     3      280        24.51
Stefano Mizzaro    4      862        85.52