Title
Reproduce. Generalize. Extend. On Information Retrieval Evaluation without Relevance Judgments
Abstract
The evaluation of retrieval effectiveness by means of test collections is a commonly used methodology in the information retrieval field. Some researchers have addressed the quite fascinating research question of whether it is possible to evaluate effectiveness completely automatically, without human relevance assessments. Since human relevance assessment is one of the main costs of building a test collection, in terms of both human time and money, this rather ambitious goal would have a practical impact. In this article, we reproduce the main results on evaluating information retrieval systems without relevance judgments; furthermore, we generalize such previous work to analyze the effect of test collections, evaluation metrics, and pool depth. We also expand the idea to semi-automatic evaluation and estimation of topic difficulty. Our results show that (i) previous work is overall reproducible, although some specific results are not; (ii) collection, metric, and pool depth impact the automatic evaluation of systems, which is nonetheless accurate in several cases; (iii) semi-automatic evaluation is an effective methodology; and (iv) automatic evaluation can (to some extent) be used to predict topic difficulty.
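The abstract refers to evaluating retrieval systems automatically, i.e., without human relevance judgments, and comparing the resulting system ranking with the one obtained from real judgments. Below is a minimal, hypothetical Python sketch of that general idea (not the authors' implementation or data): pseudo-relevance judgments are sampled at random from each topic's document pool, systems are scored with Mean Average Precision against them, and agreement with the human-based ranking is measured with Kendall's tau. All run and qrel structures are toy placeholders.

```python
import random
from scipy.stats import kendalltau

# Toy data (hypothetical): runs[system][topic] is a ranked list of doc ids.
runs = {
    "sysA": {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6", "d7", "d8"]},
    "sysB": {"t1": ["d3", "d1", "d4", "d2"], "t2": ["d7", "d8", "d5", "d6"]},
    "sysC": {"t1": ["d4", "d2", "d1", "d3"], "t2": ["d8", "d5", "d6", "d7"]},
}
# Human qrels, used only to compute the reference ("true") system ranking.
human_qrels = {"t1": {"d1", "d3"}, "t2": {"d6", "d7"}}

def average_precision(ranking, relevant):
    """Uninterpolated average precision of a single ranked list."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / max(len(relevant), 1)

def mean_ap(system_runs, qrels):
    """Mean Average Precision over all topics in qrels."""
    return sum(average_precision(system_runs[t], qrels[t]) for t in qrels) / len(qrels)

def pseudo_qrels(runs, rate=0.5, seed=42):
    """Sample a fraction of each topic's pooled documents as pseudo-relevant."""
    rng = random.Random(seed)
    topics = {t for r in runs.values() for t in r}
    qrels = {}
    for t in sorted(topics):
        pool = sorted({d for r in runs.values() for d in r[t]})
        qrels[t] = set(rng.sample(pool, max(1, int(rate * len(pool)))))
    return qrels

systems = sorted(runs)
human_scores = [mean_ap(runs[s], human_qrels) for s in systems]
auto_scores = [mean_ap(runs[s], pseudo_qrels(runs)) for s in systems]
tau, _ = kendalltau(human_scores, auto_scores)
print(f"Kendall's tau between human-based and automatic system rankings: {tau:.2f}")
```

A high tau would indicate that the automatic (judgment-free) evaluation ranks systems similarly to the human-based evaluation; the article studies how far this holds across collections, metrics, and pool depths.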
Year
2018
DOI
10.1145/3241064
Venue
Journal of Data and Information Quality
Keywords
Test collections, automatic retrieval evaluation, few topics, relevance judgments, reproducibility, topic difficulty
Field
Research question, Information retrieval, Computer science
DocType
Journal
Volume
10
Issue
3
ISSN
1936-1955
Citations
1
PageRank
0.35
References
36
Authors
4
Name               Order  Citations  PageRank
Kevin Roitero      1      30         13.74
Marco Passon       2      1          0.35
Giuseppe Serra     3      280        24.51
Stefano Mizzaro    4      862        85.52