Diversified search evaluation: lessons from the NTCIR-9 INTENT task - Citegraph

Paper Info

Title
Diversified search evaluation: lessons from the NTCIR-9 INTENT task

Abstract
The evaluation of diversified web search results is a relatively new research topic and is not as well-understood as the time-honoured evaluation methodology of traditional IR based on precision and recall. In diversity evaluation, one topic may have more than one intent, and systems are expected to balance relevance and diversity. The recent NTCIR-9 evaluation workshop launched a new task called INTENT which included a diversified web search subtask that differs from the TREC web diversity task in several aspects: the choice of evaluation metrics, the use of intent popularity and per-intent graded relevance, and the use of topic sets that are twice as large as those of TREC. The objective of this study is to examine whether these differences are useful, using the actual data recently obtained from the NTCIR-9 INTENT task. Our main experimental findings are: (1) The D♯ evaluation framework used at NTCIR provides more "intuitive" and statistically reliable results than Intent-Aware Expected Reciprocal Rank; (2) Utilising both intent popularity and per-intent graded relevance as is done at NTCIR tends to improve discriminative power, particularly for D♯-nDCG; and (3) Reducing the topic set size, even by just 10 topics, can affect not only significance testing but also the entire system ranking; when 50 topics are used (as in TREC) instead of 100 (as in NTCIR), the system ranking can be substantially different from the original ranking and the discriminative power can be halved. These results suggest that the directions being explored at NTCIR are valuable.
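
The abstract names D♯-nDCG but does not define it. The sketch below follows the commonly cited formulation, which is an assumption here rather than something stated on this page: a linear blend of intent recall (I-rec) and D-nDCG, where each document's "global gain" is its per-intent graded gain weighted by intent popularity. The function name, the argument layout, and the default `gamma=0.5` are illustrative choices, not the task's official evaluation tooling.

```python
import math


def d_sharp_ndcg(ranked, gains, intent_probs, cutoff=10, gamma=0.5):
    """Sketch of D#-nDCG at a cutoff (assumed formulation, see lead-in).

    ranked       -- list of document ids, best first
    gains        -- {intent: {doc_id: graded gain}} (hypothetical layout)
    intent_probs -- {intent: popularity P(intent | topic)}
    """
    def global_gain(doc):
        # Popularity-weighted sum of per-intent graded gains.
        return sum(p * gains.get(i, {}).get(doc, 0.0)
                   for i, p in intent_probs.items())

    top = ranked[:cutoff]

    # I-rec: fraction of the topic's intents covered in the top results.
    covered = {i for i in intent_probs
               if any(gains.get(i, {}).get(d, 0.0) > 0 for d in top)}
    i_rec = len(covered) / len(intent_probs) if intent_probs else 0.0

    # D-nDCG: discounted global gain, normalised by an ideal list
    # obtained by sorting all judged documents by global gain.
    dcg = sum(global_gain(d) / math.log2(r + 1)
              for r, d in enumerate(top, start=1))
    judged = {d for per_intent in gains.values() for d in per_intent}
    ideal = sorted((global_gain(d) for d in judged), reverse=True)[:cutoff]
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal, start=1))
    d_ndcg = dcg / idcg if idcg > 0 else 0.0

    return gamma * i_rec + (1 - gamma) * d_ndcg
```

With `gamma=0.5`, a ranking that covers every intent and orders documents by global gain scores 1.0, while a ranking that is ideally ordered but misses half the intents loses a quarter of the score through the I-rec term.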

Year	2013
DOI	10.1007/s10791-012-9208-x
Venue	Inf. Retr.
Keywords	evaluation metrics,topic set,diversity evaluation,discriminative power,new research topic,per-intent graded relevance,time-honoured evaluation methodology,diversified search evaluation,evaluation framework,intent popularity,ntcir-9 intent task,ntcir-9 evaluation workshop
Field	Data mining,Learning to rank,Reciprocal,Significance testing,Ranking,Information retrieval,Computer science,Precision and recall,As is,Popularity,Discriminative model
DocType	Journal
Volume	16
Issue	4
ISSN	1573-7659
Citations	14
PageRank	0.71
References	24
Authors	2

Authors (2 rows)

Name	Order	Citations	PageRank
Tetsuya Sakai	1	1460	139.97
Ruihua Song	2	1138	59.33
