Title
How to build high quality L2R training data: Unsupervised compression-based selective sampling for learning to rank
Abstract
Learning to Rank (L2R) improves ranking quality but relies on the existence of manually labeled training sets, which are expensive and cumbersome to generate. Using automated labeling (e.g., clickthrough data) imposes its own challenges. Active learning (AL) can be used to gather high-quality training data by producing very informative yet small training sets. Cover, a method we have previously developed, allows for unsupervised sampling of training sets as good as those created using AL. In this paper we provide an extensive analysis of how and why Cover works. We revisit the method in a more formal way, with theorems and proofs, and provide additional empirical evidence of its practicality. We answer questions related to why Cover works so well and how its properties are related to AL methods. We show how certain characteristics of Cover’s clustering step allow it to more thoroughly explore the feature space by selecting query-document pairs that are representative and diverse, allowing L2R methods to produce effective models. The main novel contribution is a detailed analysis of the method’s inner workings and information-theoretic properties, allowing us to advance the understanding of L2R fundamentals through the lens of training set building.
Year
2022
DOI
10.1016/j.ins.2022.04.012
Venue
Information Sciences
Keywords
Active learning, Learning to rank, Ranking dataset creation, Dataset compression
DocType
Journal
Volume
601
ISSN
0020-0255
Citations
0
PageRank
0.34
References
0
Authors
4
Name | Order | Citations | PageRank
Rodrigo M. Silva | 1 | 0 | 0.34
Guilherme C. M. Gomes | 2 | 0 | 0.34
Mario S. Alvim | 3 | 0 | 0.34
Marcos André Gonçalves | 4 | 2740 | 191.03