Title
Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations
Abstract
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. It aims to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. Around this toolkit, our group has built a culture of reproducibility through shared norms and tools that enable rigorous automated testing.
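The three retrieval modes named in the abstract can be illustrated with a small, self-contained sketch. This is not Pyserini's API or implementation: the toy corpus, the hand-written vectors, the function names (`bm25_scores`, `dense_scores`, `hybrid_scores`), the parameter values, and the linear-interpolation fusion are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Toy corpus and vectors; all ids, texts, and numbers are made up for illustration.
DOCS = {
    "d1": "sparse retrieval with bm25 bag of words",
    "d2": "dense retrieval with transformer encoders",
    "d3": "hybrid retrieval combines sparse and dense scores",
}
DOC_VECS = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Sparse retrieval: score each document with a textbook-style BM25 formula."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def dense_scores(query_vec, doc_vecs):
    """Dense retrieval: brute-force inner product, standing in for nearest-neighbor search."""
    return {
        doc_id: sum(q * x for q, x in zip(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    }


def hybrid_scores(sparse, dense, alpha=0.5):
    """Hybrid retrieval: linear interpolation of the two score sets (one simple fusion choice)."""
    doc_ids = set(sparse) | set(dense)
    return {
        doc_id: alpha * sparse.get(doc_id, 0.0) + (1 - alpha) * dense.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


sparse = bm25_scores("hybrid retrieval", DOCS)
dense = dense_scores([0.6, 0.8], DOC_VECS)  # hypothetical encoded query
fused = hybrid_scores(sparse, dense)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

In a real system the document and query vectors would come from a transformer encoder and the search would run over an index rather than a Python dictionary; the sketch only shows how sparse and dense scores for the same documents can be combined into one ranking.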
Year: 2021
DOI: 10.1145/3404835.3463238
Venue: Research and Development in Information Retrieval
Keywords: Open-Source Search Engine, First-Stage Retrieval
DocType: Conference
Citations: 11
PageRank: 0.66
References: 0
Authors: 6
Name | Order | Citations | PageRank
Jimmy Lin | 1 | 4800 | 376.93
Xueguang Ma | 2 | 13 | 2.10
Sheng-Chieh Lin | 3 | 23 | 3.88
Jheng-Hong Yang | 4 | 42 | 4.66
Ronak Pradeep | 5 | 13 | 2.44
Rodrigo Nogueira | 6 | 16 | 3.23