Title
Decorrelating Feature Spaces for Learning General-Purpose Audio Representations
Abstract
Inspired by recent progress in self-supervised learning for computer vision, we introduce two new general-purpose audio representation learning approaches, DeLoRes-S and DeLoRes-M, through the DeLoRes (Decorrelating latent spaces for Low-Resource audio representation learning) framework. Our main objective is to learn representations in a resource-constrained setting (both data and compute) that generalize well across a diverse set of downstream tasks. Inspired by the Barlow Twins objective function, we propose learning embeddings that are invariant to distortions of an input audio sample while ensuring that they contain non-redundant information about the sample. We call this the DeLoRes learning framework, and we employ it in different ways in DeLoRes-S and DeLoRes-M. In our experiments, we learn audio representations with less than half the number of model parameters and 10% of the audio samples used by state-of-the-art algorithms, and we achieve state-of-the-art results on 7 out of 11 tasks in linear evaluation and 4 out of 11 tasks in the fine-tuning setup. In addition to being simple and intuitive, our pre-training procedure is, by construction, amenable to limited compute. Furthermore, we conduct extensive ablation studies on our training algorithm, model architecture, and results, and we make all our code and pre-trained models publicly available.
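The abstract describes the objective only in words: embeddings of two distorted views of the same audio sample should agree, while distinct embedding dimensions should carry non-redundant information. The following is a minimal NumPy sketch of a Barlow Twins-style loss that expresses this idea. It is not taken from the DeLoRes code: the function name `barlow_twins_loss`, the weight `lambda_offdiag`, its default value, and the stabilizer `eps` are all assumptions made for illustration.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-9):
    """Barlow Twins-style loss for two batches of embeddings of shape (N, D).

    z_a and z_b hold embeddings of two distorted views of the same N samples.
    lambda_offdiag (hypothetical default) weights the redundancy-reduction term.
    """
    n = z_a.shape[0]

    # Standardize every embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)

    # Cross-correlation matrix between the two views, shape (D, D).
    c = z_a.T @ z_b / n

    # Invariance term: diagonal entries are pushed toward 1.
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)

    # Redundancy-reduction term: off-diagonal entries are pushed toward 0.
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)

    return on_diag + lambda_offdiag * off_diag
```

A small usage example, with random arrays standing in for encoder outputs (the shapes are arbitrary):

```python
rng = np.random.default_rng(0)
view_a = rng.normal(size=(256, 8))
view_b = view_a + 0.1 * rng.normal(size=(256, 8))  # mildly distorted second view
print(barlow_twins_loss(view_a, view_b))
```

The loss is small when the two views produce nearly identical, decorrelated features and grows as the views disagree or as feature dimensions become correlated with one another.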
Year: 2022
DOI: 10.1109/JSTSP.2022.3202093
Venue: IEEE Journal of Selected Topics in Signal Processing
Keywords: Self-supervised learning, audio classification, representation learning
DocType: Journal
Volume: 16
Issue: 6
ISSN: 1932-4553
Citations: 0
PageRank: 0.34
References: 8
Authors: 4
Name               Order  Citations  PageRank
Sreyan Ghosh       1      0          2.37
Ashish Seth        2      5          3.27
Sandesh V Katta    3      0          0.34
Srinivasan Umesh   4      93         16.31