| Abstract |
| --- |
| Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), which combines a convolutional encoder with multiple neural networks, called workers, tasked with solving self-supervised problems (i.e., ones that do not require manual annotations as ground truth). PASE was shown to capture relevant speech information, including speaker voice-print and phonemes. This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments. To this end, we employ an online speech distortion module that contaminates the input signals with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks. Finally, we refine the set of workers used in self-supervision to encourage better cooperation. Results on TIMIT, DIRHA and CHiME-5 show that PASE+ significantly outperforms both the previous version of PASE and common acoustic features. Interestingly, PASE+ learns transferable representations suitable for highly mismatched acoustic conditions. |
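To illustrate the kind of on-the-fly contamination the online speech distortion module performs, here is a minimal pure-Python sketch. All parameters, function names, and the specific disturbances (additive noise, a single echo as a stand-in for reverberation, clipping) are illustrative assumptions, not the paper's actual configuration.

```python
import math
import random

def distort(signal, rng, noise_snr_db=10.0, echo_delay=16,
            echo_gain=0.5, clip_factor=0.8):
    """Hypothetical online distortion: contaminate a waveform with
    random disturbances. Illustrative only; the paper's module uses
    its own set of disturbances and parameter ranges."""
    n = len(signal)
    # 1) Additive Gaussian noise scaled to a target SNR (in dB).
    sig_pow = sum(s * s for s in signal) / n + 1e-12
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noise_pow = sum(e * e for e in noise) / n + 1e-12
    scale = math.sqrt(sig_pow / (noise_pow * 10.0 ** (noise_snr_db / 10.0)))
    x = [s + scale * e for s, e in zip(signal, noise)]
    # 2) Crude reverberation: add one delayed, attenuated echo.
    x = [x[i] + (echo_gain * x[i - echo_delay] if i >= echo_delay else 0.0)
         for i in range(n)]
    # 3) Clipping at a fraction of the distorted signal's peak amplitude.
    limit = clip_factor * max(abs(v) for v in x)
    return [max(-limit, min(limit, v)) for v in x]
```

Because the contamination is applied online with fresh random draws, the encoder sees a different distorted version of each utterance at every epoch, which is what encourages noise-robust representations.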
Year | DOI | Venue |
---|---|---|
2020 | 10.1109/ICASSP40776.2020.9053569 | ICASSP |
DocType | Citations | PageRank
---|---|---
Conference | 1 | 0.37
References | Authors
---|---
0 | 7
Name | Order | Citations | PageRank |
---|---|---|---|
Mirco Ravanelli | 1 | 185 | 17.87 |
Jianyuan Zhong | 2 | 11 | 0.93
Santiago Pascual | 3 | 62 | 3.98 |
Pawel Swietojanski | 4 | 1 | 0.37
João Bosco Oliveira Monteiro | 5 | 24 | 8.87 |
Jan Trmal | 6 | 235 | 20.91 |
Yoshua Bengio | 7 | 42677 | 3039.83 |