Title
A unified approach to transfer learning of deep neural networks with applications to speaker adaptation in automatic speech recognition.
Abstract
In this paper, we present a unified approach to transfer learning of deep neural networks (DNNs) to address performance degradation caused by a potential acoustic mismatch between training and testing conditions due to inter-speaker variability in state-of-the-art connectionist (a.k.a. hybrid) automatic speech recognition (ASR) systems. Different schemes for transferring knowledge of deep neural networks related to speaker adaptation can be developed with ease under such a unifying concept, as demonstrated in the three frameworks investigated in this study. In the first solution, knowledge is transferred between homogeneous domains, namely the source and the target domains. Moreover, the transfer takes place in a sequential manner from the source to the target speaker to boost ASR accuracy on spoken utterances from a surprise target speaker. In the second solution, a multi-task approach is adopted to adjust the connectionist parameters and improve ASR performance on the target speaker. Knowledge is transferred simultaneously among heterogeneous tasks by adding one or more smaller auxiliary output layers to the original DNN structure. In the third solution, DNN output classes are organised into a hierarchical structure in order to adjust the connectionist parameters and close the gap between training and testing conditions by transferring prior knowledge from the root node to the leaves in a structural maximum a posteriori (MAP) fashion. Through a series of experiments on the Wall Street Journal (WSJ) speech recognition task, we show that the proposed solutions result in consistent and statistically significant word error rate reductions. Most importantly, we show that transfer learning is an enabling technology for speaker adaptation, since it outperforms both the transformation-based adaptation algorithms usually adopted in the speech community and the multi-condition training (MCT) schemes, which are data combination methods often adopted to cover more acoustic variability in speech when data from the source and target domains are both available at training time. Finally, experimental evidence demonstrates that all proposed solutions are robust to negative transfer, even when only a single sentence from the target speaker is available.
Highlights
A paradigm for transfer learning of deep neural networks in automatic speech recognition systems is presented.
Three different transfer learning solutions for deep neural networks are developed and tested on large datasets.
Experimental evidence shows that the proposed solutions outperform the state-of-the-art and avoid negative transfer.
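As an illustration of the second framework described above, the following minimal sketch shows the general multi-task idea of attaching a smaller auxiliary output layer to a shared DNN and adapting the shared connectionist parameters with a weighted sum of the primary (senone) and auxiliary cross-entropy losses. It assumes PyTorch; the layer sizes, the auxiliary target set, and the loss weight alpha are illustrative placeholders, not the configuration reported in the paper.

import torch
import torch.nn as nn

class MultiTaskAdaptDNN(nn.Module):
    """Shared trunk with a primary senone head and a smaller auxiliary head (illustrative sketch)."""
    def __init__(self, feat_dim=440, hidden_dim=1024, n_senones=3000, n_aux=42):
        super().__init__()
        # Shared hidden layers: the connectionist parameters being adapted.
        self.trunk = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
        )
        # Primary task: senone posteriors consumed by the hybrid ASR decoder.
        self.senone_head = nn.Linear(hidden_dim, n_senones)
        # Auxiliary task: a smaller output layer (e.g., broad phonetic classes).
        self.aux_head = nn.Linear(hidden_dim, n_aux)

    def forward(self, x):
        h = self.trunk(x)
        return self.senone_head(h), self.aux_head(h)

def multitask_loss(senone_logits, aux_logits, senone_y, aux_y, alpha=0.5):
    # Weighted sum of the two cross-entropy losses; alpha is a placeholder value.
    ce = nn.functional.cross_entropy
    return ce(senone_logits, senone_y) + alpha * ce(aux_logits, aux_y)

# One adaptation step on a small batch of target-speaker frames (dummy data).
model = MultiTaskAdaptDNN()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
feats = torch.randn(8, 440)                  # acoustic feature vectors
senone_y = torch.randint(0, 3000, (8,))      # frame-level senone labels
aux_y = torch.randint(0, 42, (8,))           # auxiliary labels
optimizer.zero_grad()
senone_logits, aux_logits = model(feats)
loss = multitask_loss(senone_logits, aux_logits, senone_y, aux_y)
loss.backward()
optimizer.step()

In this reading, the gradient from the small auxiliary head acts as a regulariser on the shared layers, which is one way an auxiliary output layer can help keep adaptation with very little target-speaker data from overfitting.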
Year
2016
DOI
10.1016/j.neucom.2016.09.018
Venue
Neurocomputing
Keywords
Transfer learning, Speaker adaptation, Deep neural network, Multi-task learning
Field
Multi-task learning, Negative transfer, Computer science, Transfer of learning, Word error rate, Speech recognition, Speaker recognition, Artificial intelligence, Speaker diarisation, Artificial neural network, Connectionism, Machine learning
DocType
Journal
Volume
218
Issue
C
ISSN
0925-2312
Citations
16
PageRank
0.60
References
36
Authors
3
Name                      Order  Citations  PageRank
Zhen Huang                1      100        11.60
Sabato Marco Siniscalchi  2      310        30.21
Chin-Hui Lee              3      6101       852.71