Title
Improving children's mismatched ASR using structured low-rank feature projection.
Abstract
This paper explores the problems that arise in automatic speech recognition (ASR) of children’s speech when the acoustic models are trained on adults’ speech. In such conditions, the large acoustic mismatch between training and test data severely degrades recognition rates. Even with vocal tract length normalization (VTLN), recognition performance in the mismatched case remains well below that of the matched case. Our earlier studies have shown that, for commonly used mel-filterbank-based cepstral features, the acoustic mismatch is exacerbated by insufficient smoothing of pitch harmonics for child speakers. To address this problem, this paper proposes a structured low-rank projection of the feature vectors, applied both before learning the acoustic models and before decoding. First, a low-rank transform is learned on the training data (adults’ speech). Any dimensionality reduction technique that depends on the variance of the training data may be used for this purpose; in this work, principal component analysis (PCA) and heteroscedastic linear discriminant analysis (HLDA) are explored. When the derived low-rank projection is applied in the mismatched testing case, it alleviates the pitch-dependent mismatch. For children’s mismatched ASR with acoustic models based on hidden Markov models (HMM) whose observation densities are modeled using Gaussian mixture models (GMM), the proposed approach yields a relative improvement in recognition performance of 35% over the baseline that includes VTLN. Acoustic modeling approaches based on subspace GMM (SGMM) and deep neural networks (DNN) have also been explored, and projecting the data to a lower-dimensional subspace is found to be effective in those frameworks as well. For the SGMM- and DNN-based systems, the proposed approach yields relative improvements in recognition performance of 33% and 21%, respectively, over their corresponding baselines.
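The core idea of the abstract, learning a variance-based low-rank transform on the training (adult) features and applying that same transform to the test (child) features, can be sketched with plain PCA as follows. This is only an illustrative sketch and not the authors' implementation: the 39-dimensional features, the rank of 20, the random placeholder data, and the function names `learn_pca_projection` and `project` are all assumptions made for the example. The paper's structured projection and its HLDA variant are not reproduced here.

```python
import numpy as np


def learn_pca_projection(train_feats, rank):
    """Learn a rank-`rank` PCA basis from training features (n_frames x dim)."""
    mean = train_feats.mean(axis=0)
    centered = train_feats - mean
    cov = np.cov(centered, rowvar=False)
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the `rank` directions with the largest training variance.
    order = np.argsort(eigvals)[::-1][:rank]
    return mean, eigvecs[:, order]


def project(feats, mean, basis):
    """Apply the transform learned on training data to any feature matrix."""
    return (feats - mean) @ basis


# Placeholder data standing in for real acoustic features (assumption).
rng = np.random.default_rng(0)
adult_train = rng.normal(size=(1000, 39))  # hypothetical adult training frames
child_test = rng.normal(size=(200, 39))    # hypothetical child test frames

# The transform is estimated on the adult training data only ...
mean, basis = learn_pca_projection(adult_train, rank=20)

# ... and applied before model training as well as before decoding.
train_low = project(adult_train, mean, basis)
test_low = project(child_test, mean, basis)
```

The point the sketch is meant to show is that the test features are never used to estimate the transform: the mean and basis come from the training data alone, which matches the abstract's description of learning the low-rank transform on adults' speech.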
Year: 2018
DOI: 10.1016/j.specom.2018.11.001
Venue: Speech Communication
Keywords: Children’s speech recognition, Pitch variation, Low-rank feature projection, PCA, HLDA, SGMM, DNN
Field: Normalization (statistics), Dimensionality reduction, Pattern recognition, Subspace topology, Computer science, Cepstrum, Speech recognition, Smoothing, Test data, Artificial intelligence, Hidden Markov model, Mixture model
DocType: Journal
Volume: 105
ISSN: 0167-6393
Citations: 2
PageRank: 0.45
References: 32
Authors: 4
Name                     Order   Citations   PageRank
S. Shahnawazuddin        1       64          17.34
Hemant Kumar Kathania    2       19          4.27
Abhishek Dey             3       5           0.85
Rohit Sinha              4       231         30.54