Title: Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation

Abstract:
The REVERB challenge provides a common framework for evaluating feature extraction techniques in the presence of both reverberation and additive background noise. State-of-the-art speech recognition systems perform well in controlled environments, but their performance degrades in realistic acoustic conditions, especially in real as well as simulated reverberant environments. In this contribution, we utilize multiple feature extractors, including the conventional mel-filterbank, multi-taper spectrum estimation-based mel-filterbank, robust mel and compressive gammachirp filterbank, iterative deconvolution-based dereverberated mel-filterbank, and maximum likelihood inverse filtering-based dereverberated mel-frequency cepstral coefficient features, for speech recognition with multi-condition training data. To improve speech recognition performance, we combine their outputs using ROVER (Recognizer Output Voting Error Reduction). For the two- and eight-channel tasks, to benefit from the multi-channel data, we also use ROVER, instead of multi-microphone signal processing, to reduce the word error rate by selecting the best-scoring word across channels. As in previous work, we also apply i-vector-based speaker adaptation, which was found to be effective. In a speech recognition task, speaker adaptation aims to reduce the mismatch between the training and test speakers. Speech recognition experiments are conducted on the REVERB challenge 2014 corpora using the Kaldi recognizer. In our experiments, we use both utterance-based batch processing and full batch processing. In the single-channel task, full batch processing reduced the word error rate (WER) from 10.0 % to 9.3 % on SimData compared to utterance-based batch processing. Using full batch processing, we obtained average WERs of 9.0 % and 23.4 % on SimData and RealData, respectively, for the two-channel task, whereas for the eight-channel task the average WERs on SimData and RealData were 8.9 % and 21.7 %, respectively.
Year: 2015
DOI: 10.1186/s13634-015-0238-6
Venue: EURASIP Journal on Advances in Signal Processing
Keywords: Speech recognition, Multiple window, Filterbank features, Dereverberation, I-vectors, DNN-HMM, GMM-HMM
Field: Signal processing, Speech processing, Pattern recognition, Computer science, Cepstrum, Word error rate, Filter (signal processing), Speech recognition, Feature extraction, Speaker recognition, Batch processing, Artificial intelligence
DocType: Journal
Volume: 2015
Issue: 1
ISSN: 1687-6180
Citations: 2
PageRank: 0.37
References: 25
Authors: 4
Name | Order | Citations | PageRank
Jahangir Alam | 1 | 320 | 38.69
Vishwa Gupta | 2 | 18 | 4.17
Patrick Kenny | 3 | 2700 | 214.80
Pierre Dumouchel | 4 | 34 | 5.61