Title
A cascade gray-stereo visual feature extraction method for visual and audio-visual speech recognition.
Abstract
This work develops a novel cascade feature extraction method for audio-visual speech recognition. It is the first to show that depth visual information can significantly boost visual speech recognition and the first to experimentally reveal the different characteristics of grey and depth visual features. It also introduces the first large-scale audio-visual speech corpus that contains depth information.
Although stereo information has recently been used extensively in computer vision tasks, its incorporation into Audio-Visual Speech Recognition (AVSR) systems, and whether it can improve recognition accuracy, remains largely unexplored. This paper addresses three fundamental issues in this area: 1) Do stereo features benefit visual and audio-visual speech recognition? 2) If so, how much information is embedded in stereo features? 3) How can both planar and stereo information be encoded in a compact feature vector? We present a comprehensive study of the characteristics of planar and stereo visual features and analyse in depth why stereo information can boost visual speech recognition. Based on the different information embedded in planar and stereo features, we present a new Cascade Hybrid Appearance Visual Feature (CHAVF) extraction scheme that combines planar and stereo visual information into a compact feature vector, and we evaluate this feature on visual and audio-visual connected digit recognition and isolated phrase recognition. The results show that stereo information significantly boosts speech recognition, and that the proposed visual feature outperforms other commonly used appearance-based visual features on both visual and audio-visual speech recognition tasks. In particular, the proposed planar-stereo visual feature yields approximately 21% relative improvement over the planar visual feature.
To the best of our knowledge, this is the first paper to extensively evaluate the different characteristics of planar and stereo visual features, and the first to show that using the stereo feature together with the planar feature can significantly boost accuracy on a large-scale audio-visual data corpus.
Year: 2017
DOI: 10.1016/j.specom.2017.01.005
Venue: Speech Communication
Keywords: Audio-visual speech recognition, Planar-stereo visual information, Hybrid-level visual feature
Field: Speech corpus, Computer science, Phrase, Audio-visual speech recognition, Artificial intelligence, Computer vision, ENCODE, Feature vector, Pattern recognition, Feature extraction, Speech recognition, Cascade, Boosting (machine learning)
DocType: Journal
Volume: 90
Issue: C
ISSN: 0167-6393
Citations: 2
PageRank: 0.36
References: 34
Authors: 3
Name
Order
Citations
PageRank
Chao Sui1131.25
Roberto Togneri281448.33
M. Bennamoun33197167.23