Title
Aggregated Multimodal Bidirectional Recurrent Model For Audiovisual Speech Recognition
Abstract
Audiovisual Speech Recognition (AVSR), one of the most common applications of multimodal learning, uses both video and audio information to perform robust automatic speech recognition. Traditionally, AVSR was treated as a problem of inference and projection, which placed many restrictions on its capabilities. With deeper study, DNNs have become an important part of the toolkit for traditional classification tasks such as automatic speech recognition, image classification, and natural language processing. AVSR systems often use DNN models, including Multimodal Deep Autoencoders (MDAEs), the Multimodal Deep Belief Network (MDBN), and the Multimodal Deep Boltzmann Machine (MDBM), which generally outperform traditional methods. However, such DNN models have several shortcomings: first, they cannot balance modal fusion and temporal fusion, or even lack temporal fusion altogether; second, their architectures are not end-to-end, and training and testing are cumbersome. We designed a DNN model, the Aggregated Multimodal Bidirectional Recurrent Model (DILATE), to overcome these weaknesses. DILATE can not only be trained and tested simultaneously, but is also easy to train and prevents overfitting automatically. Experiments show that DILATE is superior to traditional methods and other DNN models on several benchmark datasets.
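The abstract does not spell out the DILATE architecture, but it names the key ingredients: per-modality encoding of audio and video, modal fusion, and temporal fusion via a bidirectional recurrent network. The PyTorch sketch below is a minimal illustration of that general pattern, not the authors' model: the feature dimensions, the concatenation-based fusion, and the BiLSTM-plus-mean-pooling classifier are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class AVBiRecurrentSketch(nn.Module):
    """Hypothetical audiovisual model: per-modality encoders, feature
    concatenation (modal fusion), and a bidirectional LSTM (temporal fusion).
    All dimensions are illustrative, not taken from the paper."""

    def __init__(self, audio_dim=40, video_dim=512, hidden=256, n_classes=10):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, hidden)   # audio feature encoder
        self.video_enc = nn.Linear(video_dim, hidden)   # video feature encoder
        self.birnn = nn.LSTM(2 * hidden, hidden,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio, video):
        # audio: (B, T, audio_dim); video: (B, T, video_dim), frame-aligned
        fused = torch.cat([torch.relu(self.audio_enc(audio)),
                           torch.relu(self.video_enc(video))], dim=-1)
        seq, _ = self.birnn(fused)               # temporal fusion, both directions
        return self.classifier(seq.mean(dim=1))  # utterance-level prediction

model = AVBiRecurrentSketch()
logits = model(torch.randn(2, 75, 40), torch.randn(2, 75, 512))
print(logits.shape)  # torch.Size([2, 10])
```

Concatenation is the simplest modal-fusion choice; the aggregation scheme DILATE actually uses may differ, and the paper itself should be consulted for the real architecture.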
Year
2018
DOI
10.1007/978-3-030-00021-9_35
Venue
CLOUD COMPUTING AND SECURITY, PT VI
Keywords
Multimodal deep learning, Audiovisual Speech Recognition
Field
Boltzmann machine, Inference, Computer science, Deep belief network, Speech recognition, Overfitting, Contextual image classification, Multimodal learning, Modal
DocType
Conference
Volume
11068
ISSN
0302-9743
Citations
0
PageRank
0.34
References
17
Authors
9
Name            Order  Citations  PageRank
Yu Wen          1      0          0.34
Ke Yao          2      0          0.34
Chunlin Tian    3      0          1.35
Yao Wu          4      41         5.71
Zhongmin Zhang  5      0          0.34
Yaning Shi      6      0          0.34
Yin Tian        7      0          0.34
Jin Yang        8      0          0.34
Peiqi Wang      9      11         2.52