Title
End-To-End Audio Visual Scene-Aware Dialog Using Multimodal Attention-Based Video Features
Abstract
In order for machines interacting with the real world to have conversations with users about the objects and events around them, they need to understand dynamic audiovisual scenes. The recent revolution of neural network models allows us to combine various modules into a single end-to-end differentiable network. As a result, Audio Visual Scene-Aware Dialog (AVSD) systems for real-world applications can be developed by integrating state-of-the-art technologies from multiple research areas, including end-to-end dialog technologies, visual question answering (VQA) technologies, and video description technologies. In this paper, we introduce a new data set of dialogs about videos of human behaviors, as well as an end-to-end Audio Visual Scene-Aware Dialog (AVSD) model, trained using this new data set, that generates responses in a dialog about a video. By using features that were developed for multimodal attention-based video description, our system improves the quality of generated dialog about dynamic video scenes.
Year
2018
Venue
2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Keywords
Audio visual scene-aware dialog, Visual QA, Video description, End-to-end modeling
Field
Dialog box, Mel-frequency cepstrum, Question answering, Pattern recognition, Computer science, Visualization, Speech recognition, Feature extraction, Human behavior, Artificial intelligence, Artificial neural network, Encoding (memory)
DocType
Journal
Volume
abs/1806.08409
ISSN
1520-6149
Citations
5
PageRank
0.46
References
8
Authors
13
Name                    Order  Citations  PageRank
Chiori Hori             1      439        61.06
Huda AlAmri             2      8          1.21
Jue Wang                3      18         3.78
Gordon Wichern          4      93         14.97
Takaaki Hori            5      408        45.58
Anoop Cherian           6      231        20.90
Tim K. Marks            7      281        19.41
Vincent Cartillier      8      7          1.22
Raphael Gontijo Lopes   9      18         3.34
Abhishek Das            10     433        23.54
Irfan A. Essa           11     4876       580.85
Dhruv Batra             12     2142       104.81
Devi Parikh             13     2929       132.01