Title
Dynamic Spatio-Temporal Modular Network for Video Question Answering
Abstract
Video Question Answering (VideoQA) aims to understand given videos and questions comprehensively by generating correct answers. However, existing methods usually rely on end-to-end black-box deep neural networks to infer the answers, which differs significantly from human logical reasoning and leaves these methods unable to explain their predictions. Besides, the performance of existing methods tends to drop when answering compositional questions involving realistic scenarios. To tackle these challenges, we propose a Dynamic Spatio-Temporal Modular Network (DSTN) model, which utilizes a spatio-temporal modular network to simulate the compositional reasoning procedure of human beings. Concretely, we divide the task of answering a given question into a set of sub-tasks focusing on certain key concepts in questions and videos, such as objects, actions, and temporal orders. Each sub-task can be solved with a separately designed module, e.g., a spatial attention module, temporal attention module, logic module, or answer module. We then dynamically assemble the modules assigned to the different sub-tasks into a tree-structured spatio-temporal modular neural network that performs human-like reasoning before producing the final answer to the question. We carry out extensive experiments on the AGQA dataset to demonstrate that our proposed DSTN model can significantly outperform several baseline methods in various settings. Moreover, we evaluate intermediate results and visualize each reasoning step to verify the rationality of the different modules and the explainability of the proposed DSTN model.
Year
2022
DOI
10.1145/3503161.3548061
Venue
International Multimedia Conference
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
5
Name            Order  Citations  PageRank
Zi Qian         1      0          0.34
Xin Wang        2      13515.87
Xuguang Duan    3      0          0.34
Chen Hong       4      2111.66
Wenwu Zhu       5      0          0.34