Title
Multimodal Analysis for Deep Video Understanding with Video Language Transformer
Abstract
The Deep Video Understanding Challenge (DVUC) aims to use information from multiple modalities to build a high-level understanding of video, involving tasks such as relationship recognition and interaction detection. In this paper, we use a joint learning framework to simultaneously predict multiple tasks with visual, text, audio, and pose features. In addition, to answer the queries of the DVUC, we design multiple answering strategies and use a video language transformer that learns cross-modal information to match videos with text choices. The final DVUC results show that our method ranks first in group one of the movie-level queries, and ranks third in both group one and group two of the scene-level queries.
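The matching step described in the abstract, scoring candidate text answers against a video through a cross-modal transformer, can be sketched as follows. This is a minimal illustration under assumed details, not the authors' implementation: the class name VideoLanguageMatcher, the feature dimensions, and the pooled match token are all assumptions.

```python
# Minimal sketch (not the paper's code) of cross-modal video-text matching:
# frame features and text token features are projected into a shared space,
# concatenated, and passed through a transformer encoder; a score head then
# ranks each text choice for the video. All sizes here are illustrative.
import torch
import torch.nn as nn

class VideoLanguageMatcher(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, d_model=512,
                 n_layers=4, n_heads=8):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)  # project frame features
        self.text_proj = nn.Linear(text_dim, d_model)    # project token features
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # pooled match token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.score = nn.Linear(d_model, 1)               # matching score head

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, video_dim); text_feats: (B, L, text_dim)
        v = self.video_proj(video_feats)
        t = self.text_proj(text_feats)
        cls = self.cls.expand(v.size(0), -1, -1)
        x = torch.cat([cls, v, t], dim=1)       # joint cross-modal sequence
        h = self.encoder(x)
        return self.score(h[:, 0]).squeeze(-1)  # one match score per pair

# Rank four candidate text choices for one clip (random stand-in features).
model = VideoLanguageMatcher()
video = torch.randn(4, 16, 2048)  # the same clip, repeated once per choice
texts = torch.randn(4, 20, 768)   # one encoded text choice per row
best_choice = model(video, texts).argmax().item()
```

In this sketch the answer is the choice with the highest match score; the multiple answering strategies mentioned in the abstract would sit on top of such scores.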
Year
2022
DOI
10.1145/3503161.3551600
Venue
International Multimedia Conference
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
4
Name          Order  Citations  PageRank
Beibei Zhang  1      0          0.34
Yaqun Fang    2      0          0.68
Tongwei Ren   3      328        30.22
Gangshan Wu   4      275        36.63