Abstract |
---|
The Deep Video Understanding Challenge (DVUC) aims to build a high-level understanding of video from multiple modalities, involving tasks such as relationship recognition and interaction detection. In this paper, we use a joint learning framework to simultaneously predict multiple tasks with visual, text, audio, and pose features. In addition, to answer the queries of DVUC, we design multiple answering strategies and use a video-language transformer that learns cross-modal information for matching videos with text choices. The final DVUC results show that our method ranks first in group one of the movie-level queries, and third in both group one and group two of the scene-level queries. |
Year | DOI | Venue |
---|---|---|
2022 | 10.1145/3503161.3551600 | International Multimedia Conference |
DocType | Citations | PageRank
---|---|---|
Conference | 0 | 0.34
References | Authors
---|---|
0 | 4
Name | Order | Citations | PageRank |
---|---|---|---|
Beibei Zhang | 1 | 0 | 0.34 |
Yaqun Fang | 2 | 0 | 0.68 |
Tongwei Ren | 3 | 328 | 30.22 |
Gangshan Wu | 4 | 275 | 36.63 |