Abstract |
---|
The Deep Video Understanding Challenge (DVUC) aims to build a high-level understanding of video from multiple modalities, involving tasks such as relationship recognition and interaction detection. In this paper, we use a joint learning framework to simultaneously predict multiple tasks with visual, text, audio, and pose features. In addition, to answer the queries of DVUC, we design multiple answering strategies and use a video-language transformer that learns cross-modal information for matching videos with text choices. The final DVUC results show that our method ranks first in group one of the movie-level queries, and third in both group one and group two of the scene-level queries. |
Year | DOI | Venue |
---|---|---|
2022 | 10.1145/3503161.3551600 | International Multimedia Conference |
DocType | Citations | PageRank
---|---|---|
Conference | 0 | 0.34
References | Authors
---|---|
0 | 4
Name | Order | Citations | PageRank |
---|---|---|---|
Beibei Zhang | 1 | 0 | 0.34 |
Yaqun Fang | 2 | 0 | 0.68 |
Tongwei Ren | 3 | 328 | 30.22 |
Gangshan Wu | 4 | 275 | 36.63 |