Title
Semantic Tag Augmented XlanV Model for Video Captioning
Abstract
The key to video captioning is leveraging cross-modal information from both the vision and language perspectives. Rather than directly concatenating or attending over visual and linguistic features as in previous works, we propose to leverage semantic tags to bridge the gap between the two modalities. The semantic tags are the object tags and action tags detected in videos, which can be viewed as partial captions for the input video. To exploit the semantic tags effectively, we design a Semantic Tag augmented XlanV (ST-XlanV) model, which encodes four kinds of visual and semantic features with X-Linear Attention based cross-attention modules. Moreover, tag-related tasks are designed in the pre-training stage to help the model exploit cross-modal information more fruitfully. With the help of the semantic tags, the proposed model reaches 5th place in the pre-training for video captioning challenge. Our code will be available at: https://github.com/RubickH/ST-XlanV.
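The abstract gives no implementation details for the cross-attention modules. For readers unfamiliar with X-Linear attention, the following is a minimal sketch of a simplified, single-head bilinear-pooling cross-attention block in the spirit of X-Linear Attention Networks (Pan et al., CVPR 2020); the class, layer names, and dimensions are hypothetical and not taken from the ST-XlanV code.

```python
# Minimal sketch of an X-Linear style cross-attention block.
# All names and shapes are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class XLinearCrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.wq = nn.Linear(dim, dim)       # query embedding (e.g., tag/linguistic features)
        self.wk = nn.Linear(dim, dim)       # key embedding (e.g., visual features)
        self.wv = nn.Linear(dim, dim)       # value embedding
        self.spatial = nn.Linear(dim, 1)    # spatial attention logits
        self.channel = nn.Linear(dim, dim)  # channel-wise attention gates

    def forward(self, query, keys, values):
        # query: (B, D); keys, values: (B, N, D)
        q = torch.relu(self.wq(query)).unsqueeze(1)            # (B, 1, D)
        bk = torch.relu(self.wk(keys)) * q                     # bilinear key map, (B, N, D)
        bv = torch.relu(self.wv(values)) * q                   # bilinear value map, (B, N, D)
        alpha_s = F.softmax(self.spatial(bk), dim=1)           # (B, N, 1) spatial weights
        alpha_c = torch.sigmoid(self.channel(bk.mean(dim=1)))  # (B, D) channel gates
        attended = (alpha_s * bv).sum(dim=1)                   # (B, D) spatially attended value
        return alpha_c * attended                              # channel-gated output
```

In the paper's setting, such a block would presumably be applied pairwise across the four kinds of visual and semantic-tag features, but how the outputs are combined is not specified in this record.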
Year: 2021
DOI: 10.1145/3474085.3479228
Venue: International Multimedia Conference
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name             Order  Citations  PageRank
Yiqing Huang     1      85         23.41
Hongwei Xue      2      0          1.35
Jiansheng Chen   3      273        31.28
Huimin Ma        4      197        29.49
Hongbing Ma      5      0          0.68