Title
Semantic Tag Augmented XlanV Model for Video Captioning
Abstract
The key to video captioning is leveraging cross-modal information from both the vision and language perspectives. Rather than directly concatenating or attending over visual and linguistic features as in previous works, we propose to leverage semantic tags to bridge the gap between the two modalities. The semantic tags are the object tags and action tags detected in videos, which can be viewed as partial captions for the input video. To exploit the semantic tags effectively, we design a Semantic Tag augmented XlanV (ST-XlanV) model, which encodes four kinds of visual and semantic features with X-Linear Attention based cross-attention modules. Moreover, tag-related tasks are designed in the pre-training stage to help the model exploit cross-modal information more fruitfully. With the help of the semantic tags, the proposed model reaches 5th place in the pre-training for video captioning challenge. Our code will be available at: https://github.com/RubickH/ST-XlanV.
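The abstract gives no implementation details for the cross-attention modules. For readers unfamiliar with X-Linear attention, the following is a minimal sketch of a simplified, single-head bilinear-pooling cross-attention block in the spirit of X-Linear Attention Networks (Pan et al., CVPR 2020); the class, layer names, and dimensions are hypothetical and not taken from the ST-XlanV code.

```python
# Minimal sketch of an X-Linear style cross-attention block.
# All names and shapes are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class XLinearCrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.wq = nn.Linear(dim, dim)       # query embedding (e.g., tag/linguistic features)
        self.wk = nn.Linear(dim, dim)       # key embedding (e.g., visual features)
        self.wv = nn.Linear(dim, dim)       # value embedding
        self.spatial = nn.Linear(dim, 1)    # spatial attention logits
        self.channel = nn.Linear(dim, dim)  # channel-wise attention gates

    def forward(self, query, keys, values):
        # query: (B, D); keys, values: (B, N, D)
        q = torch.relu(self.wq(query)).unsqueeze(1)            # (B, 1, D)
        bk = torch.relu(self.wk(keys)) * q                     # bilinear key map, (B, N, D)
        bv = torch.relu(self.wv(values)) * q                   # bilinear value map, (B, N, D)
        alpha_s = F.softmax(self.spatial(bk), dim=1)           # (B, N, 1) spatial weights
        alpha_c = torch.sigmoid(self.channel(bk.mean(dim=1)))  # (B, D) channel gates
        attended = (alpha_s * bv).sum(dim=1)                   # (B, D) spatially attended value
        return alpha_c * attended                              # channel-gated output
```

In the paper's setting, such a block would presumably be applied pairwise across the four kinds of visual and semantic-tag features, but how the outputs are combined is not specified in this record.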
Year: 2021
DOI: 10.1145/3474085.3479228
Venue: International Multimedia Conference
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name             Order  Citations  PageRank
Yiqing Huang     1      85         23.41
Hongwei Xue      2      0          1.35
Jiansheng Chen   3      273        31.28
Huimin Ma        4      197        29.49
Hongbing Ma      5      0          0.68