Title
XlanV Model with Adaptively Multi-Modality Feature Fusing for Video Captioning
Abstract
Dynamic features extracted by 3D convolutional networks and static features extracted by 2D CNNs have both proved beneficial for video captioning. We adaptively fuse these two kinds of features in an X-Linear Attention Network and propose the XlanV model for video captioning. However, we observe that dynamic features are not compatible with vision-language pre-training techniques when the frame-length distributions and average pixel differences of the training and test videos diverge. Consequently, we train the XlanV model directly on the MSR-VTT dataset without pre-training on the GIF dataset in this challenge. The proposed XlanV model reaches 1st place in the pre-training for video captioning challenge, which shows that thoroughly exploiting dynamic features is more effective than vision-language pre-training in this setting.
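The abstract does not spell out the fusion mechanism. Below is a minimal PyTorch sketch of one plausible reading of "adaptively fusing" the static (2D-CNN) and dynamic (3D-CNN) streams: a learned, per-dimension sigmoid gate that blends the two modalities. The class name AdaptiveFusion, the feature dimensions (2048 static, 1024 dynamic), and the gating form are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Hypothetical gated fusion of static (2D-CNN) and dynamic (3D-CNN)
    video features. The paper's abstract only states that XlanV fuses the
    two modalities adaptively; this sketch realizes one common form of
    adaptive fusion: a learned, per-dimension convex blend."""

    def __init__(self, static_dim=2048, dynamic_dim=1024, hidden_dim=1024):
        super().__init__()
        # Project both modalities into a shared hidden space.
        self.static_proj = nn.Linear(static_dim, hidden_dim)
        self.dynamic_proj = nn.Linear(dynamic_dim, hidden_dim)
        # Gate decides, per dimension, how much each modality contributes.
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Sigmoid(),
        )

    def forward(self, static_feat, dynamic_feat):
        s = self.static_proj(static_feat)    # (batch, steps, hidden)
        d = self.dynamic_proj(dynamic_feat)  # (batch, steps, hidden); assumes aligned lengths
        g = self.gate(torch.cat([s, d], dim=-1))
        return g * s + (1.0 - g) * d         # per-dimension convex blend

if __name__ == "__main__":
    fusion = AdaptiveFusion()
    static = torch.randn(2, 20, 2048)   # e.g. per-frame 2D-CNN features (assumed shape)
    dynamic = torch.randn(2, 20, 1024)  # e.g. per-clip 3D-CNN features (assumed shape)
    print(fusion(static, dynamic).shape)  # torch.Size([2, 20, 1024])
```

The fused representation would then feed the X-Linear attention captioning decoder; the gate lets the model lean on static appearance features or dynamic motion features on a per-dimension basis rather than committing to a fixed weighting.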
Year
2020
DOI
10.1145/3394171.3416290
Venue
MM '20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020
DocType
Conference
ISBN
978-1-4503-7988-5
Citations
0
PageRank
0.34
References
0
Authors
4
Name            Order  Citations  PageRank
Yiqing Huang    1      85         23.41
Qiuyu Cai       2      0          0.34
Siyu Xu         3      0          0.34
Jiansheng Chen  4      273        31.28