Title
Enhancing the alignment between target words and corresponding frames for video captioning
Abstract
• Visual tags are introduced to bridge the gap between vision and language.
• A textual-temporal attention model is devised and incorporated into the decoder to build exact alignment between target words and corresponding frames.
• Extensive experiments on two well-known datasets, i.e., MSVD and MSR-VTT, demonstrate that our proposed approach achieves remarkable improvements over the state-of-the-art methods.
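The highlights only name the mechanism; this record does not include the paper's formulation. As a rough illustration of the kind of temporal attention the second highlight refers to, the sketch below scores video frames against the current decoder state and forms a frame-weighted context vector. It uses standard additive attention; all names, dimensions, and the scoring function are assumptions for illustration, not the authors' exact textual-temporal model, which additionally conditions on textual (tag) information.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def temporal_attention(decoder_state, frame_feats, W_h, W_v, w_a):
    """Weight each frame by its relevance to the word being generated.

    decoder_state: (d_h,)          decoder hidden state at the current step
    frame_feats:   (n_frames, d_v) per-frame visual features
    W_h, W_v, w_a: learned projections of shapes (d_a, d_h), (d_a, d_v), (d_a,)
    Returns the attention-weighted context vector and the frame weights.
    """
    # Additive (Bahdanau-style) score: one scalar per frame.
    scores = np.tanh(frame_feats @ W_v.T + decoder_state @ W_h.T) @ w_a  # (n_frames,)
    alpha = softmax(scores)                                              # frame weights
    context = alpha @ frame_feats                                        # (d_v,)
    return context, alpha

# Toy usage with random tensors (shapes only; hypothetical dimensions).
rng = np.random.default_rng(0)
d_h, d_v, d_a, n_frames = 512, 1024, 256, 26
h_t = rng.standard_normal(d_h)
V = rng.standard_normal((n_frames, d_v))
ctx, alpha = temporal_attention(h_t, V,
                                rng.standard_normal((d_a, d_h)),
                                rng.standard_normal((d_a, d_v)),
                                rng.standard_normal(d_a))
print(ctx.shape, alpha.shape)  # (1024,) (26,)
```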
Year
2021
DOI
10.1016/j.patcog.2020.107702
Venue
Pattern Recognition
Keywords
Video captioning, Alignment, Visual tags, Textual-temporal attention
DocType
Journal
Volume
111
Issue
1
ISSN
0031-3203
Citations
4
PageRank
0.39
References
0
Authors
5
Name | Order | Citations | PageRank
Yunbin Tu | 1 | 30 | 2.85
Chang Zhou | 2 | 4 | 0.39
Junjun Guo | 3 | 5 | 4.47
Shengxiang Gao | 4 | 5 | 5.17
Zhengtao Yu | 5 | 460 | 69.08