Title |
---|
Enhancing the alignment between target words and corresponding frames for video captioning |
Abstract (highlights) |
---|
Visual tags are introduced to bridge the gap between vision and language. A textual-temporal attention model is devised and incorporated into the decoder to build exact alignment between target words and corresponding frames. Extensive experiments on two well-known datasets, i.e., MSVD and MSR-VTT, demonstrate that the proposed approach achieves remarkable improvements over state-of-the-art methods. |
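The highlights mention a textual-temporal attention model in the decoder but give no formulation. As a rough, illustrative sketch only, standard temporal attention over per-frame features conditioned on the decoder state (which the paper extends with textual conditioning on target words) can be written as follows; all names (`temporal_attention`, `W_f`, `W_h`, `w`) are hypothetical, not taken from the paper:

```python
import numpy as np

def temporal_attention(frame_feats, decoder_state, W_f, W_h, w):
    """Illustrative temporal attention (not the paper's exact model):
    score each frame against the current decoder state, softmax the
    scores over time, and return the weighted context vector."""
    # frame_feats: (T, d_f) per-frame features; decoder_state: (d_h,)
    scores = np.tanh(frame_feats @ W_f + decoder_state @ W_h) @ w  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over the T frames
    context = weights @ frame_feats     # (d_f,) attention-weighted sum
    return context, weights
```

At each decoding step the context vector would be fed to the word predictor, so different target words can attend to different frames; the paper's textual-temporal variant additionally uses textual cues to sharpen this word-to-frame alignment.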
Year | DOI | Venue
---|---|---
2021 | 10.1016/j.patcog.2020.107702 | Pattern Recognition

Keywords | DocType | Volume
---|---|---
Video captioning, Alignment, Visual tags, Textual-temporal attention | Journal | 111

Issue | ISSN | Citations
---|---|---
1 | 0031-3203 | 4

PageRank | References | Authors
---|---|---
0.39 | 0 | 5
Name | Order | Citations | PageRank
---|---|---|---
Yunbin Tu | 1 | 30 | 2.85 |
Chang Zhou | 2 | 4 | 0.39 |
Junjun Guo | 3 | 5 | 4.47 |
Shengxiang Gao | 4 | 5 | 5.17 |
Zhengtao Yu | 5 | 460 | 69.08 |